2008/11/06
Spoken Document Retrieval and Summarization
語音文件檢索與摘要
Berlin Chen (陳柏琳), Associate Professor
Department of Computer Science & Information Engineering, National Taiwan Normal University
2
Current Research Interests
• Automatic Speech Recognition
– Feature Extraction: discriminative and robust features
– Acoustic Modeling: discriminative and unsupervised training
– Language Modeling: beyond n-grams, discriminative and unsupervised training
– Search: acoustic look-ahead techniques
• Speech Retrieval, Summarization and Mining
– Robust indexing mechanisms and trainable retrieval models
– Speech summarization exploring probabilistic generative frameworks (beyond heuristic approaches)
3
Outline
• Introduction
• Related Research Work and Applications
• Key Techniques Developed at NTNU
– Automatic Speech Recognition
– Spoken Document Retrieval
– Spoken Document Summarization
• Prototype Systems Developed at NTNU
• Conclusions and Future Work
4
Introduction (1/3)
• Multimedia (audio-visual content) associated with speech is continuously growing and filling our computers, networks and lives
– For example, broadcast news, lectures, shows, voice mails, (contact-center) conversations, etc.
– Speech is the most semantic (or information)-bearing medium
• On the other hand, speech is the primary and most convenient means of communication between people
– Speech provides a better (more natural) user interface in wireless environments, especially on smaller hand-held devices
• Speech will be the key for multimedia information access in the near future
5
Introduction (2/3)
[Figure: scenario for multimedia information access using speech. Users interact over networks with multimedia network content through multimodal interaction (spoken dialogues, information extraction and retrieval) and multimedia content processing (spoken document understanding and organization). Processing modules applied to a Chinese broadcast news archive: named entity extraction, segmentation, topic analysis and organization, summarization, title generation, and information retrieval; given an input query and user instructions, the system returns retrieval results, titles/summaries, and a two-dimensional tree structure for organized topics. A companion diagram relates documents d_1 ... d_N to a query Q = t_1 t_2 ... t_n through K latent topics T_1 ... T_K via the probabilities P(d_i), P(T_k | d_i) and P(t_j | T_k).]
• Scenario for Multimedia Information Access Using Speech
6
Introduction (3/3)
• Organization and retrieval of multimedia (or spoken) documents are much more difficult
– Written text documents are better structured and easier to browse through
• Provided with titles and other structural information
• Easily shown on the screen to glance through (with visual perception)
– Multimedia (spoken) documents are just video (audio) signals
• Users cannot efficiently go through each one from beginning to end during browsing, even if they are automatically transcribed by automatic speech recognition
• However, the abundant speaker, emotion and scene information makes them much more attractive than text
• Better approaches for efficient organization and retrieval of multimedia (spoken) documents are in demand
7
Related Research Work and Applications
• Continuous and substantial efforts have been devoted to (multimedia) spoken document recognition, retrieval and organization in the recent past
– Informedia System at Carnegie Mellon Univ.
– AT&T SCAN System
– Rough’n’Ready System at BBN Technologies
– SpeechBot Audio/Video Search System at HP Labs
– IBM Spoken Document Retrieval for Call-Center Conversations, Natural Language Call-Routing, Voicemail Retrieval
– NTT Speech Communication Technology for Contact Centers
– Google Voice Search
– MIT Lecture Browser
8
Key Techniques (1/2)
• Automatic Speech Recognition (ASR)
– Automatically convert speech signals into sequences of words or other suitable units for further processing
• Spoken Document Segmentation
– Automatically segment speech signals (or automatically transcribed word sequences) into a set of documents (or short paragraphs), each of which has a central topic
• Named Entity Extraction from Spoken Documents
– Personal names, organization names, location names, event names
– Very often out-of-vocabulary (OOV) words, difficult for recognition
• E.g., “蔡煌郎”, “九二共識”, etc.
• Audio Indexing and Retrieval
– Robust representation of the spoken documents
– Matching between (spoken) queries and spoken documents
Lee and Chen, “Spoken document understanding and organization,” IEEE Signal Processing Magazine 22 (5), Sept. 2005
9
Key Techniques (2/2)
• Summarization for Spoken Documents
– Automatically generate a summary (in text or speech form) for each spoken document or a set of topic-coherent documents
• Title Generation for Multimedia/Spoken Documents
– Automatically generate a title (in text/speech form) for each short document; i.e., a very concise summary indicating the themes of the document
• Topic Analysis and Organization for Spoken Documents
– Analyze the subject topics of (retrieved) documents
– Organize the subject topics of documents into graphic structures for efficient browsing
• Information Extraction for Spoken Documents
– Extract key information such as who, when, where, what and how from the information described by spoken documents
10
An Example System for Chinese Broadcast News (1/2)
• For example, a prototype system developed at NTU for efficient spoken document retrieval and browsing [R4]
11
• Users can browse spoken documents in top-down and bottom-up manners
http://sovideo.iis.sinica.edu.tw/NeGSST/Index.htm
An Example System for Chinese Broadcast News (2/2)
12
Automatic Speech Recognition (1/3)
• Large Vocabulary Continuous Speech Recognition (LVCSR)
– The speech signal is converted into a sequence of feature vectors
– The pronunciation lexicon is structured as a tree
– Due to the constraints of n-gram language modeling, a word’s occurrence depends on its previous n−1 words
– Search through all possible lexical tree copies (word hypotheses) from the start time to the end time of the utterance to find the best sequence among the word hypotheses
Given the acoustic observations O, the n-gram approximation is

P(w_i | w_1 w_2 ... w_{i−1}) ≈ P(w_i | w_{i−n+1} ... w_{i−1})
Ney, “Progress in dynamic programming Search for LVCSR,” Proceedings of the IEEE 88 (8), August 2000
The decoder follows the Bayes decision rule:

W* = argmax_{W = w_1 w_2 ... w_L} P(W | O) = argmax_{W = w_1 w_2 ... w_L} P(O | W) P(W)
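The n-gram constraint above can be illustrated with a toy bigram model; the corpus and counts below are hypothetical, purely for illustration:

```python
from collections import Counter

# Toy corpus of word sequences (hypothetical data, for illustration only).
corpus = [["the", "news", "report"], ["the", "weather", "report"]]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def p_bigram(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

# P(report | news) = 1/1, P(news | the) = 1/2
print(p_bigram("report", "news"), p_bigram("news", "the"))
```

A real LVCSR system would, of course, estimate smoothed probabilities over a large corpus rather than raw counts.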
13
Automatic Speech Recognition (2/3)
• Robust and Discriminative Speech Feature Extraction
– Polynomial-fit Histogram Equalization (PHEQ) approaches for robust speech feature extraction
– Improved Linear Discriminant Analysis (LDA) for discriminative speech feature extraction
• Acoustic Modeling
– Lightly-Supervised Training of Acoustic Models
– Data Selection for Discriminative Training of Acoustic Models (HMMs)
• Dynamic Language Model Adaptation
– Minimum Word Error (MWE) Discriminative Training
– Word Topical Mixture Models (WTMM)
– Position Information for Language Modeling
• Linguistic Decoding
– Syllable-level acoustic model look-ahead
Lin et al., “Exploring the use of speech features and their corresponding distribution characteristics for robust speech recognition,” to appear in IEEE Transactions on Audio, Speech and Language Processing
14
Automatic Speech Recognition (3/3)
• Transcription of PTS (Taiwan) Broadcast News

Automatic transcript (with recognition errors):
根據 最新 但 雨量 統計一 整天 下來費 雪 以 試辦 兩 個 水庫 的 雨量分別 是 五十 三 公里 和 二十 九 公厘對 水位 上升 幫助 不 大不過 就業機會 期間 也 多 在 夜間氣象局 也 針對 中部以北 及 東北部 地區 發佈 豪雨 特 報因此 還是 有 機會 增加 積水 區 的 降雨量此外 氣象局 也 預測華航 又 有 另 一道 鋒面 通過水利署 估計 如果 這 波 鋒面 能 帶來 跟著 會 差不多 的 雨水那 個 北 台灣 的 第二 階段 限水 時間渴望 見到 五月 以後公視 新聞 當時 匯率 採訪 報導

Manual (reference) transcript:
根據最新的雨量統計一整天下來翡翠石門兩個水庫的雨量分別是五十三公厘和二十九公厘對水位上升幫助不大不過由於集水區降雨多在夜間氣象局也針對中部以北及東北部地區發布了豪雨特報因此還是有機會增加集水區的降雨量此外氣象局也預測八號又有另一道鋒面通過水利署估計如果這波鋒面能帶來跟這回差不多的雨水那麼北台灣的第二階段限水時間可望延到五月以後公視新聞張玉菁陳柏諭採訪報導
[Figure: character error rate (CER, %) vs. training iteration (0–6), after 10 iterations of ML training, for MMI, MPE and MPE+FS. Relative character error rate reduction: MMI 6.5%, MPE 10.1%, MPE+FS 11.6%.]
15
Discriminative Training of Acoustic Models
• Training Objective: Minimum Phone Error (MPE)
• Selection of more critical phone arcs from the word lattice
F_MPE(λ) = Σ_r Σ_{W_i} [ p_λ(O_r | W_i) P(W_i) A(W_i, W_r) ] / [ Σ_{W_k} p_λ(O_r | W_k) P(W_k) ]

where A(W_i, W_r) is the phone accuracy of hypothesis W_i against the reference transcription W_r.

[Figure: an example word lattice (SIL – I – HAVE/MOVE – VARY/VEAL – IT – OFTEN – FINE/FAST – SIL); phone arcs q whose accuracy c(q) differs sufficiently from the average c_avg are selected for training.]
Liu et al., "Investigating data selection for minimum phone error training of acoustic models,“ IEEE ICME2007Liu et al., "Training data selection for improving discriminative training of acoustic models,“ IEEE ASRU2007
16
Categorization of Spoken Document Search Tasks
• Spoken Document Retrieval (SDR)
– Find spoken documents that are “relevant” to a given query
– Queries usually are long topic descriptions
– Exploits LVCSR and text IR technologies
– For broadcast news tasks, even with a word error rate (WER) of more than 30%, retrieval using automatic transcripts is comparable to that using reference transcripts
• Spoken Term Detection (STD)
– Much like Web-style search; has drawn much attention recently in the speech processing community
– Queries are usually short (1-3 words); the goal is to find the “matched” documents in which all query terms are present
– Relevance ranking is then performed on the “matched” documents
– Usually dealt with by using robust audio indexing mechanisms
17
Spoken Term Detection: Robust Audio Indexing
• Use of the 1-best speech recognition output as the transcription to be indexed is suboptimal due to the high WER, which is likely to lead to low recall
• Lattices or confusion networks provide a much better (oracle) WER
– E.g., Chelba et al. (2007 & 2008) proposed Position-Specific Posterior Probability Lattices (PSPL) for audio indexing
Chelba, et al., “ Retrieval and browsing of spoken content,” IEEE Signal Processing Magazine 25 (3), May 2008
18
Information Retrieval Models
• Information retrieval (IR) models can be characterized by two different matching strategies
– Literal term matching
• Match queries and documents in an index term space
– Concept matching
• Match queries and documents in a latent semantic space
[Example: is the query 「中共新一代空軍戰力」 (“the air combat capability of the PRC's new-generation air force”) relevant to a broadcast news transcript (containing recognition errors) reporting that, according to military observers cited by Hong Kong's Sing Tao Daily, Taiwan would completely lose air superiority by 2005 as mainland China's fighters surpass Taiwan's in both quantity and performance, citing large-scale imports of advanced Russian weapons (e.g., Sukhoi Su-30 fighters) alongside accelerated development of indigenous systems such as the improved “Flying Leopard” fighter? The query and the document share few literal terms, yet are topically related.]
19
IR Models: Literal Term Matching (1/2)
• Vector Space Model (VSM)
– Vector representations are used for queries and documents
– Each dimension is associated with an index term (TF-IDF weighting), describing the intra-/inter-document statistics between all terms and all documents (bag-of-words modeling)
– Cosine measure for query-document relevance
– VSM can be implemented with an inverted file structure for efficient document search (instead of exhaustive search)
sim(D_j, Q) = cosine(D_j, Q) = (D_j · Q) / (|D_j| |Q|) = Σ_{i=1}^{n} w_{i,j} w_{i,Q} / ( √(Σ_{i=1}^{n} w_{i,j}²) √(Σ_{i=1}^{n} w_{i,Q}²) )
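As a minimal sketch of the cosine measure (using raw term-frequency weights instead of the TF-IDF weighting described above; the vocabulary is hypothetical):

```python
import math
from collections import Counter

# Cosine similarity between bag-of-words vectors with raw TF weights.
def cosine(query_terms, doc_terms):
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

doc = ["air", "force", "fighter", "aircraft", "fighter"]
query = ["fighter", "aircraft"]
print(round(cosine(query, doc), 3))  # → 0.802
```

Note that only terms shared by the query and document contribute to the dot product, which is why an inverted file makes the search efficient.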
20
IR Models: Literal Term Matching (2/2)
• Hidden Markov Model (HMM)
– Also known as the Language Model (LM) approach
– A language model consisting of a set of n-gram distributions is established for each document for predicting the query
– Models can be optimized by the Maximum Likelihood Estimation (MLE) or Minimum Classification Error (MCE) training criteria
– Such approaches provide a potentially effective and theoretically attractive probabilistic framework for studying IR problems
Chen et al., “A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents,” ACM Transactions on Asian Language Information Processing 3(2), June 2004
A (unigram) document model predicts the query Q = w_1 w_2 ... w_L:

P(Q | M_D) = ∏_{i=1}^{L} [ λ P(w_i | M_D) + (1 − λ) P(w_i | M_C) ]

where M_D is the document model and M_C a collection (background) model.
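The document-model likelihood can be sketched as follows, with a hypothetical toy document and collection; `lam` plays the role of λ:

```python
from collections import Counter

# Sketch of the unigram document model with linear interpolation:
# P(Q|M_D) = prod_i [ lam * P(w_i|M_D) + (1 - lam) * P(w_i|M_C) ]
# Toy data, not the TDT collections used in the slides.
def query_likelihood(query, doc, collection, lam=0.7):
    d, c = Counter(doc), Counter(collection)
    p = 1.0
    for w in query:
        p_doc = d[w] / len(doc)        # P(w | M_D), unigram MLE
        p_col = c[w] / len(collection)  # P(w | M_C), background model
        p *= lam * p_doc + (1 - lam) * p_col
    return p

doc = ["taiwan", "air", "force", "news"]
collection = doc + ["weather", "report", "news", "sports"]
print(query_likelihood(["air", "news"], doc, collection))  # → 0.053125
```

The background model keeps the likelihood nonzero for query terms absent from the document, which is the practical point of the interpolation.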
21
MLE and MCE Estimation of Document Models
• MLE: Given a training set of query exemplars, maximize the likelihood of these query exemplars being generated by their corresponding relevant documents
– E.g., the expectation-maximization (EM) update formula of the mixture (transition) probability
• MCE: Maximize the difference between the normalized log-likelihood of the query exemplar generated by the relevant documents and that generated by top-ranked irrelevant documents
EM update of the mixture weight of document D over its set Q_R(D) of relevant training queries:

λ̂_D = (1 / |Q_R(D)|) Σ_{Q ∈ Q_R(D)} (1/|Q|) Σ_{q ∈ Q} n(q, Q) λ_D P(q | M_D) / [ λ_D P(q | M_D) + (1 − λ_D) P(q | M_C) ]

MCE misclassification measure:

E(Q, D) = −(1/|Q|) [ log P(Q | M_D) − max_{D′} log P(Q | M_{D′}) ], with D′ ranging over the top-ranked irrelevant documents
22
IR Models: Concept Matching (1/3)
• Latent Semantic Analysis (LSA)
– Start with an m×n matrix A describing the intra-/inter-document statistics between all terms and all documents
– Singular value decomposition (SVD) is then performed on the matrix to project all term and document vectors onto a reduced latent topical space
– Matching between words/queries and documents can be carried out in this latent topical space
A ≈ U Σ V^T, where A is the m×n term-document matrix (rows w_1 ... w_m, columns D_1 ... D_n), U is m×k, Σ is the k×k diagonal matrix of singular values, and V^T is k×n

sim(q̂, d̂) = cosine(q̂ Σ, d̂ Σ) = q̂ Σ² d̂^T / ( ‖q̂ Σ‖ ‖d̂ Σ‖ )
Bellegarda, “Latent semantic mapping,” IEEE Signal Processing Magazine 22(5), Sept. 2005
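A small sketch of LSA with NumPy, using a hypothetical 4-term × 3-document count matrix; the latent dimensionality k and the scaling of document vectors by the singular values follow the similarity measure above:

```python
import numpy as np

# Toy term-document matrix (hypothetical vocabulary and counts).
A = np.array([[2, 0, 1],   # term "fighter"
              [1, 0, 0],   # term "aircraft"
              [0, 3, 1],   # term "weather"
              [0, 1, 0]], dtype=float)  # term "rain"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Document vectors in the k-dimensional latent space, scaled by singular values.
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T

def latent_cos(i, j):
    a, b = doc_vecs[i], doc_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 2 share "fighter"; documents 1 and 2 share "weather".
print(latent_cos(0, 2), latent_cos(1, 2))
```

Matching in the truncated space lets a query and a document score as similar even when they share no literal terms, which is the motivation for concept matching.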
23
IR Models: Concept Matching (2/3)
• Probabilistic Latent Semantic Analysis (PLSA)
– A probabilistic framework for the above topical approach
– The relevance measure is not based on the frequency of a query term occurring in a document, but rather on the frequencies of the term and the document in the latent topics
– A query and a document thus may have a high relevance score even if they do not share any terms in common
P(w_i, D_j) = Σ_{k=1}^{K} P(w_i | T_k) P(T_k) P(D_j | T_k)

(a probabilistic analogue of the SVD decomposition, with P(w_i | T_k) playing the role of U, P(T_k) of Σ, and P(D_j | T_k) of V^T)

R(Q, D_j) = Σ_{k} P(T_k | Q) P(T_k | D_j) / ( √(Σ_k P(T_k | Q)²) √(Σ_k P(T_k | D_j)²) )
Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, Vol. 42, 2001
24
IR Models: Concept Matching (3/3)
• PLSA can also be viewed as an HMM or a topical mixture model (TMM)
– It explicitly interprets the document as a mixture model used to predict the query, which can be easily related to the conventional HMM modeling approaches widely studied in the speech processing community (topical distributions are tied among documents)
– Quite a few theoretically attractive model training algorithms can thus be applied, in supervised or unsupervised manners
P(Q | D_j) = ∏_{i=1}^{L} Σ_{k=1}^{K} P(w_i | T_k) P(T_k | D_j), for a query Q = w_1 w_2 ... w_L

(each document model D_j mixes the unigrams of K shared topics with document-specific weights P(T_k | D_j))
Chen, “Exploring the use of latent topical information for statistical Chinese spoken document retrieval,” Pattern Recognition Letters 27( 1), Jan. 2006
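The topical mixture score P(Q|D_j) = Π_i Σ_k P(w_i|T_k) P(T_k|D_j) can be sketched with hand-set (hypothetical) topic distributions:

```python
# Two toy topics (K = 2); P(w|T_k) values are hypothetical, hand-set for
# illustration rather than learned by EM as in the actual TMM approach.
p_w_given_t = {
    "election": [0.6, 0.1],
    "typhoon":  [0.1, 0.7],
}
p_t_given_d = [0.8, 0.2]  # P(T_k|D_j) for one document

def tmm_likelihood(query, p_t_d):
    p = 1.0
    for w in query:
        # Marginalize over the K latent topics for each query word.
        p *= sum(p_w_t * p_t for p_w_t, p_t in zip(p_w_given_t[w], p_t_d))
    return p

print(tmm_likelihood(["election", "typhoon"], p_t_given_d))  # → 0.11
```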
25
IR Models: Evaluation
• Experiments were conducted on the TDT2/TDT3 spoken document collections
– TDT2 for parameter tuning/training, TDT3 for evaluation
– E.g., mean average precision (mAP) tested on TDT3
• HMM/PLSA/TMM are trained in a supervised manner
• The language modeling approaches (HMM/PLSA/TMM/WTMM) yield significantly better results than the conventional statistical approaches (VSM/LSA) in the spoken document retrieval (SDR) task evaluated here
– Preliminary results also show that they are superior to SVM (Ranking-SVM) based approaches
mAP  VSM     LSA     HMM     PLSA    TMM (Supervised PLSA)  WTMM
TD   0.6505  0.6440  0.7174  0.6882  0.7870                 0.8009
SD   0.6216  0.6390  0.7156  0.6688  0.7937                 0.7858
26
Spoken Document Summarization
• Spoken document summarization, which aims to generate a summary automatically for spoken documents, is the key to better speech understanding and organization
• Extractive vs. Abstractive Summarization
– Extractive summarization selects a number of indicative sentences or paragraphs from the original document and concatenates them to form a summary
– Abstractive summarization rewrites a concise abstract that reflects the key concepts of the document
• Extractive spoken document summarization has gained much more attention in the recent past, since it can retain more of the speaker and emotion information inherent in the spoken documents
27
SDS: Proposed Framework (1/3)
• A Probabilistic Generative Framework for Sentence Selection (Ranking)
– Maximum a posteriori (MAP) criterion:

P(S_i | D) = P(D | S_i) P(S_i) / P(D) ∝ P(D | S_i) P(S_i)  (rank-equivalent)

• Sentence Generative Model, P(D | S_i)
– Each sentence of the document is treated as a probabilistic generative model
– Language Model (LM), Sentence Topical Mixture Model (STMM) and Word Topical Mixture Model (WTMM) are initially investigated
• Sentence Prior Distribution, P(S_i)
– The sentence prior distribution may have to do with sentence duration/position, correctness of sentence boundary, confidence score, prosodic information, etc. (e.g., they can be fused by the whole-sentence maximum entropy model)
Chen et al., “A probabilistic generative framework for extractive broadcast news speech summarization,” to appear in IEEE Transactions on Audio, Speech and Language Processing
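A minimal sketch of the MAP ranking criterion, combining a sentence generative score with a sentence prior in the log domain (all scores below are hypothetical):

```python
import math

# Rank sentences by log P(D|S_i) + log P(S_i):
# generative score plus sentence prior, as in the MAP criterion.
sentences = {
    "S1": {"gen": 1e-6, "prior": 0.5},
    "S2": {"gen": 5e-6, "prior": 0.2},
    "S3": {"gen": 1e-7, "prior": 0.9},
}

def map_score(s):
    return math.log(s["gen"]) + math.log(s["prior"])

ranked = sorted(sentences, key=lambda k: map_score(sentences[k]), reverse=True)
print(ranked)  # → ['S2', 'S1', 'S3']
```

Note how S3's strong prior cannot rescue its weak generative score, while S2's generative score outweighs its modest prior.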
28
SDS: Proposed Framework (2/3)
• A flowchart for the generative summarization framework
29
SDS: Proposed Framework (3/3)
• Features explored for modeling the sentence prior probability
– E.g., estimation of the relevance feature of a given spoken sentence
[Equation: the relevance feature is defined as the average similarity, avgSim, between the sentence and the documents in the set D_top-L of top-L documents retrieved using the sentence as a query.]
30
SDS: Evaluation
• Preliminary tests on 100 radio broadcast news stories collected in Taiwan (automatic transcripts with a 14.17% character error rate)
– The ROUGE-2 measure was used to evaluate the performance levels of different models
– The proposed models (LM-RT, WTMM) are consistently better than the other models at lower summarization ratios
• LM-RT and WTMM are trained in a purely unsupervised manner, without any document-summary relevance information (labeled data)
Ratio  VSM     MMR     LSA     DIM     SIG     SVM     LM-RT   WTMM
10%    0.3073  0.3073  0.3034  0.3187  0.3144  0.3425  0.3684  0.3836
20%    0.3188  0.3214  0.2926  0.3148  0.3259  0.3408  0.3696  0.3772
30%    0.3593  0.3678  0.3286  0.3383  0.3428  0.3719  0.3840  0.3728
50%    0.4485  0.4501  0.3906  0.4345  0.4666  0.4660  0.4884  0.4615
31
Spoken Document Organization (1/3)
• Each document is viewed as a TMM model used to generate itself
– Additional transitions between topical mixtures capture the topological relationships between topical classes on a 2-D map
[Figure: a two-dimensional tree structure for organized topics; each document D_j = w_1 w_2 ... w_L is generated by a document model over the topic map.]

E(T_k, T_l) = (1 / (√(2π) σ)) exp( − dist(T_k, T_l)² / (2σ²) )

P(T_l | T_k) = E(T_k, T_l) / Σ_{s=1}^{K} E(T_k, T_s)

P(w_i | D_j) = Σ_{k=1}^{K} P(T_k | D_j) Σ_{l=1}^{K} P(T_l | T_k) P(w_i | T_l)
32
Spoken Document Organization (2/3)
• Document models can be trained in an unsupervised way by maximizing the total log-likelihood of the document collection
• Initial evaluation results (conducted on the TDT2 collection)
– TMM-based approach is competitive to the conventional Self-Organization Map (SOM) approach
L_T = Σ_{j=1}^{n} Σ_{i=1}^{V} c(w_i, D_j) log P(w_i | D_j)
Model  Iterations  distBet/distWithin
TMM    10          1.9165
SOM    100         2.0604
33
Spoken Document Organization (3/3)
• Each topical class can be labeled by words selected using the following criterion
• An example map for international political news
Sig(w_i, T_k) = [ Σ_{j=1}^{n} c(w_i, D_j) P(T_k | D_j) ] / [ Σ_{j=1}^{n} c(w_i, D_j) (1 − P(T_k | D_j)) ]
34
Prototype Systems Developed at NTNU (1/3)
• Spoken Document Retrieval
http://sdr.csie.ntnu.edu.tw
35
Prototype Systems Developed at NTNU (2/3)
• Speech Retrieval and Browsing of Digital Archives
36
Prototype Systems Developed at NTNU (3/3)
• Speech-based Information Retrieval for ITS systems
– Projects supported by Taiwan Network Information Center (TWNIC)
Query: 我想找方圓五公里 賣臭豆腐的店家 (“I want to find stores selling stinky tofu within a five-kilometer radius”)
37
Conclusions and Future Work
• Multimedia information access (over the Web) using speech will be very promising in the near future
– Speech is the key for multimedia understanding and organization
– Several task domains still remain challenging
• Spoken document retrieval provides good assistance for companies, for instance, in
– Contact (call)-center conversations: monitoring agent conduct and customer satisfaction, increasing service efficiency
– Content-providing services, such as MOD (Multimedia on Demand): providing a better way to retrieve and browse desired program contents
Special Issue on Spoken Language Technology, IEEE Signal Processing Magazine, 25 (3), May 2008
Thank You!
39
References
[R1] B. Chen et al., “A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents,” ACM Transactions on Asian Language Information Processing, Vol. 3, No. 2, June 2004
[R2] J.R. Bellegarda, “Latent Semantic Mapping,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R3] K. Koumpis and S. Renals, “Content-based access to spoken audio,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R4] L.S. Lee and B. Chen, “Spoken Document Understanding and Organization,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R5] T. Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, Vol. 42, 2001
[R6] B. Chen, “Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval,” Pattern Recognition Letters, Vol. 27, No. 1, Jan. 2006
[R7] B. Chen, Y.T. Chen, "Extractive Spoken Document Summarization for Information Retrieval," Pattern Recognition Letters, Vol. 29, No. 4, March 2008
40
The Informedia System at CMU
• Video Analysis and Content Extraction (VACE)– http://www.informedia.cs.cmu.edu/
41
AT&T SCAN System
• SCAN: Speech Content Based Audio Navigator (1999)
– Design and evaluate user interfaces to support retrieval from speech archives
42
BBN Rough’n’Ready System (1/2)
• Distinguished Architecture for Audio Indexing and Retrieval (2002)
43
BBN Rough’n’Ready System (2/2)
• Automatic Structural Summarization for Broadcast News
44
SpeechBot Audio/Video Search System at HP Labs
• An experimental Web-based tool from HP Labs that used voice recognition to create searchable keyword transcripts from thousands of hours of audio content
45
NTT Speech Communication Technology for Contact Centers
– CSR: Customer Service Representative
46
Google Voice Search (1/2)
• Google Voice Local Search
http://www.google.com/goog411/
47
Google Voice Search (2/2)
• Google Audio Indexing
– Searching what people are saying inside YouTube videos (currently only for what the politicians are saying)
http://labs.google.com/gaudi
48
MIT Lecture Browser
• Retrieval and browsing of academic lectures of various categories
49
Noise Interference (1/2)
• Speech recognition performance degrades seriously when there is a mismatch between training and test acoustic conditions
– However, training systems for all noise conditions is impractical
• A Simplified Distortion Framework
– Channel effects are usually assumed to be constant while uttering
– Additive noises can be either stationary or non-stationary
[Figure: clean speech x(t) passes through a channel h(t) (convolutional noise) and is corrupted by background noise n(t) (additive noise), yielding the noisy speech]

y(t) = x(t) ⊗ h(t) + n(t)
50
Noise Interference (2/2)

• Additive environmental noises cause non-linear distortions in the speech feature space (due to logarithmic operations in feature extraction)
– When clean speech is corrupted by 10 dB subway noise, non-linear distortion effects are observed
SNR(dB) = 10 log₁₀( E_S / E_N ), with the signal energy measured in the time domain as E = (1/T) Σ_{n=0}^{T−1} s(n) s*(n)
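The energy and SNR definitions can be sketched as follows (toy Gaussian signals stand in for speech and noise):

```python
import numpy as np

# Time-domain signal energy, E = (1/T) * sum s(n) * conj(s(n)),
# and the resulting SNR in dB.
def energy(s):
    return np.mean(s * np.conj(s)).real

def snr_db(clean, noise):
    return 10.0 * np.log10(energy(clean) / energy(noise))

rng = np.random.default_rng(1)
clean = rng.normal(0, 1.0, 16000)   # toy "speech", unit variance
noise = rng.normal(0, 0.1, 16000)   # additive noise, variance 0.01
print(round(snr_db(clean, noise), 1))  # close to 20 dB
```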
51
Categorization of Robustness Approaches (1/4)
• Model Compensation
– Adapt acoustic models to match the corrupted speech features
• Feature Compensation
– Restore corrupted speech features to the corresponding clean ones
[Figure: clean speech under training conditions yields clean acoustic models; noisy speech under test conditions yields noisy acoustic models. Feature compensation operates in the feature space, while model compensation operates in the model space.]
52
Categorization of Robustness Approaches (2/4)
• Model Compensation
– Adapt acoustic models to match the corrupted speech features
– Representative approaches: MAP, MLLR, MAPLR, etc.
• Feature Compensation
– Restore corrupted speech features to the corresponding clean ones
– Representative approaches: SS, CMN, CMVN, etc.
[Figure: model compensation transforms clean models into corrupted models to match the corrupted speech features; feature compensation applies a compensation method to recover clean speech features from the corrupted ones.]
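As a sketch of one representative feature-compensation method, cepstral mean and variance normalization (CMVN) normalizes each feature dimension of an utterance to zero mean and unit variance:

```python
import numpy as np

# Cepstral mean and variance normalization (CMVN): per-utterance,
# per-dimension normalization to zero mean and unit variance.
def cmvn(features):
    """features: (num_frames, num_dims) array of cepstral features."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-8)

feats = np.random.randn(200, 13) * 3.0 + 5.0  # toy 13-dim utterance
norm = cmvn(feats)
print(norm.mean(axis=0).round(6).max(), norm.std(axis=0).round(6).max())
```

Because both clean and noisy utterances end up with the same first- and second-order statistics, the mismatch between training and test conditions is reduced.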
53
Categorization of Robustness Approaches (3/4)
• Alternatively, robustness methods can be classified into two categories according to whether they operate on the feature domain itself or on the corresponding noise-resistant distribution characteristics
– Each category has its own advantages and limitations
– Approaches from the two categories can complement each other
54
Categorization of Robustness Approaches (4/4)
Noisy Speech
Feature DomainStatistical Distribution
Cepstral Mean Normalization[Furui 1981]
Cepstral Mean/Variance Normalization[Viikki et al. 1998]
High Order Cepstral Moment Normalization
[Hsu et al. 2006]
Histogram Equalization[Molau 2003; Torre et al. 2005]
Quantile-based Histogram
Equalization[Hilger et al. 2006]
Probabilistic Optimum Filtering[Neumeyer et al. 1994]
Codeword Dependent Cepstral Normalization
[Acero 1990]
Feature Transformation
Heteroscedastic Linear Discriminant Analysis[Kumar 1997]
Kernel Linear Discriminant Analysis[Mika 1999]
Principal Component Analysis
Linear Discriminant Analysis
[Duda et al. 1973]
Heteroscedastic Discriminant Analysis[Saon et al. 2000]
Feature Compensation
Stereo-based Piecewise Linear Compensation
[Deng et al. 2000]
Stochastic Matching[Sankar et al. 1994]
Feature Normalization
Maximum Likelihood Stochastic Vector Mapping
[Wu et al. 2005]
Discriminative Stochastic Vector Mapping
Minimum Classification Error[Wu et al. 2006]
Maximum Mutual Information[Droppo et al. 2005]
Feature Reconstruction
Missing FeatureCluster-based[Raj et al. 2004]
Covariance-based[Raj et al. 2004]
55
Roots of Histogram Equalization (HEQ)
• Two basic assumptions of HEQ
– Each feature vector dimension can be normalized independently of the others (i.e., in a dimension-wise manner)
– The effect of noise distortion is a monotonic (or order-preserving), non-linear transformation of the speech feature space
• The transformed speech feature distributions of the test (or noisy) data should be identical to those of the training (or reference) data
[Figure: the HEQ transformation function maps a test feature value x with CDF C_Test(x) to the reference value y = C_Train⁻¹(C_Test(x)). A monotonic (but nonlinear) transformation preserves the ordering of feature values (e.g., −3.1, −2.5, −0.3, 1.7, 3.2, 7.6 → −2.6, −1.3, 0.2, 1.9, 4.3, 7.8), whereas random behavior breaks the ordering (→ −2.6, 0.2, 0.3, 1.9, 4.3, 7.8).]
56
Table-lookup Histogram Equalization (THEQ) (1/2)
• Because only a finite number of speech features are considered, cumulative histograms are used instead of the CDFs
– Table-lookup Histogram Equalization (THEQ)
• THEQ
– A table of {(quantile i, restored feature value)} pairs is constructed during the training phase
– To achieve good performance, the table size cannot be too small
• This requires huge disk storage
• Table lookup is also time-consuming
57
Table-lookup Histogram Equalization (THEQ) (2/2)
• Schematic diagram of THEQ
58
Quantile-based Histogram Equalization (QHEQ)
• QHEQ attempts to calibrate the CDF of each feature vector component of the test data to that of the training data in a quantile-corrective manner
– Instead of full matching of the cumulative histograms
• A parametric transformation function is used:

H(x) = Q_K [ α (x / Q_K)^γ + (1 − α)(x / Q_K) ]

• For each sentence, the optimum parameters α and γ are obtained from the quantile correction step:

(α̂, γ̂) = argmin_{α,γ} Σ_{k=1}^{K} ( H(Q_k) − Q_k^train )²

– An exhaustive online grid search is required: time-consuming
59
Polynomial-Fit Histogram Equalization (PHEQ) (1/2)
• We propose to use least squares regression for fitting the inverse function of the CDFs of training speech
– For each speech feature vector dimension of the training data, given a feature value y_i and its corresponding CDF value C_Train(y_i), a polynomial function can be expressed as follows:

ỹ_i = G(C_Train(y_i)) = Σ_{m=0}^{M} a_m [C_Train(y_i)]^m

– The corresponding squared error is

E′ = Σ_{i=1}^{N} ( y_i − Σ_{m=0}^{M} a_m [C_Train(y_i)]^m )²

– The coefficients a_m can be estimated by minimizing the squared error
60
Polynomial-Fit Histogram Equalization (PHEQ) (2/2)
• Estimation of CDF Values
– For each feature vector dimension, given data Y = {y_1, y_2, ..., y_N}, the CDF value of each frame can be estimated using the following steps
• The values y_1, ..., y_N are sorted in ascending order
• The CDF value of each frame is approximated by

C(y_i) ≈ S_pos(y_i) / N

– where S_pos(y_i) is an indicator function giving the position of y_i in the sorted data
– During recognition
• The CDF value C(y_i) of each test frame is estimated and taken as the input to the corresponding inverse function G to obtain a restored feature component
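A minimal sketch of the PHEQ idea: estimate each value's CDF by its rank position, fit a polynomial from CDF values back to the training feature values, and restore test features through that polynomial (synthetic data; a monotonic distortion stands in for noise):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)   # one clean feature dimension
test = train * 2.0 + 0.5             # monotonically distorted version

# Empirical CDF of each value: rank position / N, as in the slides.
def empirical_cdf(x):
    order = np.argsort(np.argsort(x))  # 0-based rank of each element
    return (order + 1) / len(x)

# Least-squares polynomial fit of the inverse training CDF: y ~ G(C(y)).
coeffs = np.polyfit(empirical_cdf(train), train, deg=7)

# Restoration: map each test value's empirical CDF through G.
restored = np.polyval(coeffs, empirical_cdf(test))
print(np.mean(np.abs(restored - train)))  # small restoration error
```

Because the distortion here is monotonic, the test data's ranks (and hence CDF values) match the training data's, so the polynomial maps the distorted features back near their clean values.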
61
Spatial-Temporal Feature Distribution Characteristics (STDC) (1/2)
• The feature vector components of different dimensions are not completely de-correlated
• The effect of noise distortion may not be a monotonic transformation of the speech feature space
– However, speech feature distribution characteristics are robust in the face of noise interference
• Since speech signals are slowly time-varying, the contextual relationships of consecutive speech frames might provide additional information clues for feature normalization
• We therefore propose a feature-distribution-feature transformation incorporating spatial-temporal information clues for speech compensation
62
Spatial-Temporal Feature Distribution Characteristics (STDC) (2/2)
• Schematic diagram of STDC

[Figure: the feature vector sequence ..., X_{l−K}, ..., X_{l−1}, X_l, X_{l+1}, ..., X_{l+K}, ... is mapped to the corresponding CDF vector sequence C_{l−K}, ..., C_l, ..., C_{l+K}; the spliced CDF vector S_l is transformed by a matrix A into the restored feature vector]

X̂_l = A^T S_l

The transformation matrix A is estimated by minimizing E² = Σ_{l=1}^{N} ‖X_l − X̂_l‖²
63
Comparison of Various HEQ Approaches
THEQ
– Storage requirement: large, depending on the number of reference pairs kept in the look-up table
– Computational complexity: medium, depending on the look-up table size for searching the corresponding restored value
QHEQ
– Storage requirement: small, depending on the number of quantiles for quantile correction
– Computational complexity: high, depending on the value ranges and resolutions of the parameters for online grid search
PHEQ
– Storage requirement: small, depending on the order of the polynomial functions
– Computational complexity: low, depending on the order of the polynomial function
STDC
– Storage requirement: small, depending on the frame number used for transformation
– Computational complexity: low, depending on the frame number used for transformation
64
Experimental Results
• The speech recognition experiments were conducted under various noise conditions using the Aurora-2 database and task (clean-condition training) – word error rate (%) results
Method      Set A   Set B   Set C   Average
MFCC        41.06   41.52   40.03   41.04
THEQ        22.76   21.16   23.47   22.47
QHEQ        23.53   21.90   22.36   22.64
PHEQ        20.98   20.17   21.43   20.75
STDC        18.02   16.98   18.97   17.99
MVA         18.38   16.14   21.81   18.17
ETSI        38.69   44.25   28.76   38.93
CMVN        27.73   24.60   27.17   26.37
HLDA-CMVN   21.63   21.37   21.59   21.52
65
Word Topical Mixture Model (1/4)
• In this research, each word of the language is treated as a word topical mixture model (WTMM) for predicting the occurrences of other words
• WTMM in LM Adaptation
– Each history consists of the preceding words
– The history model is treated as a composite word TMM
– The history model of a decoded word can be dynamically constructed
P(w_i | M_{w_j}) = Σ_{k=1}^{K} P(w_i | T_k) P(T_k | M_{w_j})

P(w_i | H_{w_i}) = Σ_{k=1}^{K} P(w_i | T_k) P(T_k | M_{H_{w_i}}) = Σ_{k=1}^{K} P(w_i | T_k) Σ_{j=1}^{i−1} α_j P(T_k | M_{w_j})
||||
66
Word Topical Mixture Model (2/4)
• Exploration of Training Exemplars
– Collect the words within a context window around each occurrence of a word w_j in the training corpus
– Concatenate them to form the relevant observations Q_{w_j} = {Q_{w_j,1}, Q_{w_j,2}, ..., Q_{w_j,N}} for training the word TMM M_{w_j}
– Maximize the sum of log-likelihoods of the WTMM models generating their corresponding training exemplars:

log L = Σ_{w_j} log P(Q_{w_j} | M_{w_j}) = Σ_{w_j} Σ_{Q ∈ Q_{w_j}} Σ_{w ∈ Q} n(w, Q) log P(w | M_{w_j})
jw jw
67
Word Topical Mixture Model (3/4)
• Recognition using WTMM models
– A simple linear combination of the WTMM models of the words occurring in the search history H = w_1, w_2, ..., w_{i−1}
– The weights α_1, ..., α_{i−1} are empirically set to decay exponentially as the words in the history are farther from the current decoded word w_i

[Figure: the component models P(w_i | M_{w_1}), P(w_i | M_{w_2}), ..., P(w_i | M_{w_{i−1}}) are combined with weights α_1, α_2, ..., α_{i−1} into a composite word TMM model for the search history.]
68
Word Topical Mixture Model (4/4)
• Experiments: Comparison of WTMM, PLSALM, TBLM
[Figure: perplexity (PP, roughly 450–650) and character error rate (CER, roughly 19.5–20.7%) for WTMM, PLSA, TBLM_MI and TBLM_TFIDF at four settings: 16 topics/50K pairs, 32 topics/80K pairs, 64 topics/400K pairs and 128 topics/900K pairs.]
WTMM performs slightly better than PLSALM and TBLM in CER measure
TBLM trained with MI score performs better in PP (perplexity) measure
69
SDS: Other Approaches (1/4)
• Vector Space Model (VSM)
– Vector representations of the sentences and the document to be summarized, using statistical weighting such as TF-IDF
– Sentences are ranked based on their proximity to the document
– To summarize more important and different concepts in a document
• The terms occurring in the sentence with the highest relevance score are removed from the document
• The document vector is then reconstructed and the ranking of the remaining sentences is performed accordingly
• Or, using the Maximum Marginal Relevance (MMR) model:

NextSen = argmax_{S_l} [ λ Sim(S_l, D) − (1 − λ) max_{S_i ∈ Summ} Sim(S_l, S_i) ]

– Summ: the set of already selected sentences
– Sim(S_l, D): similarity between sentence S_l and the document D
Y. Gong, SIGIR 2001
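The MMR selection loop can be sketched as follows (the similarity values are hypothetical; with λ = 0.5 the redundant sentence S2 is skipped):

```python
# MMR: at each step pick the sentence most similar to the document but
# least similar to the sentences already selected.
def mmr_select(sim_doc, sim_between, n, lam=0.5):
    selected = []
    candidates = list(sim_doc)
    while candidates and len(selected) < n:
        def score(s):
            redundancy = max((sim_between[(s, t)] for t in selected), default=0.0)
            return lam * sim_doc[s] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

sim_doc = {"S1": 0.9, "S2": 0.85, "S3": 0.4}  # similarity to the document
sim_between = {("S2", "S1"): 0.95, ("S1", "S2"): 0.95,
               ("S3", "S1"): 0.1, ("S1", "S3"): 0.1,
               ("S3", "S2"): 0.1, ("S2", "S3"): 0.1}
print(mmr_select(sim_doc, sim_between, 2))  # → ['S1', 'S3']
```

S2 scores nearly as high as S1 against the document, but its 0.95 similarity to the already-selected S1 makes it redundant, so the more novel S3 is chosen instead.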
70
SDS: Other Approaches (2/4)
• Latent Semantic Analysis (LSA)
– Construct an m×N “term-sentence” matrix for a given document
– Perform SVD on the “term-sentence” matrix: A = U Σ V^T
• The right singular vectors with larger singular values represent the dimensions of the more important latent semantic concepts in the document
• Each sentence of the document is represented as a vector in the latent semantic space
– Sentences with the largest index (element) values in each of the top L right singular vectors are included in the summary (LSA-1)
– Sentences also can be selected based on their norms (LSA-2):

Score(S_i) = √( Σ_{r=1}^{L} (v_{ir} σ_r)² )
Hirohata et al., ICASSP2005
Y. Gong, SIGIR 2001
71
SDS: Other Approaches (3/4)
• Sentence Significance Score (SenSig)
– Sentences are ranked based on their significance, which, for example, is defined by the average importance score of the words in the sentence
– Other features such as word confidence, linguistic score, or prosodic information can be further integrated into this method
SenSig(S) = (1 / N_S) Σ_{n=1}^{N_S} I(w_n), with I(w_n) = f_{w_n} log( F_C / F_{w_n} )  (similar to TF-IDF weighting)
S. Furui et al., IEEE SAP 12(4), 2004
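A small sketch of the significance score with hypothetical corpus counts (here f_w is simply taken as 1 per occurrence, and F_C as the total corpus count):

```python
import math

# Sentence significance as the average word importance I(w), with
# I(w) = f_w * log(F_C / F_w), a TF-IDF-like weighting.
corpus_freq = {"typhoon": 20, "rain": 50, "the": 5000}  # hypothetical F_w
total = sum(corpus_freq.values())                        # F_C

def importance(w, f_w):
    return f_w * math.log(total / corpus_freq[w])

def sen_sig(sentence_words):
    return sum(importance(w, 1) for w in sentence_words) / len(sentence_words)

# A content-bearing sentence outscores one made of frequent function words.
print(sen_sig(["typhoon", "rain"]) > sen_sig(["the", "the"]))  # → True
```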
72
SDS: Other Approaches (4/4)
• Classification-Based Methods
– Sentence selection is formulated as a binary classification problem
– A sentence can either be included in a summary or not
• A number of classification-based methods using statistical features have been developed
– Gaussian mixture models (GMM)
– Bayesian network classifiers (BN)
– Support vector machines (SVM)
– Logistic regression (LR)
• However, the above methods need a set of training documents together with their corresponding handcrafted summaries (or labeled data) for training the classifiers