2008/11/06
Spoken Document Retrieval and Summarization
語音文件檢索與摘要
Berlin Chen (陳柏琳), Associate Professor
Department of Computer Science & Information Engineering, National Taiwan Normal University
2
Current Research Interests
• Automatic Speech Recognition
– Feature Extraction: discriminative and robust features
– Acoustic Modeling: discriminative and unsupervised training
– Language Modeling: beyond n-grams, discriminative and unsupervised training
– Search: acoustic look-ahead techniques
• Speech Retrieval, Summarization and Mining
– Robust indexing mechanisms and trainable retrieval models
– Speech summarization exploring probabilistic generative frameworks (beyond heuristic approaches)
3
Outline
• Introduction
• Related Research Work and Applications
• Key Techniques Developed at NTNU
– Automatic Speech Recognition
– Spoken Document Retrieval
– Spoken Document Summarization
• Prototype Systems Developed at NTNU
• Conclusions and Future Work
4
Introduction (1/3)
• Multimedia (audio-visual content) associated with speech is continuously growing and filling our computers, networks and lives
– For example, broadcast news, lectures, shows, voice mails, (contact-center) conversations, etc.
– Speech is the most semantic (or information)-bearing medium
• On the other hand, speech is the primary and most convenient means of communication between people
– Speech provides a better (more natural) user interface in wireless environments, especially on smaller hand-held devices
• Speech will be the key for multimedia information access in the near future
5
Introduction (2/3)
[Figure: scenario for multimedia information access using speech. Users interact over networks with multimedia network content through multimodal interaction (spoken dialogues, information extraction and retrieval) and multimedia content processing (spoken document understanding and organization). Processing modules applied to a Chinese broadcast news archive: named entity extraction, segmentation, topic analysis and organization, summarization, title generation, and information retrieval; given an input query and user instructions, the system returns retrieval results, titles/summaries, and a two-dimensional tree structure for organized topics. A companion diagram relates documents d_1 ... d_N to a query Q = t_1 t_2 ... t_n through K latent topics T_1 ... T_K via the probabilities P(d_i), P(T_k | d_i) and P(t_j | T_k).]
• Scenario for Multimedia Information Access Using Speech
6
Introduction (3/3)
• Organization and retrieval of multimedia (or spoken) documents are much more difficult
– Written text documents are better structured and easier to browse through
• Provided with titles and other structural information
• Easily shown on the screen to glance through (with visual perception)
– Multimedia (spoken) documents are just video (audio) signals
• Users cannot efficiently go through each one from beginning to end during browsing, even if they are automatically transcribed by automatic speech recognition
• However, the abundant speaker, emotion and scene information makes them much more attractive than text
• Better approaches for efficient organization and retrieval of multimedia (spoken) documents are in demand
7
Related Research Work and Applications
• Continuous and substantial efforts have been devoted to (multimedia) spoken document recognition, retrieval and organization in the recent past
– Informedia System at Carnegie Mellon Univ.
– AT&T SCAN System
– Rough’n’Ready System at BBN Technologies
– SpeechBot Audio/Video Search System at HP Labs
– IBM Spoken Document Retrieval for Call-Center Conversations, Natural Language Call-Routing, Voicemail Retrieval
– NTT Speech Communication Technology for Contact Centers
– Google Voice Search
– MIT Lecture Browser
8
Key Techniques (1/2)
• Automatic Speech Recognition (ASR)
– Automatically convert speech signals into sequences of words or other suitable units for further processing
• Spoken Document Segmentation
– Automatically segment speech signals (or automatically transcribed word sequences) into a set of documents (or short paragraphs), each of which has a central topic
• Named Entity Extraction from Spoken Documents
– Personal names, organization names, location names, event names
– Very often out-of-vocabulary (OOV) words, difficult for recognition
• E.g., “蔡煌郎”, “九二共識”, etc.
• Audio Indexing and Retrieval
– Robust representation of the spoken documents
– Matching between (spoken) queries and spoken documents
Lee and Chen, “Spoken document understanding and organization,” IEEE Signal Processing Magazine 22 (5), Sept. 2005
9
Key Techniques (2/2)
• Summarization for Spoken Documents
– Automatically generate a summary (in text or speech form) for each spoken document or a set of topic-coherent documents
• Title Generation for Multimedia/Spoken Documents
– Automatically generate a title (in text/speech form) for each short document; i.e., a very concise summary indicating the themes of the document
• Topic Analysis and Organization for Spoken Documents
– Analyze the subject topics of (retrieved) documents
– Organize the subject topics of documents into graphic structures for efficient browsing
• Information Extraction for Spoken Documents
– Extract key information such as who, when, where, what and how from the information described by spoken documents
10
An Example System for Chinese Broadcast News (1/2)
• For example, a prototype system developed at NTU for efficient spoken document retrieval and browsing [R4]
11
• Users can browse spoken documents in top-down and bottom-up manners
http://sovideo.iis.sinica.edu.tw/NeGSST/Index.htm
An Example System for Chinese Broadcast News (2/2)
12
Automatic Speech Recognition (1/3)
• Large Vocabulary Continuous Speech Recognition (LVCSR)
– The speech signal is converted into a sequence of feature vectors
– The pronunciation lexicon is structured as a tree
– Due to the constraints of n-gram language modeling, a word’s occurrence depends on its previous n−1 words
– Search through all possible lexical tree copies (word hypotheses) from the start time to the end time of the utterance to find the best sequence among the word hypotheses
Given the acoustic observations O, the n-gram approximation is

P(w_i | w_1 w_2 ... w_{i−1}) ≈ P(w_i | w_{i−n+1} ... w_{i−1})
Ney, “Progress in dynamic programming Search for LVCSR,” Proceedings of the IEEE 88 (8), August 2000
The decoder follows the Bayes decision rule:

W* = argmax_{W = w_1 w_2 ... w_L} P(W | O) = argmax_{W = w_1 w_2 ... w_L} P(O | W) P(W)
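The n-gram constraint above can be illustrated with a toy bigram model; the corpus and counts below are hypothetical, purely for illustration:

```python
from collections import Counter

# Toy corpus of word sequences (hypothetical data, for illustration only).
corpus = [["the", "news", "report"], ["the", "weather", "report"]]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def p_bigram(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

# P(report | news) = 1/1, P(news | the) = 1/2
print(p_bigram("report", "news"), p_bigram("news", "the"))
```

A real LVCSR system would, of course, estimate smoothed probabilities over a large corpus rather than raw counts.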
13
Automatic Speech Recognition (2/3)
• Robust and Discriminative Speech Feature Extraction
– Polynomial-fit Histogram Equalization (PHEQ) approaches for robust speech feature extraction
– Improved Linear Discriminant Analysis (LDA) for discriminative speech feature extraction
• Acoustic Modeling
– Lightly-Supervised Training of Acoustic Models
– Data Selection for Discriminative Training of Acoustic Models (HMMs)
• Dynamic Language Model Adaptation
– Minimum Word Error (MWE) Discriminative Training
– Word Topical Mixture Models (WTMM)
– Position Information for Language Modeling
• Linguistic Decoding
– Syllable-level acoustic model look-ahead
Lin et al., “Exploring the use of speech features and their corresponding distribution characteristics for robust speech recognition,” to appear in IEEE Transactions on Audio, Speech and Language Processing
14
Automatic Speech Recognition (3/3)
• Transcription of PTS (Taiwan) Broadcast News

Automatic transcript (with recognition errors):
根據 最新 但 雨量 統計一 整天 下來費 雪 以 試辦 兩 個 水庫 的 雨量分別 是 五十 三 公里 和 二十 九 公厘對 水位 上升 幫助 不 大不過 就業機會 期間 也 多 在 夜間氣象局 也 針對 中部以北 及 東北部 地區 發佈 豪雨 特 報因此 還是 有 機會 增加 積水 區 的 降雨量此外 氣象局 也 預測華航 又 有 另 一道 鋒面 通過水利署 估計 如果 這 波 鋒面 能 帶來 跟著 會 差不多 的 雨水那 個 北 台灣 的 第二 階段 限水 時間渴望 見到 五月 以後公視 新聞 當時 匯率 採訪 報導

Manual (reference) transcript:
根據最新的雨量統計一整天下來翡翠石門兩個水庫的雨量分別是五十三公厘和二十九公厘對水位上升幫助不大不過由於集水區降雨多在夜間氣象局也針對中部以北及東北部地區發布了豪雨特報因此還是有機會增加集水區的降雨量此外氣象局也預測八號又有另一道鋒面通過水利署估計如果這波鋒面能帶來跟這回差不多的雨水那麼北台灣的第二階段限水時間可望延到五月以後公視新聞張玉菁陳柏諭採訪報導
[Figure: character error rate (CER, %) vs. training iteration (0–6), after 10 iterations of ML training, for MMI, MPE and MPE+FS. Relative character error rate reduction: MMI 6.5%, MPE 10.1%, MPE+FS 11.6%.]
15
Discriminative Training of Acoustic Models
• Training Objective: Minimum Phone Error (MPE)
• Selection of more critical phone arcs from the word lattice
F_MPE(λ) = Σ_r Σ_{W_i} [ p_λ(O_r | W_i) P(W_i) A(W_i, W_r) ] / [ Σ_{W_k} p_λ(O_r | W_k) P(W_k) ]

where A(W_i, W_r) is the phone accuracy of hypothesis W_i against the reference transcription W_r.

[Figure: an example word lattice (SIL – I – HAVE/MOVE – VARY/VEAL – IT – OFTEN – FINE/FAST – SIL); phone arcs q whose accuracy c(q) differs sufficiently from the average c_avg are selected for training.]
Liu et al., "Investigating data selection for minimum phone error training of acoustic models,“ IEEE ICME2007Liu et al., "Training data selection for improving discriminative training of acoustic models,“ IEEE ASRU2007
16
Categorization of Spoken Document Search Tasks
• Spoken Document Retrieval (SDR)
– Find spoken documents that are “relevant” to a given query
– Queries usually are long topic descriptions
– Exploits LVCSR and text IR technologies
– For broadcast news tasks, even with a word error rate (WER) of more than 30%, retrieval using automatic transcripts is comparable to that using reference transcripts
• Spoken Term Detection (STD)
– Much like Web-style search; has drawn much attention recently in the speech processing community
– Queries are usually short (1-3 words); the goal is to find the “matched” documents in which all query terms are present
– Relevance ranking is then performed on the “matched” documents
– Usually dealt with by using robust audio indexing mechanisms
17
Spoken Term Detection: Robust Audio Indexing
• Use of the 1-best speech recognition output as the transcription to be indexed is suboptimal due to the high WER, which is likely to lead to low recall
• Lattices or confusion networks provide a much better (oracle) WER
– E.g., Chelba et al. (2007 & 2008) proposed Position-Specific Posterior Probability Lattices (PSPL) for audio indexing
Chelba, et al., “ Retrieval and browsing of spoken content,” IEEE Signal Processing Magazine 25 (3), May 2008
18
Information Retrieval Models
• Information retrieval (IR) models can be characterized by two different matching strategies
– Literal term matching
• Match queries and documents in an index term space
– Concept matching
• Match queries and documents in a latent semantic space
[Example: is the query 「中共新一代空軍戰力」 (“the air combat capability of the PRC's new-generation air force”) relevant to a broadcast news transcript (containing recognition errors) reporting that, according to military observers cited by Hong Kong's Sing Tao Daily, Taiwan would completely lose air superiority by 2005 as mainland China's fighters surpass Taiwan's in both quantity and performance, citing large-scale imports of advanced Russian weapons (e.g., Sukhoi Su-30 fighters) alongside accelerated development of indigenous systems such as the improved “Flying Leopard” fighter? The query and the document share few literal terms, yet are topically related.]
19
IR Models: Literal Term Matching (1/2)
• Vector Space Model (VSM)
– Vector representations are used for queries and documents
– Each dimension is associated with an index term (TF-IDF weighting), describing the intra-/inter-document statistics between all terms and all documents (bag-of-words modeling)
– Cosine measure for query-document relevance
– VSM can be implemented with an inverted file structure for efficient document search (instead of exhaustive search)
sim(D_j, Q) = cosine(D_j, Q) = (D_j · Q) / (|D_j| |Q|) = Σ_{i=1}^{n} w_{i,j} w_{i,Q} / ( √(Σ_{i=1}^{n} w_{i,j}²) √(Σ_{i=1}^{n} w_{i,Q}²) )
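As a minimal sketch of the cosine measure (using raw term-frequency weights instead of the TF-IDF weighting described above; the vocabulary is hypothetical):

```python
import math
from collections import Counter

# Cosine similarity between bag-of-words vectors with raw TF weights.
def cosine(query_terms, doc_terms):
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

doc = ["air", "force", "fighter", "aircraft", "fighter"]
query = ["fighter", "aircraft"]
print(round(cosine(query, doc), 3))  # → 0.802
```

Note that only terms shared by the query and document contribute to the dot product, which is why an inverted file makes the search efficient.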
20
IR Models: Literal Term Matching (2/2)
• Hidden Markov Model (HMM)
– Also known as the Language Model (LM) approach
– A language model consisting of a set of n-gram distributions is established for each document for predicting the query
– Models can be optimized by the Maximum Likelihood Estimation (MLE) or Minimum Classification Error (MCE) training criteria
– Such approaches provide a potentially effective and theoretically attractive probabilistic framework for studying IR problems
Chen et al., “A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents,” ACM Transactions on Asian Language Information Processing 3(2), June 2004
A (unigram) document model predicts the query Q = w_1 w_2 ... w_L:

P(Q | M_D) = ∏_{i=1}^{L} [ λ P(w_i | M_D) + (1 − λ) P(w_i | M_C) ]

where M_D is the document model and M_C a collection (background) model.
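The document-model likelihood can be sketched as follows, with a hypothetical toy document and collection; `lam` plays the role of λ:

```python
from collections import Counter

# Sketch of the unigram document model with linear interpolation:
# P(Q|M_D) = prod_i [ lam * P(w_i|M_D) + (1 - lam) * P(w_i|M_C) ]
# Toy data, not the TDT collections used in the slides.
def query_likelihood(query, doc, collection, lam=0.7):
    d, c = Counter(doc), Counter(collection)
    p = 1.0
    for w in query:
        p_doc = d[w] / len(doc)        # P(w | M_D), unigram MLE
        p_col = c[w] / len(collection)  # P(w | M_C), background model
        p *= lam * p_doc + (1 - lam) * p_col
    return p

doc = ["taiwan", "air", "force", "news"]
collection = doc + ["weather", "report", "news", "sports"]
print(query_likelihood(["air", "news"], doc, collection))  # → 0.053125
```

The background model keeps the likelihood nonzero for query terms absent from the document, which is the practical point of the interpolation.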
21
MLE and MCE Estimation of Document Models
• MLE: Given a training set of query exemplars, maximize the likelihood of these query exemplars being generated by their corresponding relevant documents
– E.g., the expectation-maximization (EM) update formula of the mixture (transition) probability
• MCE: Maximize the difference between the normalized log-likelihood of the query exemplar generated by the relevant documents and that generated by top-ranked irrelevant documents
EM update of the mixture weight of document D over its set Q_R(D) of relevant training queries:

λ̂_D = (1 / |Q_R(D)|) Σ_{Q ∈ Q_R(D)} (1/|Q|) Σ_{q ∈ Q} n(q, Q) λ_D P(q | M_D) / [ λ_D P(q | M_D) + (1 − λ_D) P(q | M_C) ]

MCE misclassification measure:

E(Q, D) = −(1/|Q|) [ log P(Q | M_D) − max_{D′} log P(Q | M_{D′}) ], with D′ ranging over the top-ranked irrelevant documents
22
IR Models: Concept Matching (1/3)
• Latent Semantic Analysis (LSA)
– Start with an m×n matrix A describing the intra-/inter-document statistics between all terms and all documents
– Singular value decomposition (SVD) is then performed on the matrix to project all term and document vectors onto a reduced latent topical space
– Matching between words/queries and documents can be carried out in this latent topical space
A ≈ U Σ V^T, where A is the m×n term-document matrix (rows w_1 ... w_m, columns D_1 ... D_n), U is m×k, Σ is the k×k diagonal matrix of singular values, and V^T is k×n

sim(q̂, d̂) = cosine(q̂ Σ, d̂ Σ) = q̂ Σ² d̂^T / ( ‖q̂ Σ‖ ‖d̂ Σ‖ )
Bellegarda, “Latent semantic mapping,” IEEE Signal Processing Magazine 22(5), Sept. 2005
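A small sketch of LSA with NumPy, using a hypothetical 4-term × 3-document count matrix; the latent dimensionality k and the scaling of document vectors by the singular values follow the similarity measure above:

```python
import numpy as np

# Toy term-document matrix (hypothetical vocabulary and counts).
A = np.array([[2, 0, 1],   # term "fighter"
              [1, 0, 0],   # term "aircraft"
              [0, 3, 1],   # term "weather"
              [0, 1, 0]], dtype=float)  # term "rain"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Document vectors in the k-dimensional latent space, scaled by singular values.
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T

def latent_cos(i, j):
    a, b = doc_vecs[i], doc_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 2 share "fighter"; documents 1 and 2 share "weather".
print(latent_cos(0, 2), latent_cos(1, 2))
```

Matching in the truncated space lets a query and a document score as similar even when they share no literal terms, which is the motivation for concept matching.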
23
IR Models: Concept Matching (2/3)
• Probabilistic Latent Semantic Analysis (PLSA)
– A probabilistic framework for the above topical approach
– The relevance measure is not based on the frequency of a query term occurring in a document, but rather on the frequencies of the term and the document in the latent topics
– A query and a document thus may have a high relevance score even if they do not share any terms in common
P(w_i, D_j) = Σ_{k=1}^{K} P(w_i | T_k) P(T_k) P(D_j | T_k)

(a probabilistic analogue of the SVD decomposition, with P(w_i | T_k) playing the role of U, P(T_k) of Σ, and P(D_j | T_k) of V^T)

R(Q, D_j) = Σ_{k} P(T_k | Q) P(T_k | D_j) / ( √(Σ_k P(T_k | Q)²) √(Σ_k P(T_k | D_j)²) )
Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, Vol. 42, 2001
24
IR Models: Concept Matching (3/3)
• PLSA can also be viewed as an HMM or a topical mixture model (TMM)
– It explicitly interprets the document as a mixture model used to predict the query, which can be easily related to the conventional HMM modeling approaches widely studied in the speech processing community (topical distributions are tied among documents)
– Quite a few theoretically attractive model training algorithms can thus be applied, in supervised or unsupervised manners
P(Q | D_j) = ∏_{i=1}^{L} Σ_{k=1}^{K} P(w_i | T_k) P(T_k | D_j), for a query Q = w_1 w_2 ... w_L

(each document model D_j mixes the unigrams of K shared topics with document-specific weights P(T_k | D_j))
Chen, “Exploring the use of latent topical information for statistical Chinese spoken document retrieval,” Pattern Recognition Letters 27( 1), Jan. 2006
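The topical mixture score P(Q|D_j) = Π_i Σ_k P(w_i|T_k) P(T_k|D_j) can be sketched with hand-set (hypothetical) topic distributions:

```python
# Two toy topics (K = 2); P(w|T_k) values are hypothetical, hand-set for
# illustration rather than learned by EM as in the actual TMM approach.
p_w_given_t = {
    "election": [0.6, 0.1],
    "typhoon":  [0.1, 0.7],
}
p_t_given_d = [0.8, 0.2]  # P(T_k|D_j) for one document

def tmm_likelihood(query, p_t_d):
    p = 1.0
    for w in query:
        # Marginalize over the K latent topics for each query word.
        p *= sum(p_w_t * p_t for p_w_t, p_t in zip(p_w_given_t[w], p_t_d))
    return p

print(tmm_likelihood(["election", "typhoon"], p_t_given_d))  # → 0.11
```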
25
IR Models: Evaluation
• Experiments were conducted on the TDT2/TDT3 spoken document collections
– TDT2 for parameter tuning/training, TDT3 for evaluation
– E.g., mean average precision (mAP) tested on TDT3
• HMM/PLSA/TMM are trained in a supervised manner
• The language modeling approaches (HMM/PLSA/TMM/WTMM) yield significantly better results than the conventional statistical approaches (VSM/LSA) in the spoken document retrieval (SDR) task evaluated here
– Preliminary results also show that they are superior to SVM (Ranking-SVM) based approaches
mAP  VSM     LSA     HMM     PLSA    TMM (Supervised PLSA)  WTMM
TD   0.6505  0.6440  0.7174  0.6882  0.7870                 0.8009
SD   0.6216  0.6390  0.7156  0.6688  0.7937                 0.7858
26
Spoken Document Summarization
• Spoken document summarization, which aims to generate a summary automatically for spoken documents, is the key to better speech understanding and organization
• Extractive vs. Abstractive Summarization
– Extractive summarization selects a number of indicative sentences or paragraphs from the original document and concatenates them to form a summary
– Abstractive summarization rewrites a concise abstract that reflects the key concepts of the document
• Extractive spoken document summarization has gained much more attention in the recent past, since it can retain more of the speaker and emotion information inherent in the spoken documents
27
SDS: Proposed Framework (1/3)
• A Probabilistic Generative Framework for Sentence Selection (Ranking)
– Maximum a posteriori (MAP) criterion:

P(S_i | D) = P(D | S_i) P(S_i) / P(D) ∝ P(D | S_i) P(S_i)  (rank-equivalent)

• Sentence Generative Model, P(D | S_i)
– Each sentence of the document is treated as a probabilistic generative model
– Language Model (LM), Sentence Topical Mixture Model (STMM) and Word Topical Mixture Model (WTMM) are initially investigated
• Sentence Prior Distribution, P(S_i)
– The sentence prior distribution may have to do with sentence duration/position, correctness of sentence boundary, confidence score, prosodic information, etc. (e.g., they can be fused by the whole-sentence maximum entropy model)
Chen et al., “A probabilistic generative framework for extractive broadcast news speech summarization,” to appear in IEEE Transactions on Audio, Speech and Language Processing
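A minimal sketch of the MAP ranking criterion, combining a sentence generative score with a sentence prior in the log domain (all scores below are hypothetical):

```python
import math

# Rank sentences by log P(D|S_i) + log P(S_i):
# generative score plus sentence prior, as in the MAP criterion.
sentences = {
    "S1": {"gen": 1e-6, "prior": 0.5},
    "S2": {"gen": 5e-6, "prior": 0.2},
    "S3": {"gen": 1e-7, "prior": 0.9},
}

def map_score(s):
    return math.log(s["gen"]) + math.log(s["prior"])

ranked = sorted(sentences, key=lambda k: map_score(sentences[k]), reverse=True)
print(ranked)  # → ['S2', 'S1', 'S3']
```

Note how S3's strong prior cannot rescue its weak generative score, while S2's generative score outweighs its modest prior.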
28
SDS: Proposed Framework (2/3)
• A flowchart for the generative summarization framework
29
SDS: Proposed Framework (3/3)
• Features explored for modeling the sentence prior probability
– E.g., estimation of the relevance feature of a given spoken sentence
[Equation: the relevance feature is defined as the average similarity, avgSim, between the sentence and the documents in the set D_top-L of top-L documents retrieved using the sentence as a query.]
30
SDS: Evaluation
• Preliminary tests on 100 radio broadcast news stories collected in Taiwan (automatic transcripts with a 14.17% character error rate)
– The ROUGE-2 measure was used to evaluate the performance levels of different models
– The proposed models (LM-RT, WTMM) are consistently better than the other models at lower summarization ratios
• LM-RT and WTMM are trained in a purely unsupervised manner, without any document-summary relevance information (labeled data)
Ratio  VSM     MMR     LSA     DIM     SIG     SVM     LM-RT   WTMM
10%    0.3073  0.3073  0.3034  0.3187  0.3144  0.3425  0.3684  0.3836
20%    0.3188  0.3214  0.2926  0.3148  0.3259  0.3408  0.3696  0.3772
30%    0.3593  0.3678  0.3286  0.3383  0.3428  0.3719  0.3840  0.3728
50%    0.4485  0.4501  0.3906  0.4345  0.4666  0.4660  0.4884  0.4615
31
Spoken Document Organization (1/3)
• Each document is viewed as a TMM model used to generate itself
– Additional transitions between topical mixtures capture the topological relationships between topical classes on a 2-D map
[Figure: a two-dimensional tree structure for organized topics; each document D_j = w_1 w_2 ... w_L is generated by a document model over the topic map.]

E(T_k, T_l) = (1 / (√(2π) σ)) exp( − dist(T_k, T_l)² / (2σ²) )

P(T_l | T_k) = E(T_k, T_l) / Σ_{s=1}^{K} E(T_k, T_s)

P(w_i | D_j) = Σ_{k=1}^{K} P(T_k | D_j) Σ_{l=1}^{K} P(T_l | T_k) P(w_i | T_l)
32
Spoken Document Organization (2/3)
• Document models can be trained in an unsupervised way by maximizing the total log-likelihood of the document collection
• Initial evaluation results (conducted on the TDT2 collection)
– TMM-based approach is competitive to the conventional Self-Organization Map (SOM) approach
L_T = Σ_{j=1}^{n} Σ_{i=1}^{V} c(w_i, D_j) log P(w_i | D_j)
Model  Iterations  distBet/distWithin
TMM    10          1.9165
SOM    100         2.0604
33
Spoken Document Organization (3/3)
• Each topical class can be labeled by words selected using the following criterion
• An example map for international political news
Sig(w_i, T_k) = [ Σ_{j=1}^{n} c(w_i, D_j) P(T_k | D_j) ] / [ Σ_{j=1}^{n} c(w_i, D_j) (1 − P(T_k | D_j)) ]
34
Prototype Systems Developed at NTNU (1/3)
• Spoken Document Retrieval
http://sdr.csie.ntnu.edu.tw
35
Prototype Systems Developed at NTNU (2/3)
• Speech Retrieval and Browsing of Digital Archives
36
Prototype Systems Developed at NTNU (3/3)
• Speech-based Information Retrieval for ITS systems
– Projects supported by Taiwan Network Information Center (TWNIC)
Query: 我想找方圓五公里 賣臭豆腐的店家 (“I want to find stores selling stinky tofu within a five-kilometer radius”)
37
Conclusions and Future Work
• Multimedia information access (over the Web) using speech will be very promising in the near future
– Speech is the key for multimedia understanding and organization
– Several task domains still remain challenging
• Spoken document retrieval provides good assistance for companies, for instance, in
– Contact (call)-center conversations: monitoring agent conduct and customer satisfaction, increasing service efficiency
– Content-providing services, such as MOD (Multimedia on Demand): providing a better way to retrieve and browse desired program contents
Special Issue on Spoken Language Technology, IEEE Signal Processing Magazine, 25 (3), May 2008
Thank You!
39
References
[R1] B. Chen et al., “A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents,” ACM Transactions on Asian Language Information Processing, Vol. 3, No. 2, June 2004
[R2] J.R. Bellegarda, “Latent Semantic Mapping,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R3] K. Koumpis and S. Renals, “Content-based access to spoken audio,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R4] L.S. Lee and B. Chen, “Spoken Document Understanding and Organization,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R5] T. Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, Vol. 42, 2001
[R6] B. Chen, “Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval,” Pattern Recognition Letters, Vol. 27, No. 1, Jan. 2006
[R7] B. Chen, Y.T. Chen, "Extractive Spoken Document Summarization for Information Retrieval," Pattern Recognition Letters, Vol. 29, No. 4, March 2008
40
The Informedia System at CMU
• Video Analysis and Content Extraction (VACE)– http://www.informedia.cs.cmu.edu/
41
AT&T SCAN System
• SCAN: Speech Content Based Audio Navigator (1999)
– Design and evaluate user interfaces to support retrieval from speech archives
42
BBN Rough’n’Ready System (1/2)
• Distinguished Architecture for Audio Indexing and Retrieval (2002)
43
BBN Rough’n’Ready System (2/2)
• Automatic Structural Summarization for Broadcast News
44
SpeechBot Audio/Video Search System at HP Labs
• An experimental Web-based tool from HP Labs that used voice recognition to create searchable keyword transcripts from thousands of hours of audio content
45
NTT Speech Communication Technology for Contact Centers
– CSR: Customer Service Representative
46
Google Voice Search (1/2)
• Google Voice Local Search
http://www.google.com/goog411/
47
Google Voice Search (2/2)
• Google Audio Indexing
– Searching what people are saying inside YouTube videos (currently only for what the politicians are saying)
http://labs.google.com/gaudi
48
MIT Lecture Browser
• Retrieval and browsing of academic lectures of various categories
49
Noise Interference (1/2)
• Speech recognition performance degrades seriously when there is a mismatch between training and test acoustic conditions
– However, training systems for all noise conditions is impractical
• A Simplified Distortion Framework
– Channel effects are usually assumed to be constant while uttering
– Additive noises can be either stationary or non-stationary
[Figure: clean speech x(t) passes through a channel h(t) (convolutional noise) and is corrupted by background noise n(t) (additive noise), yielding the noisy speech]

y(t) = x(t) ⊗ h(t) + n(t)
50
Noise Interference (2/2)

• Additive environmental noises cause non-linear distortions in the speech feature space (due to logarithmic operations in feature extraction)
– When clean speech is corrupted by 10 dB subway noise, non-linear distortion effects are observed
SNR(dB) = 10 log₁₀( E_S / E_N ), with the signal energy measured in the time domain as E = (1/T) Σ_{n=0}^{T−1} s(n) s*(n)
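The energy and SNR definitions can be sketched as follows (toy Gaussian signals stand in for speech and noise):

```python
import numpy as np

# Time-domain signal energy, E = (1/T) * sum s(n) * conj(s(n)),
# and the resulting SNR in dB.
def energy(s):
    return np.mean(s * np.conj(s)).real

def snr_db(clean, noise):
    return 10.0 * np.log10(energy(clean) / energy(noise))

rng = np.random.default_rng(1)
clean = rng.normal(0, 1.0, 16000)   # toy "speech", unit variance
noise = rng.normal(0, 0.1, 16000)   # additive noise, variance 0.01
print(round(snr_db(clean, noise), 1))  # close to 20 dB
```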
51
Categorization of Robustness Approaches (1/4)
• Model Compensation
– Adapt acoustic models to match the corrupted speech features
• Feature Compensation
– Restore corrupted speech features to the corresponding clean ones
[Figure: clean speech under training conditions yields clean acoustic models; noisy speech under test conditions yields noisy acoustic models. Feature compensation operates in the feature space, while model compensation operates in the model space.]
52
Categorization of Robustness Approaches (2/4)
• Model Compensation
– Adapt acoustic models to match the corrupted speech features
– Representative approaches: MAP, MLLR, MAPLR, etc.
• Feature Compensation
– Restore corrupted speech features to the corresponding clean ones
– Representative approaches: SS, CMN, CMVN, etc.
[Figure: model compensation transforms clean models into corrupted models to match the corrupted speech features; feature compensation applies a compensation method to recover clean speech features from the corrupted ones.]
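As a sketch of one representative feature-compensation method, cepstral mean and variance normalization (CMVN) normalizes each feature dimension of an utterance to zero mean and unit variance:

```python
import numpy as np

# Cepstral mean and variance normalization (CMVN): per-utterance,
# per-dimension normalization to zero mean and unit variance.
def cmvn(features):
    """features: (num_frames, num_dims) array of cepstral features."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-8)

feats = np.random.randn(200, 13) * 3.0 + 5.0  # toy 13-dim utterance
norm = cmvn(feats)
print(norm.mean(axis=0).round(6).max(), norm.std(axis=0).round(6).max())
```

Because both clean and noisy utterances end up with the same first- and second-order statistics, the mismatch between training and test conditions is reduced.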
53
Categorization of Robustness Approaches (3/4)
• Alternatively, robustness methods can be classified into two categories according to whether they operate on the feature domain itself or on the corresponding noise-resistant distribution characteristics
– Each category has its own advantages and limitations
– Approaches from the two categories can complement each other
54
Categorization of Robustness Approaches (4/4)
Noisy Speech
Feature DomainStatistical Distribution
Cepstral Mean Normalization[Furui 1981]
Cepstral Mean/Variance Normalization[Viikki et al. 1998]
High Order Cepstral Moment Normalization
[Hsu et al. 2006]
Histogram Equalization[Molau 2003; Torre et al. 2005]
Quantile-based Histogram
Equalization[Hilger et al. 2006]
Probabilistic Optimum Filtering[Neumeyer et al. 1994]
Codeword Dependent Cepstral Normalization
[Acero 1990]
Feature Transformation
Heteroscedastic Linear Discriminant Analysis[Kumar 1997]
Kernel Linear Discriminant Analysis[Mika 1999]
Principal Component Analysis
Linear Discriminant Analysis
[Duda et al. 1973]
Heteroscedastic Discriminant Analysis[Saon et al. 2000]
Feature Compensation
Stereo-based Piecewise Linear Compensation
[Deng et al. 2000]
Stochastic Matching[Sankar et al. 1994]
Feature Normalization
Maximum Likelihood Stochastic Vector Mapping
[Wu et al. 2005]
Discriminative Stochastic Vector Mapping
Minimum Classification Error[Wu et al. 2006]
Maximum Mutual Information[Droppo et al. 2005]
Feature Reconstruction
Missing FeatureCluster-based[Raj et al. 2004]
Covariance-based[Raj et al. 2004]
55
Roots of Histogram Equalization (HEQ)
• Two basic assumptions of HEQ
– Each feature vector dimension can be normalized independently of the others (i.e., in a dimension-wise manner)
– The effect of noise distortion is a monotonic (or order-preserving), non-linear transformation of the speech feature space
• The transformed speech feature distributions of the test (or noisy) data should be identical to those of the training (or reference) data
[Figure: the HEQ transformation function maps a test feature value x with CDF C_Test(x) to the reference value y = C_Train⁻¹(C_Test(x)). A monotonic (but nonlinear) transformation preserves the ordering of feature values (e.g., −3.1, −2.5, −0.3, 1.7, 3.2, 7.6 → −2.6, −1.3, 0.2, 1.9, 4.3, 7.8), whereas random behavior breaks the ordering (→ −2.6, 0.2, 0.3, 1.9, 4.3, 7.8).]
56
Table-lookup Histogram Equalization (THEQ) (1/2)
• Because only a finite number of speech features are considered, cumulative histograms are used instead of the CDFs
– Table-lookup Histogram Equalization (THEQ)
• THEQ
– A table of {(quantile i, restored feature value)} pairs is constructed during the training phase
– To achieve good performance, the table size cannot be too small
• This requires huge disk storage
• Table lookup is also time-consuming
57
Table-lookup Histogram Equalization (THEQ) (2/2)
• Schematic diagram of THEQ
58
Quantile-based Histogram Equalization (QHEQ)
• QHEQ attempts to calibrate the CDF of each feature vector component of the test data to that of the training data in a quantile-corrective manner
– Instead of full matching of the cumulative histograms
• A parametric transformation function is used:

H(x) = Q_K [ α (x / Q_K)^γ + (1 − α)(x / Q_K) ]

• For each sentence, the optimum parameters α and γ are obtained from the quantile correction step:

(α̂, γ̂) = argmin_{α,γ} Σ_{k=1}^{K} ( H(Q_k) − Q_k^train )²

– An exhaustive online grid search is required: time-consuming
59
Polynomial-Fit Histogram Equalization (PHEQ) (1/2)
• We propose to use least squares regression for fitting the inverse function of the CDFs of training speech
– For each speech feature vector dimension of the training data, given a feature value y_i and its corresponding CDF value C_Train(y_i), a polynomial function can be expressed as follows:

ỹ_i = G(C_Train(y_i)) = Σ_{m=0}^{M} a_m [C_Train(y_i)]^m

– The corresponding squared error is

E′ = Σ_{i=1}^{N} ( y_i − Σ_{m=0}^{M} a_m [C_Train(y_i)]^m )²

– The coefficients a_m can be estimated by minimizing the squared error
60
Polynomial-Fit Histogram Equalization (PHEQ) (2/2)
• Estimation of CDF Values
– For each feature vector dimension, given data Y = {y_1, y_2, ..., y_N}, the CDF value of each frame can be estimated using the following steps
• The values y_1, ..., y_N are sorted in ascending order
• The CDF value of each frame is approximated by

C(y_i) ≈ S_pos(y_i) / N

– where S_pos(y_i) is an indicator function giving the position of y_i in the sorted data
– During recognition
• The CDF value C(y_i) of each test frame is estimated and taken as the input to the corresponding inverse function G to obtain a restored feature component
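A minimal sketch of the PHEQ idea: estimate each value's CDF by its rank position, fit a polynomial from CDF values back to the training feature values, and restore test features through that polynomial (synthetic data; a monotonic distortion stands in for noise):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)   # one clean feature dimension
test = train * 2.0 + 0.5             # monotonically distorted version

# Empirical CDF of each value: rank position / N, as in the slides.
def empirical_cdf(x):
    order = np.argsort(np.argsort(x))  # 0-based rank of each element
    return (order + 1) / len(x)

# Least-squares polynomial fit of the inverse training CDF: y ~ G(C(y)).
coeffs = np.polyfit(empirical_cdf(train), train, deg=7)

# Restoration: map each test value's empirical CDF through G.
restored = np.polyval(coeffs, empirical_cdf(test))
print(np.mean(np.abs(restored - train)))  # small restoration error
```

Because the distortion here is monotonic, the test data's ranks (and hence CDF values) match the training data's, so the polynomial maps the distorted features back near their clean values.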
61
Spatial-Temporal Feature Distribution Characteristics (STDC) (1/2)
• The feature vector components of different dimensions are not completely de-correlated
• The effect of noise distortion may not be a monotonic transformation of the speech feature space
– However, speech feature distribution characteristics are robust in the face of noise interference
• Since speech signals are slowly time-varying, the contextual relationships of consecutive speech frames might provide additional information clues for feature normalization
• We therefore propose a feature-distribution-feature transformation incorporating spatial-temporal information clues for speech compensation
62
Spatial-Temporal Feature Distribution Characteristics (STDC) (2/2)
• Schematic diagram of STDC

[Figure: the feature vector sequence ..., X_{l−K}, ..., X_{l−1}, X_l, X_{l+1}, ..., X_{l+K}, ... is mapped to the corresponding CDF vector sequence C_{l−K}, ..., C_l, ..., C_{l+K}; the spliced CDF vector S_l is transformed by a matrix A into the restored feature vector]

X̂_l = A^T S_l

The transformation matrix A is estimated by minimizing E² = Σ_{l=1}^{N} ‖X_l − X̂_l‖²
63
Comparison of Various HEQ Approaches
THEQ
– Storage requirement: large, depending on the number of reference pairs kept in the look-up table
– Computational complexity: medium, depending on the look-up table size for searching the corresponding restored value
QHEQ
– Storage requirement: small, depending on the number of quantiles for quantile correction
– Computational complexity: high, depending on the value ranges and resolutions of the parameters for online grid search
PHEQ
– Storage requirement: small, depending on the order of the polynomial functions
– Computational complexity: low, depending on the order of the polynomial function
STDC
– Storage requirement: small, depending on the frame number used for transformation
– Computational complexity: low, depending on the frame number used for transformation
64
Experimental Results
• The speech recognition experiments were conducted under various noise conditions using the Aurora-2 database and task (clean-condition training) – word error rate (%) results
Method      Set A   Set B   Set C   Average
MFCC        41.06   41.52   40.03   41.04
THEQ        22.76   21.16   23.47   22.47
QHEQ        23.53   21.90   22.36   22.64
PHEQ        20.98   20.17   21.43   20.75
STDC        18.02   16.98   18.97   17.99
MVA         18.38   16.14   21.81   18.17
ETSI        38.69   44.25   28.76   38.93
CMVN        27.73   24.60   27.17   26.37
HLDA-CMVN   21.63   21.37   21.59   21.52
65
Word Topical Mixture Model (1/4)
• In this research, each word of the language is treated as a word topical mixture model (WTMM) for predicting the occurrences of other words
• WTMM in LM Adaptation
– Each history consists of the preceding words
– The history model is treated as a composite word TMM
– The history model of a decoded word can be dynamically constructed
P(w_i | M_{w_j}) = Σ_{k=1}^{K} P(w_i | T_k) P(T_k | M_{w_j})

P(w_i | H_{w_i}) = Σ_{k=1}^{K} P(w_i | T_k) P(T_k | M_{H_{w_i}}) = Σ_{k=1}^{K} P(w_i | T_k) Σ_{j=1}^{i−1} α_j P(T_k | M_{w_j})
||||
66
Word Topical Mixture Model (2/4)
• Exploration of Training Exemplars
– Collect the words within a context window around each occurrence of a word w_j in the training corpus
– Concatenate them to form the relevant observations Q_{w_j} = {Q_{w_j,1}, Q_{w_j,2}, ..., Q_{w_j,N}} for training the word TMM M_{w_j}
– Maximize the sum of log-likelihoods of the WTMM models generating their corresponding training exemplars:

log L = Σ_{w_j} log P(Q_{w_j} | M_{w_j}) = Σ_{w_j} Σ_{Q ∈ Q_{w_j}} Σ_{w ∈ Q} n(w, Q) log P(w | M_{w_j})
jw jw
67
Word Topical Mixture Model (3/4)
• Recognition using WTMM models
– A simple linear combination of the WTMM models of the words occurring in the search history H = w_1, w_2, ..., w_{i−1}
– The weights α_1, ..., α_{i−1} are empirically set to decay exponentially as the words in the history are farther from the current decoded word w_i

[Figure: the component models P(w_i | M_{w_1}), P(w_i | M_{w_2}), ..., P(w_i | M_{w_{i−1}}) are combined with weights α_1, α_2, ..., α_{i−1} into a composite word TMM model for the search history.]
68
Word Topical Mixture Model (4/4)
• Experiments: Comparison of WTMM, PLSALM, TBLM
[Figure: perplexity (PP, roughly 450–650) and character error rate (CER, roughly 19.5–20.7%) for WTMM, PLSA, TBLM_MI and TBLM_TFIDF at four settings: 16 topics/50K pairs, 32 topics/80K pairs, 64 topics/400K pairs and 128 topics/900K pairs.]
WTMM performs slightly better than PLSALM and TBLM in CER measure
TBLM trained with MI score performs better in PP (perplexity) measure
69
SDS: Other Approaches (1/4)
• Vector Space Model (VSM)
– Vector representations of the sentences and the document to be summarized, using statistical weighting such as TF-IDF
– Sentences are ranked based on their proximity to the document
– To summarize more important and different concepts in a document
• The terms occurring in the sentence with the highest relevance score are removed from the document
• The document vector is then reconstructed and the ranking of the remaining sentences is performed accordingly
• Or, using the Maximum Marginal Relevance (MMR) model:

NextSen = argmax_{S_l} [ λ Sim(S_l, D) − (1 − λ) max_{S_i ∈ Summ} Sim(S_l, S_i) ]

– Summ: the set of already selected sentences
– Sim(S_l, D): similarity between sentence S_l and the document D
Y. Gong, SIGIR 2001
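The MMR selection loop can be sketched as follows (the similarity values are hypothetical; with λ = 0.5 the redundant sentence S2 is skipped):

```python
# MMR: at each step pick the sentence most similar to the document but
# least similar to the sentences already selected.
def mmr_select(sim_doc, sim_between, n, lam=0.5):
    selected = []
    candidates = list(sim_doc)
    while candidates and len(selected) < n:
        def score(s):
            redundancy = max((sim_between[(s, t)] for t in selected), default=0.0)
            return lam * sim_doc[s] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

sim_doc = {"S1": 0.9, "S2": 0.85, "S3": 0.4}  # similarity to the document
sim_between = {("S2", "S1"): 0.95, ("S1", "S2"): 0.95,
               ("S3", "S1"): 0.1, ("S1", "S3"): 0.1,
               ("S3", "S2"): 0.1, ("S2", "S3"): 0.1}
print(mmr_select(sim_doc, sim_between, 2))  # → ['S1', 'S3']
```

S2 scores nearly as high as S1 against the document, but its 0.95 similarity to the already-selected S1 makes it redundant, so the more novel S3 is chosen instead.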
70
SDS: Other Approaches (2/4)
• Latent Semantic Analysis (LSA)
– Construct an m×N “term-sentence” matrix for a given document
– Perform SVD on the “term-sentence” matrix: A = U Σ V^T
• The right singular vectors with larger singular values represent the dimensions of the more important latent semantic concepts in the document
• Each sentence of the document is represented as a vector in the latent semantic space
– Sentences with the largest index (element) values in each of the top L right singular vectors are included in the summary (LSA-1)
– Sentences also can be selected based on their norms (LSA-2):

Score(S_i) = √( Σ_{r=1}^{L} (v_{ir} σ_r)² )
Hirohata et al., ICASSP2005
Y. Gong, SIGIR 2001
71
SDS: Other Approaches (3/4)
• Sentence Significance Score (SenSig)
– Sentences are ranked based on their significance, which, for example, is defined by the average importance score of the words in the sentence
– Other features such as word confidence, linguistic score, or prosodic information can be further integrated into this method
SenSig(S) = (1 / N_S) Σ_{n=1}^{N_S} I(w_n), with I(w_n) = f_{w_n} log( F_C / F_{w_n} )  (similar to TF-IDF weighting)
S. Furui et al., IEEE SAP 12(4), 2004
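A small sketch of the significance score with hypothetical corpus counts (here f_w is simply taken as 1 per occurrence, and F_C as the total corpus count):

```python
import math

# Sentence significance as the average word importance I(w), with
# I(w) = f_w * log(F_C / F_w), a TF-IDF-like weighting.
corpus_freq = {"typhoon": 20, "rain": 50, "the": 5000}  # hypothetical F_w
total = sum(corpus_freq.values())                        # F_C

def importance(w, f_w):
    return f_w * math.log(total / corpus_freq[w])

def sen_sig(sentence_words):
    return sum(importance(w, 1) for w in sentence_words) / len(sentence_words)

# A content-bearing sentence outscores one made of frequent function words.
print(sen_sig(["typhoon", "rain"]) > sen_sig(["the", "the"]))  # → True
```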
72
SDS: Other Approaches (4/4)
• Classification-Based Methods
– Sentence selection is formulated as a binary classification problem
– A sentence can either be included in a summary or not
• A number of classification-based methods using statistical features have been developed
– Gaussian mixture models (GMM)
– Bayesian network classifiers (BN)
– Support vector machines (SVM)
– Logistic regression (LR)
• However, the above methods need a set of training documents together with their corresponding handcrafted summaries (or labeled data) for training the classifiers