
Discriminative Training Approaches for Continuous Speech Recognition

Jen-Wei Kuo, Shih-Hung Liu, Berlin Chen, Hsin-min Wang

Speaker : 郭人瑋

Main reference: D. Povey. “Discriminative Training for Large Vocabulary Speech Recognition,” Ph.D Dissertation, Peterhouse, University of Cambridge, July 2004


2

Statistical Speech Recognition

– In this presentation, the language model is assumed to be given in advance, while the acoustic model needs to be estimated

• HMMs (hidden Markov models) are widely adopted for acoustic modeling

[Figure: decoding flow — Speech → Feature Extraction → Acoustic Match $p(O \mid W)$ → Linguistic Decoding $P(W)$ → Recognized Sentence]

3

Expected Risk

• Let $\mathcal{W} = \{W_1, W_2, W_3, \ldots\}$ be a finite set of possible word sequences for a given observation utterance $O_r$
  – Assume that the true word sequence $W_r$ is also in $\mathcal{W}$

• Let $O_r \to W_j$ denote the action of classifying the observation sequence $O_r$ as the word sequence $W_j$
  – Let $l(W_j, W_n)$ be the loss incurred when we take such an action (and the true word sequence is actually $W_n$)

• Therefore, the (expected) risk for a specific action is

  $R(O_r \to W_j \mid O_r) = \sum_{W_n \in \mathcal{W}} l(W_j, W_n)\, P(W_n \mid O_r)$

[Duda et al. 2000]


4

Decoding: Minimum Expected Risk (1/2)

• In speech recognition, we can take the action with the minimum (expected) risk

  $W^{*} = \arg\min_{W_j} R(O_r \to W_j \mid O_r) = \arg\min_{W_j} \sum_{W_n \in \mathcal{W}} l(W_j, W_n)\, P(W_n \mid O_r)$

• If the zero-one loss function is adopted (string-level error)

  $l(W_j, W_n) = \begin{cases} 0, & \text{if } W_n = W_j \\ 1, & \text{if } W_n \neq W_j \end{cases}$

  – Then

  $\arg\min_{W_j} R(O_r \to W_j \mid O_r) = \arg\min_{W_j} \sum_{W_n \in \mathcal{W},\, W_n \neq W_j} P(W_n \mid O_r) = \arg\min_{W_j} \bigl(1 - P(W_j \mid O_r)\bigr)$

5

Decoding: Minimum Expected Risk (2/2)

– Thus,

  $\arg\min_{W_j} R(O_r \to W_j \mid O_r) = \arg\max_{W_j} P(W_j \mid O_r)$

• Select the word sequence with the maximum posterior probability (MAP decoding)

• The string editing (Levenshtein) distance can also be used as the loss function
  – Takes individual word errors into consideration
  – E.g., Minimum Bayes Risk (MBR) search/decoding [Goel et al. 2004], Word Error Minimization [Mangu et al. 2000]
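As a concrete illustration (not part of the original slides), the sketch below contrasts MAP decoding (zero-one loss) with MBR decoding (Levenshtein loss) over a toy N-best list; the hypotheses, posteriors and loss choice are invented for the example.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance between two word sequences (lists of words)."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                              # deletion
                          d[i, j - 1] + 1,                              # insertion
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))     # substitution
    return d[len(a), len(b)]

def map_decode(hyps, posteriors):
    """MAP decoding: pick the hypothesis with the highest posterior (zero-one loss)."""
    return hyps[int(np.argmax(posteriors))]

def mbr_decode(hyps, posteriors):
    """MBR decoding: pick the hypothesis with the lowest expected Levenshtein loss."""
    risks = [sum(p * edit_distance(w_j, w_n) for w_n, p in zip(hyps, posteriors))
             for w_j in hyps]
    return hyps[int(np.argmin(risks))]

# Toy 3-best list with (renormalized) posteriors P(W_n | O_r)
hyps = [["the", "cat", "sat"], ["the", "cat", "sat", "down"], ["a", "cat", "sat"]]
posteriors = [0.4, 0.35, 0.25]
print("MAP:", map_decode(hyps, posteriors))
print("MBR:", mbr_decode(hyps, posteriors))
```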


6

Training: Minimum Overall Expected Risk (1/2)

• In training, we should minimize the overall (expected) loss of the actions taken on the training utterances
  – $W_r$ is the true word sequence of $O_r$

  $R_{\text{overall}} = \int R(O_r \to W_r \mid O_r)\, p(O_r)\, dO_r$

  – The integral extends over the whole observation sequence space

• However, when only a limited number $R$ of training observation sequences is available, the overall risk can be approximated by

  $R_{\text{overall}} \approx \sum_{r=1}^{R} R(O_r \to W_r \mid O_r)\, p(O_r) = \sum_{r=1}^{R} p(O_r) \sum_{W_n \in \mathcal{W}} l(W_r, W_n)\, P(W_n \mid O_r)$

7

Training: Minimum Overall Expected Risk (2/2)

• Assume $p(O_r)$ to be uniform
  – The overall risk can then be further expressed as

  $R_{\text{overall}} \propto \sum_{r=1}^{R} \sum_{W_n \in \mathcal{W}} l(W_r, W_n)\, P(W_n \mid O_r)$

• If the zero-one loss function is adopted

  $l(W_r, W_n) = \begin{cases} 0, & \text{if } W_n = W_r \\ 1, & \text{if } W_n \neq W_r \end{cases}$

  – Then

  $R_{\text{overall}} = \sum_{r=1}^{R} \sum_{W_n \in \mathcal{W},\, W_n \neq W_r} P(W_n \mid O_r) = \sum_{r=1}^{R} \bigl(1 - P(W_r \mid O_r)\bigr)$

8

Training: Minimum Error Rate

• Minimum Error Rate (MER) estimation

  $\Lambda_{\text{MER}} = \arg\min_{\Lambda} R_{\text{overall}} = \arg\min_{\Lambda} \sum_{r=1}^{R} \bigl(1 - P_{\Lambda}(W_r \mid O_r)\bigr) = \arg\max_{\Lambda} \sum_{r=1}^{R} P_{\Lambda}(W_r \mid O_r)$

  – MER is thus equivalent to maximizing the overall posterior probability of the training transcripts (MAP)

9

Training: Maximum Likelihood (1/2)

• The objective function of Maximum Likelihood (ML) estimation can be obtained if Jensen's inequality is further applied
  – Maximize the overall log-posterior of all training utterances

  Since $\log x \le x - 1$, we have $1 - P_{\Lambda}(W_r \mid O_r) \le -\log P_{\Lambda}(W_r \mid O_r)$, so

  $R_{\text{overall}} = \sum_{r=1}^{R} \bigl(1 - P_{\Lambda}(W_r \mid O_r)\bigr) \le -\sum_{r=1}^{R} \log P_{\Lambda}(W_r \mid O_r)$   (minimize the upper bound)

  $\arg\max_{\Lambda} \sum_{r} \log P_{\Lambda}(W_r \mid O_r) = \arg\max_{\Lambda} \sum_{r} \log \frac{p_{\Lambda}(O_r \mid W_r)\, P(W_r)}{p(O_r)} = \arg\max_{\Lambda} \sum_{r} \log p_{\Lambda}(O_r \mid W_r)$

  – $p(O_r)$ is assumed uniform and $P(W_r)$ is independent of $\Lambda$

  $F_{\text{ML}}(\Lambda) = \sum_{r=1}^{R} \log p_{\Lambda}(O_r \mid W_r)$

[Schlüter 2000]

10

Training: Maximum Likelihood (2/2)

• On the other hand, the discriminative training approaches attempt to optimize the correctness of the model set by formulating an objective function that in some way penalizes the model parameters that are liable to confuse correct and incorrect answers

• MLE can thus be considered as a derivation from the overall log-posterior


11

Training: Maximum Mutual Information (1/3)

• The objective function can be defined as the sum of the pointwise mutual information of all training utterances and their associated true word sequences

  $F_{\text{MMI}}(\Lambda) = \sum_{r=1}^{R} \text{MI}_{\Lambda}(O_r, W_r) = \sum_{r=1}^{R} \log \frac{p_{\Lambda}(O_r, W_r)}{p_{\Lambda}(O_r)\, P(W_r)} = \sum_{r=1}^{R} \log \frac{p_{\Lambda}(O_r \mid W_r)}{p_{\Lambda}(O_r)} = \sum_{r=1}^{R} \log \frac{p_{\Lambda}(O_r \mid W_r)}{\sum_{W_k \in \mathcal{W}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)}$

• The maximum mutual information (MMI) estimation tries to find a new parameter set $\hat{\Lambda}$ that maximizes the above objective function

[Bahl et al. 1986]


12

Training: Maximum Mutual Information (2/3)

• An alternative derivation based on the overall expected risk criterion
  – Which is equivalent to the maximization of the overall log-posterior of all training utterances

  $\arg\max_{\Lambda} \sum_{r} \log P_{\Lambda}(W_r \mid O_r) = \arg\max_{\Lambda} \sum_{r} \log \frac{p_{\Lambda}(O_r \mid W_r)\, P(W_r)}{p_{\Lambda}(O_r)} = \arg\max_{\Lambda} \sum_{r} \log \frac{p_{\Lambda}(O_r \mid W_r)\, P(W_r)}{\sum_{W_k \in \mathcal{W}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)}$

  $F_{\text{MMI}}(\Lambda) = \sum_{r=1}^{R} \log \frac{p_{\Lambda}(O_r \mid W_r)\, P(W_r)}{\sum_{W_k \in \mathcal{W}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)}$

  – The language model probabilities $P(W)$ are independent of $\Lambda$
  – The denominator sum over $\mathcal{W}$ includes the true word sequence $W_r$
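For illustration only, here is a small sketch of the per-utterance MMI criterion computed from a list of hypothesis scores; the acoustic scaling factor `kappa` and all score values are assumptions made for the example, not values from the slides.

```python
import numpy as np

def mmi_objective(ref_idx, log_acoustic, log_lm, kappa=1.0):
    """
    Toy per-utterance MMI criterion:
    F = log [ p(O|W_ref)^kappa P(W_ref) / sum_k p(O|W_k)^kappa P(W_k) ],
    i.e. the log-posterior of the reference among the competing hypotheses.
    log_acoustic / log_lm hold per-hypothesis log p(O|W_k) and log P(W_k).
    """
    scores = kappa * np.asarray(log_acoustic) + np.asarray(log_lm)
    log_den = np.logaddexp.reduce(scores)          # log of the denominator sum
    return scores[ref_idx] - log_den

# Toy 3-hypothesis list for one utterance; hypothesis 0 is the reference
log_acoustic = [-100.0, -101.5, -103.0]
log_lm = [np.log(0.5), np.log(0.3), np.log(0.2)]
print("per-utterance MMI objective:", mmi_objective(0, log_acoustic, log_lm))
```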


13

Training: Maximum Mutual Information (3/3)

• When we maximize the MMIE objective function
  – Not only can the probability of the true word sequence (the numerator, as in the MLE objective function) be increased, but the probabilities of the other possible word sequences (the denominator) can also be decreased

  – Thus, MMIE attempts to make the correct hypothesis more probable, while at the same time it also attempts to make incorrect hypotheses less probable

• MMIE can also be considered as a derivation from the overall log-posterior


14

Training: Minimum Classification Error (1/2)

• The misclassification measure is defined as

  $d_r(O_r) = -\log p_{\Lambda}(O_r \mid W_r) + \log \Bigl[ \frac{1}{C_r} \sum_{W_k \in \mathcal{W},\, W_k \neq W_r} p_{\Lambda}(O_r \mid W_k) \Bigr]$,  where $C_r = |\mathcal{W}| - 1$

• Minimization of the overall misclassification measure is similar to MMIE when the language model is assumed to be uniformly distributed (i.e., $P(W_k) = 1/(C_r + 1)$):

  $-F_{\text{MMI}}(\Lambda) = \sum_{r=1}^{R} \Bigl\{ -\log p_{\Lambda}(O_r \mid W_r) + \log \Bigl[ \frac{1}{C_r + 1} \sum_{W_k \in \mathcal{W}} p_{\Lambda}(O_r \mid W_k) \Bigr] \Bigr\}$

[Chou 2000]


15

Training: Minimum Classification Error (2/2)

• Embed a sigmoid (loss) function to smooth the misclassification measure

  $l\bigl(d_r(O_r)\bigr) = \frac{1}{1 + e^{-\gamma\, d_r(O_r)}}$

• Let $\gamma = 1$ and $C_r = 1$; then we have

  $l\bigl(d_r(O_r)\bigr) = \frac{\sum_{W_k \in \mathcal{W},\, W_k \neq W_r} p_{\Lambda}(O_r \mid W_k)}{\sum_{W_k \in \mathcal{W}} p_{\Lambda}(O_r \mid W_k)} = 1 - P_{\Lambda}(W_r \mid O_r)$   (with a uniform language model)

• Minimization of the overall loss directly minimizes the (classification) error rate, so MCE can be regarded as a derivation from MER
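A toy sketch of the sigmoid-smoothed MCE loss for a single utterance; the score values are invented, and the set of competing hypotheses plays the role of $\mathcal{W} \setminus \{W_r\}$.

```python
import numpy as np

def mce_loss(ref_idx, log_likelihoods, gamma=1.0):
    """
    Sigmoid-smoothed MCE loss for one utterance (toy illustration).
    d = -log p(O|W_ref) + log( average competing likelihood )
    loss = sigmoid(gamma * d), which approaches the 0/1 classification error.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    competitors = np.delete(ll, ref_idx)
    # log of the average competing likelihood, computed stably in the log domain
    log_avg_comp = np.logaddexp.reduce(competitors) - np.log(len(competitors))
    d = -ll[ref_idx] + log_avg_comp
    return 1.0 / (1.0 + np.exp(-gamma * d))

# Reference hypothesis scores well -> loss close to 0
print(mce_loss(0, [-100.0, -104.0, -105.0]))
# Reference scores poorly -> loss close to 1
print(mce_loss(0, [-110.0, -104.0, -105.0]))
```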


16

Training: Minimum Phone Error

• The objective function of Minimum Phone Error (MPE) is directly derived from the overall expected risk criterion
  – Replace the loss function $l(W_i, W_r)$ with the so-called accuracy function $A(W_i, W_r)$

  $F_{\text{MPE}}(\Lambda) = \sum_{r=1}^{R} \sum_{W_i \in \mathcal{W}} P_{\Lambda}(W_i \mid O_r)\, A(W_i, W_r) = \sum_{r=1}^{R} \frac{\sum_{W_i \in \mathcal{W}} p_{\Lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)}{\sum_{W_k \in \mathcal{W}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)}$

• MPE tries to maximize the expected (phone or word) accuracy of all possible word sequences (generated by the recognizer) regarding the training utterances

[Povey 2004]
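A small sketch of the MPE criterion for one utterance, computed over an N-best list instead of a lattice; the hypothesis scores, accuracies (e.g., number of correct phones) and the acoustic scale `kappa` are invented assumptions for illustration.

```python
import numpy as np

def mpe_objective(log_acoustic, log_lm, accuracies, kappa=1.0):
    """
    Toy MPE criterion for one utterance over an N-best list:
    F = sum_i P(W_i | O) * A(W_i, W_ref),
    with the posterior computed from scaled acoustic and LM scores.
    `accuracies` holds the (phone or word) accuracy of each hypothesis
    against the reference transcript.
    """
    scores = kappa * np.asarray(log_acoustic) + np.asarray(log_lm)
    posteriors = np.exp(scores - np.logaddexp.reduce(scores))  # softmax over hypotheses
    return float(np.dot(posteriors, accuracies))

# Toy example: 3 hypotheses with 5, 4 and 3 correct phones out of 5
log_acoustic = [-100.0, -101.0, -102.5]
log_lm = [np.log(0.5), np.log(0.3), np.log(0.2)]
accuracies = [5.0, 4.0, 3.0]
print("expected phone accuracy:", mpe_objective(log_acoustic, log_lm, accuracies))
```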


17

Objective Function Optimization

• The objective function has the “latent variable” problem, such that it cannot be directly optimized
  ⇒ Iterative optimization

• Gradient-based approaches
  – E.g., MCE

• Expectation Maximization (EM)
  – Strong-sense auxiliary function
    • E.g., MLE
  – Weak-sense auxiliary function
    • E.g., MMIE, MPE


18

Three Steps for EM

• Step 1. Draw a lower bound
  – Use Jensen's inequality

• Step 2. Find the best lower bound (auxiliary function)
  – Let the lower bound touch the objective function at the current guess

• Step 3. Maximize the auxiliary function
  – Obtain the new guess
  – Go to Step 2 until convergence

[Minka 1998]


19

Step 1. Draw a lower bound (1/3)

[Figure: the objective function $F(\Lambda)$ and the current guess $\Lambda^{\text{old}}$]

20

Step 1. Draw a lower bound (2/3)

[Figure: a lower bound function $g(\Lambda) \le F(\Lambda)$ drawn under the objective function $F(\Lambda)$]

21

Step 1. Draw a lower bound (3/3)

  $\log p(O \mid \Lambda) = \log \sum_{S} p(O, S \mid \Lambda) = \log \sum_{S} p(S)\, \frac{p(O, S \mid \Lambda)}{p(S)} \ge \sum_{S} p(S) \log \frac{p(O, S \mid \Lambda)}{p(S)}$

  – Apply Jensen's inequality; $p(S)$ is an arbitrary probability distribution over the latent variable $S$
  – This gives infinitely many lower bound functions of $F(\Lambda) = \log p(O \mid \Lambda)$, one for each choice of $p(S)$

22

Step 2. Find the best lower bound (1/4)

[Figure: the objective function $F(\Lambda)$ and a lower bound function $g(\Lambda, \Lambda^{\text{old}})$]

23

Step 2. Find the best lower bound (2/4)

  – Let the lower bound touch the objective function at the current guess $\Lambda^{\text{old}}$
  – Find the best $p(S)$ at $\Lambda = \Lambda^{\text{old}}$

  We want to maximize $\sum_{S} p(S) \log \frac{p(O, S \mid \Lambda^{\text{old}})}{p(S)}$ w.r.t. $p(S)$ at $\Lambda = \Lambda^{\text{old}}$

  The best $p(S)$:  $\hat{p}(S) = \arg\max_{p(S)} \sum_{S} p(S) \log \frac{p(O, S \mid \Lambda^{\text{old}})}{p(S)}$

24

Step 2. Find the best lower bound (3/4)

  We introduce a Lagrange multiplier $\lambda$ to enforce the constraint $\sum_{S} p(S) = 1$:

  $\mathcal{L} = \sum_{S} p(S) \log \frac{p(O, S \mid \Lambda^{\text{old}})}{p(S)} + \lambda \Bigl( \sum_{S} p(S) - 1 \Bigr)$

  Differentiating w.r.t. $p(S)$ and setting the result to zero:

  $\log p(O, S \mid \Lambda^{\text{old}}) - \log p(S) - 1 + \lambda = 0 \;\Rightarrow\; p(S) = p(O, S \mid \Lambda^{\text{old}})\, e^{\lambda - 1}$

  Enforcing $\sum_{S} p(S) = 1$ gives $e^{1-\lambda} = \sum_{S} p(O, S \mid \Lambda^{\text{old}})$, so

  $\hat{p}(S) = \frac{p(O, S \mid \Lambda^{\text{old}})}{\sum_{S'} p(O, S' \mid \Lambda^{\text{old}})} = P(S \mid O, \Lambda^{\text{old}})$

25

Step 2. Find the best lower bound (4/4)

  Define the best lower bound, i.e., the auxiliary function for $F(\Lambda)$:

  $g(\Lambda, \Lambda^{\text{old}}) = \sum_{S} P(S \mid O, \Lambda^{\text{old}}) \log \frac{p(O, S \mid \Lambda)}{P(S \mid O, \Lambda^{\text{old}})} = \underbrace{\sum_{S} P(S \mid O, \Lambda^{\text{old}}) \log p(O, S \mid \Lambda)}_{\text{Q function}} - \sum_{S} P(S \mid O, \Lambda^{\text{old}}) \log P(S \mid O, \Lambda^{\text{old}})$

  We can also check that the bound touches the objective function at the current guess:

  $g(\Lambda^{\text{old}}, \Lambda^{\text{old}}) = \sum_{S} P(S \mid O, \Lambda^{\text{old}}) \log p(O \mid \Lambda^{\text{old}}) = \log p(O \mid \Lambda^{\text{old}}) = F(\Lambda^{\text{old}})$
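As a numeric check (illustrative, not from the slides), the sketch below verifies that choosing $p(S) = P(S \mid O, \Lambda^{\text{old}})$ makes the Jensen lower bound touch the objective, using a toy two-component Gaussian mixture as the latent-variable model; all parameter values are invented.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (np.sqrt(2 * np.pi) * std)

weights = np.array([0.3, 0.7])            # current guess of the mixture weights
means = np.array([-1.0, 2.0])
stds = np.array([1.0, 1.0])
o = 0.5                                   # one observation

joint = weights * gauss_pdf(o, means, stds)   # p(O, S | Lambda_old) for S = 0, 1
objective = np.log(joint.sum())               # F(Lambda_old) = log p(O | Lambda_old)

def lower_bound(p_s):
    """Jensen lower bound: sum_S p(S) * log( p(O, S | Lambda_old) / p(S) )."""
    p_s = np.asarray(p_s, dtype=float)
    return float(np.sum(p_s * (np.log(joint) - np.log(p_s))))

posterior = joint / joint.sum()               # P(S | O, Lambda_old)
print("objective F(Lambda_old)   :", objective)
print("bound at p(S) = posterior :", lower_bound(posterior))    # touches F
print("bound at p(S) = [0.5,0.5] :", lower_bound([0.5, 0.5]))   # strictly below F
```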


26

Step 3. Maximize the auxiliary function (1/3)

[Figure: maximizing the auxiliary function $g(\Lambda, \Lambda^{\text{old}})$ yields the new guess $\Lambda^{\text{NEW}}$]

27

Step 3. Maximize the auxiliary function (2/3)

[Figure: the objective function at the new guess satisfies $F(\Lambda^{\text{NEW}}) \ge F(\Lambda^{\text{old}})$]

28

Step 3. Maximize the auxiliary function (3/3)

[Figure: the objective function $F(\Lambda)$ evaluated at the new guess $\Lambda^{\text{NEW}}$, which becomes the current guess for the next iteration]

29

Step 2. Find the best lower bound (next iteration)

[Figure: a new auxiliary function $g(\Lambda, \Lambda^{\text{old}})$ is built around the updated current guess]

30

Step 3. Maximize the auxiliary function (next iteration)

[Figure: each iteration maximizes the new auxiliary function, so $F(\Lambda^{\text{NEW}}) \ge F(\Lambda^{\text{old}})$ at every step]

31

Strong-sense Auxiliary Function

• $g(\Lambda, \Lambda^{\text{old}})$ is said to be a strong-sense auxiliary function for $F(\Lambda)$ around $\Lambda^{\text{old}}$, iff

  $g(\Lambda, \Lambda^{\text{old}}) - g(\Lambda^{\text{old}}, \Lambda^{\text{old}}) \le F(\Lambda) - F(\Lambda^{\text{old}})$

[Povey et al. 2003]


32

Weak-sense Auxiliary Function (1/5)

• $g(\Lambda, \Lambda^{\text{old}})$ is said to be a weak-sense auxiliary function for $F(\Lambda)$ around $\Lambda^{\text{old}}$, iff

  $\left.\frac{\partial g(\Lambda, \Lambda^{\text{old}})}{\partial \Lambda}\right|_{\Lambda = \Lambda^{\text{old}}} = \left.\frac{\partial F(\Lambda)}{\partial \Lambda}\right|_{\Lambda = \Lambda^{\text{old}}}$


33

Weak-sense Auxiliary Function (2/5)

[Figure: $g(\Lambda, \Lambda^{\text{old}})$ and $F(\Lambda)$ have the same differential around $\Lambda^{\text{old}}$]

34

Weak-sense Auxiliary Function (3/5)

[Figure: maximizing the weak-sense auxiliary function $g(\Lambda, \Lambda^{\text{old}})$ gives a new guess $\Lambda^{\text{NEW}}$]

35

Weak-sense Auxiliary Function (4/5)

[Figure: with a weak-sense auxiliary function, $F(\Lambda^{\text{NEW}}) \ge F(\Lambda^{\text{old}})$ may not hold]

36

Weak-sense Auxiliary Function (5/5)

• $g_{\text{sm}}(\Lambda, \Lambda^{\text{old}})$ is said to be a smooth function for $F(\Lambda)$ around $\Lambda^{\text{old}}$, iff

  $g_{\text{sm}}(\Lambda, \Lambda^{\text{old}}) \le g_{\text{sm}}(\Lambda^{\text{old}}, \Lambda^{\text{old}})$   (i.e., it has its maximum at $\Lambda = \Lambda^{\text{old}}$)

• Adding such a term can
  – Speed up convergence
  – Provide a more stable estimate


37

Smooth Function (1/2)

[Figure: a smooth function $g_{\text{sm}}(\Lambda, \Lambda^{\text{old}})$ with its maximum at $\Lambda^{\text{old}}$, shown against the objective function $F(\Lambda)$]

38

Smooth Function (2/2)

[Figure: the sum $g(\Lambda, \Lambda^{\text{old}}) + g_{\text{sm}}(\Lambda, \Lambda^{\text{old}})$ is also a weak-sense auxiliary function for $F(\Lambda)$; maximizing it gives the new guess $\Lambda^{\text{NEW}}$]

39

MPE: Discrimination

• The MPE objective function is less sensitive to portions of the training data that are poorly transcribed

• A (word) lattice structure $\mathcal{W}_r^{\text{lat}}$ can be used here to approximate the set $\mathcal{W}$ of all possible word sequences of each training utterance
  – Training statistics can be efficiently computed via such a structure

40

MPE: Auxiliary Function (1/2)

• The weak-sense auxiliary function for MPE model updating can be defined as

  $H_{\text{MPE}}(\Lambda, \Lambda^{\text{old}}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \left.\frac{\partial F_{\text{MPE}}(\Lambda)}{\partial \log p_{\Lambda}(O_r \mid q)}\right|_{\Lambda = \Lambda^{\text{old}}} \log p_{\Lambda}(O_r \mid q)$

  – $\partial F_{\text{MPE}} / \partial \log p_{\Lambda}(O_r \mid q)$ is a scalar value (a constant) calculated for each phone arc $q$, and can be either positive or negative (because of the accuracy function)

  – Since $H_{\text{MPE}}$ and $F_{\text{MPE}}$ have the same differential at $\Lambda = \Lambda^{\text{old}}$ (chain rule), $H_{\text{MPE}}$ is a weak-sense auxiliary function for $F_{\text{MPE}}$

  – The auxiliary function can also be decomposed as

  $H_{\text{MPE}}(\Lambda, \Lambda^{\text{old}}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \max\Bigl(0, \tfrac{\partial F_{\text{MPE}}}{\partial \log p_{\Lambda}(O_r \mid q)}\Bigr) \log p_{\Lambda}(O_r \mid q) \;-\; \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \max\Bigl(0, -\tfrac{\partial F_{\text{MPE}}}{\partial \log p_{\Lambda}(O_r \mid q)}\Bigr) \log p_{\Lambda}(O_r \mid q)$

  – Arcs with positive contributions form the so-called numerator; arcs with negative contributions form the so-called denominator

  – The terms $\log p_{\Lambda}(O_r \mid q)$ still have the “latent variable” problem

41

MPE: Auxiliary Function (2/2)

• The auxiliary function can be modified by considering the normal (ML, strong-sense) auxiliary function $Q_{\text{ML}}(\Lambda, \Lambda^{\text{old}}; r, q)$ for each $\log p_{\Lambda}(O_r \mid q)$:

  $H_{\text{MPE}}(\Lambda, \Lambda^{\text{old}}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \left.\frac{\partial F_{\text{MPE}}}{\partial \log p_{\Lambda}(O_r \mid q)}\right|_{\Lambda^{\text{old}}} Q_{\text{ML}}(\Lambda, \Lambda^{\text{old}}; r, q)$

  – The smoothing term is not added yet here

• The key quantity (statistics value) required in MPE training is $\left.\partial F_{\text{MPE}} / \partial \log p_{\Lambda}(O_r \mid q)\right|_{\Lambda^{\text{old}}}$, which can be termed $\gamma_q^{\text{MPE}}$

42

MPE: Statistics Accumulation (1/2)

• The objective function can be expressed as (for differentiation with respect to a specific phone arc $q$)

  $F_{\text{MPE}}(\Lambda) = \sum_{r=1}^{R} \frac{a_r}{b_r}$,  where  $a_r = \sum_{W_i \in \mathcal{W}_r^{\text{lat}}} p_{\Lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)$  and  $b_r = \sum_{W_k \in \mathcal{W}_r^{\text{lat}}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)$

• The differential with respect to a phone arc $q$ of utterance $r$ can be expressed (by the quotient rule, noting that $\partial p_{\Lambda}(O_r \mid W_i) / \partial \log p_{\Lambda}(O_r \mid q) = p_{\Lambda}(O_r \mid W_i)$ if $q \in W_i$ and $0$ otherwise) as

  $\frac{\partial F_{\text{MPE}}}{\partial \log p_{\Lambda}(O_r \mid q)} = \frac{\Bigl(\sum_{W_i \ni q} p_{\Lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)\Bigr)\, b_r \;-\; a_r \Bigl(\sum_{W_k \ni q} p_{\Lambda}(O_r \mid W_k)\, P(W_k)\Bigr)}{b_r^{2}}$

43

MPE: Statistics Accumulation (2/2)

  Rearranging the differential gives

  $\gamma_q^{\text{MPE}} = \frac{\partial F_{\text{MPE}}}{\partial \log p_{\Lambda}(O_r \mid q)} = \gamma_q^{r}\,\bigl(c_r(q) - c_{\text{avg}}^{r}\bigr)$

  where

  $\gamma_q^{r} = \frac{\sum_{W_i \in \mathcal{W}_r^{\text{lat}},\, W_i \ni q} p_{\Lambda}(O_r \mid W_i)\, P(W_i)}{\sum_{W_k \in \mathcal{W}_r^{\text{lat}}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)}$
  – The likelihood (posterior occupancy) of the arc $q$

  $c_r(q) = \frac{\sum_{W_i \ni q} p_{\Lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)}{\sum_{W_i \ni q} p_{\Lambda}(O_r \mid W_i)\, P(W_i)}$
  – The average accuracy of the sentences passing through the arc $q$

  $c_{\text{avg}}^{r} = \frac{\sum_{W_k \in \mathcal{W}_r^{\text{lat}}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)\, A(W_k, W_r)}{\sum_{W_k \in \mathcal{W}_r^{\text{lat}}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)}$
  – The average accuracy of all the sentences in the word graph

44

MPE: Accuracy Function (1/4)

• $c_r(q)$ and $c_{\text{avg}}^{r}$ can be calculated approximately using the word graph and the Forward-Backward algorithm

• Note that the exact accuracy function is expressed as the sum of the phone-level accuracies over all phones $q$ in the hypothesis, e.g.

  $A(W_i, W_r) = \sum_{q \in W_i} A(q)$,   $A(q) = \begin{cases} 1, & \text{if correct phone} \\ 0, & \text{if substitution} \\ -1, & \text{if insertion} \end{cases}$

  – However, such accuracy is obtained by a full alignment between the true word sequence and all possible word sequences, which is computationally expensive

45

MPE: Accuracy Function (2/4)

• An approximated phone accuracy is defined as

  $A(q) = \max_{z} \begin{cases} -1 + 2\,e(q, z), & \text{if } z \text{ and } q \text{ are the same phone} \\ -1 + e(q, z), & \text{if } z \text{ and } q \text{ are different phones} \end{cases}$

  – $e(q, z)$: the ratio of the portion of $z$ that is overlapped by $q$ (where $z$ ranges over the phones of the true word sequence)

  1. Assume the true word sequence has no pronunciation variation
  2. Phone accuracy can be obtained by a simple local search
  3. Context-independent phones can be used for accuracy calculation
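A short sketch of this approximated phone accuracy; the reference alignment, phone labels and time intervals are invented for the example.

```python
def overlap_ratio(q_interval, z_interval):
    """e(q, z): fraction of reference phone z = (start, end) covered by arc q."""
    (qs, qe), (zs, ze) = q_interval, z_interval
    overlap = max(0.0, min(qe, ze) - max(qs, zs))
    return overlap / (ze - zs)

def approx_phone_accuracy(q_label, q_interval, ref_phones):
    """
    Approximated MPE phone accuracy for one arc:
    max over reference phones z of (-1 + 2*e) if same phone, (-1 + e) otherwise.
    ref_phones: list of (label, (start, end)) for the reference alignment.
    """
    best = float("-inf")
    for z_label, z_interval in ref_phones:
        e = overlap_ratio(q_interval, z_interval)
        score = -1.0 + (2.0 * e if z_label == q_label else e)
        best = max(best, score)
    return best

# Reference alignment (times in seconds, invented)
ref = [("sil", (0.0, 0.2)), ("w", (0.2, 0.35)), ("eh", (0.35, 0.6))]
print(approx_phone_accuracy("w", (0.18, 0.34), ref))   # mostly overlaps the correct "w"
print(approx_phone_accuracy("ih", (0.35, 0.6), ref))   # wrong phone over "eh"
```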


46

MPE: Accuracy Function (3/4)

• Forward-Backward algorithm for statistics calculation
  – Use the “phone graph” (lattice) as the vehicle

  Forward pass (α: arc likelihood score, α′: partial average accuracy):

    for each phone arc q with start time 0:
        α_q  = p(O_q | q)
        α′_q = A(q)
    for t = 1 to T-1:
        for each phone arc q with start time t:
            α_q  = [ Σ_{r: end(r) = t-1, r can connect to q} α_r · p(q|r) ] · p(O_q | q)
            α′_q = [ Σ_{r: end(r) = t-1, r can connect to q} α′_r · α_r · p(q|r) ] / [ Σ_{r: end(r) = t-1, r can connect to q} α_r · p(q|r) ] + A(q)

47

MPE: Accuracy Function (4/4)

  Backward pass (β: arc likelihood score, β′: partial average accuracy):

    for each phone arc q with end time T-1:
        β_q  = 1
        β′_q = 0
    for t = T-2 down to 0:
        for each phone arc q with end time t:
            β_q  = Σ_{r: start(r) = t+1, q can connect to r} p(r|q) · p(O_r | r) · β_r
            β′_q = [ Σ_{r: start(r) = t+1, q can connect to r} p(r|q) · p(O_r | r) · β_r · (β′_r + A(r)) ] / β_q

  Combination (for each phone arc q):

    γ_q     = α_q · β_q / [ Σ_{q': end(q') = T-1} α_{q'} ]        (the likelihood of the arc q)
    c(q)    = α′_q + β′_q                                          (the average accuracy of sentences passing through the arc q)
    c_avg   = [ Σ_{q': end(q') = T-1} α_{q'} · α′_{q'} ] / [ Σ_{q': end(q') = T-1} α_{q'} ]   (the average accuracy of all sentences in the word graph)
    γ_q^MPE = γ_q · ( c(q) − c_avg )
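Below is a minimal Python sketch of the forward-backward recursions reconstructed above, run in the probability domain on a toy three-arc lattice (a real implementation would work with log or scaled likelihoods and much larger graphs); the arc names, likelihoods, accuracies and transition probabilities are all invented for illustration.

```python
from collections import defaultdict

def mpe_arc_gammas(arcs, edges, T):
    """
    Toy forward-backward over a phone-arc lattice: computes gamma_q (arc occupancy),
    c(q) (average accuracy of paths through q), c_avg (average accuracy of all paths)
    and gamma_q^MPE = gamma_q * (c(q) - c_avg).

    arcs  : {q: {"start": int, "end": int, "lik": float, "acc": float}}
    edges : {(r, q): p(q | r)} for arcs r that can connect to q
    T     : number of frames; final arcs end at T - 1.
    Assumes every non-initial arc has at least one predecessor.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for (r, q), p in edges.items():
        succ[r].append((q, p))
        pred[q].append((r, p))

    alpha, alpha_p = {}, {}            # forward likelihood / partial accuracy
    for q in sorted(arcs, key=lambda a: arcs[a]["start"]):
        node = arcs[q]
        if node["start"] == 0:
            alpha[q], alpha_p[q] = node["lik"], node["acc"]
        else:
            inc = [(alpha[r] * p, alpha_p[r]) for r, p in pred[q]]
            tot = sum(w for w, _ in inc)
            alpha[q] = tot * node["lik"]
            alpha_p[q] = sum(w * ap for w, ap in inc) / tot + node["acc"]

    beta, beta_p = {}, {}              # backward counterparts
    for q in sorted(arcs, key=lambda a: arcs[a]["end"], reverse=True):
        node = arcs[q]
        if node["end"] == T - 1:
            beta[q], beta_p[q] = 1.0, 0.0
        else:
            inc = [(p * arcs[r]["lik"] * beta[r], beta_p[r] + arcs[r]["acc"])
                   for r, p in succ[q]]
            tot = sum(w for w, _ in inc)
            beta[q] = tot
            beta_p[q] = sum(w * bp for w, bp in inc) / tot

    finals = [q for q in arcs if arcs[q]["end"] == T - 1]
    total_lik = sum(alpha[q] for q in finals)
    c_avg = sum(alpha[q] * alpha_p[q] for q in finals) / total_lik

    stats = {}
    for q in arcs:
        gamma_q = alpha[q] * beta[q] / total_lik
        c_q = alpha_p[q] + beta_p[q]
        stats[q] = {"gamma": gamma_q, "c": c_q, "gamma_mpe": gamma_q * (c_q - c_avg)}
    return stats, c_avg

# Toy lattice: two competing arcs ("w" vs "ih") followed by a shared arc ("eh")
arcs = {
    "w":  {"start": 0, "end": 1, "lik": 0.6, "acc": 1.0},
    "ih": {"start": 0, "end": 1, "lik": 0.3, "acc": -0.2},
    "eh": {"start": 2, "end": 3, "lik": 0.5, "acc": 1.0},
}
edges = {("w", "eh"): 1.0, ("ih", "eh"): 1.0}
stats, c_avg = mpe_arc_gammas(arcs, edges, T=4)
print("c_avg:", c_avg)        # average accuracy of all lattice paths
print("arc w:", stats["w"])   # the better-scoring correct arc gets a positive gamma_mpe
print("arc ih:", stats["ih"]) # the competing arc gets a negative gamma_mpe
```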


48

MPE: Smoothing Function

• The smoothing function can be defined as

  $H_{\text{sm}}(\Lambda, \Lambda^{\text{old}}) = \sum_{m} D_m \Bigl( -\tfrac{1}{2} \bigl[ \log \lvert \Sigma_m \rvert + (\mu_m^{\text{old}} - \mu_m)^{\mathsf{T}} \Sigma_m^{-1} (\mu_m^{\text{old}} - \mu_m) + \operatorname{tr}(\Sigma_m^{-1} \Sigma_m^{\text{old}}) \bigr] \Bigr)$

  – The old model parameters ($\mu_m^{\text{old}}, \Sigma_m^{\text{old}}$) are used here as the hyper-parameters, and $D_m$ is a per-Gaussian constant

  – It has a maximum value at $(\mu_m, \Sigma_m) = (\mu_m^{\text{old}}, \Sigma_m^{\text{old}})$: setting $\partial H_{\text{sm}} / \partial \mu_m = D_m \Sigma_m^{-1} (\mu_m^{\text{old}} - \mu_m) = 0$ and $\partial H_{\text{sm}} / \partial \Sigma_m^{-1} = 0$ recovers the old parameters

49

MPE: Final Auxiliary Function (1/2)

• Start from the weak-sense auxiliary function for $F_{\text{MPE}}(\Lambda)$:

  $H_{\text{MPE}}(\Lambda, \Lambda^{\text{old}}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \gamma_q^{\text{MPE}} \log p_{\Lambda}(O_r \mid q)$   (weak-sense auxiliary function)

• Replace each $\log p_{\Lambda}(O_r \mid q)$ with its normal EM-style auxiliary function over the Gaussian components (a strong-sense auxiliary function for that term):

  $\sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \gamma_q^{\text{MPE}} \sum_{m} \sum_{t=s_q}^{e_q} \gamma_{qm}^{r}(t) \log \mathcal{N}\bigl(o_r(t); \mu_m, \Sigma_m\bigr)$   (still a weak-sense auxiliary function for $F_{\text{MPE}}$)

• Add the smoothing function:

  $+\; \sum_{m} D_m \Bigl( -\tfrac{1}{2} \bigl[ \log \lvert \Sigma_m \rvert + (\mu_m^{\text{old}} - \mu_m)^{\mathsf{T}} \Sigma_m^{-1} (\mu_m^{\text{old}} - \mu_m) + \operatorname{tr}(\Sigma_m^{-1} \Sigma_m^{\text{old}}) \bigr] \Bigr)$   (the result remains a weak-sense auxiliary function)

  – $s_q$ and $e_q$ are the start and end frames of phone arc $q$, and $\gamma_{qm}^{r}(t)$ is the within-arc occupancy of Gaussian $m$ at time $t$

50

MPE: Final Auxiliary Function (2/2)

• Summary of the construction:

  $F_{\text{MPE}}(\Lambda) = \sum_{r=1}^{R} \frac{\sum_{W_i \in \mathcal{W}_r^{\text{lat}}} p_{\Lambda}(O_r \mid W_i)\, P(W_i)\, A(W_i, W_r)}{\sum_{W_k \in \mathcal{W}_r^{\text{lat}}} p_{\Lambda}(O_r \mid W_k)\, P(W_k)}$

  – Weak-sense auxiliary function: $H_{\text{MPE}}(\Lambda, \Lambda^{\text{old}}) = \sum_{r} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \gamma_q^{\text{MPE}} \log p_{\Lambda}(O_r \mid q)$

  – Strong-sense auxiliary function for each $\log p_{\Lambda}(O_r \mid q)$: $\sum_{m} \sum_{t=s_q}^{e_q} \gamma_{qm}^{r}(t) \log \mathcal{N}\bigl(o_r(t); \mu_m, \Sigma_m\bigr)$

  – Adding the smoothing function keeps the whole expression a weak-sense auxiliary function for $F_{\text{MPE}}(\Lambda)$

51

MPE: Model Update (1/2)

• Based on the final auxiliary function, we have the following update formulas (per mixture component $m$ and feature dimension $d$; diagonal covariance matrices)

  $\hat{\mu}_{md} = \frac{\theta_{md}^{\text{num}}(O) - \theta_{md}^{\text{den}}(O) + D_{md}\, \mu_{md}}{\gamma_{m}^{\text{num}} - \gamma_{m}^{\text{den}} + D_{md}}$

  $\hat{\sigma}_{md}^{2} = \frac{\theta_{md}^{\text{num}}(O^{2}) - \theta_{md}^{\text{den}}(O^{2}) + D_{md}\,(\sigma_{md}^{2} + \mu_{md}^{2})}{\gamma_{m}^{\text{num}} - \gamma_{m}^{\text{den}} + D_{md}} - \hat{\mu}_{md}^{2}$

  – The full-covariance form uses the correlation-matrix statistic $\theta_{m}(OO^{\mathsf{T}}) = \sum_{t} \gamma_{m}(t)\, o(t)\, o(t)^{\mathsf{T}}$:

  $\hat{\Sigma}_{m} = \frac{\theta_{m}^{\text{num}}(OO^{\mathsf{T}}) - \theta_{m}^{\text{den}}(OO^{\mathsf{T}}) + D_{m}\,(\Sigma_{m} + \mu_{m} \mu_{m}^{\mathsf{T}})}{\gamma_{m}^{\text{num}} - \gamma_{m}^{\text{den}} + D_{m}} - \hat{\mu}_{m} \hat{\mu}_{m}^{\mathsf{T}}$
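A sketch of the per-Gaussian update for diagonal covariances, following the formulas reconstructed above; the statistic values and the smoothing constant `D` are invented for the example.

```python
import numpy as np

def mpe_update_diag(num, den, mu_old, var_old, D):
    """
    EBW-style MPE update for one diagonal-covariance Gaussian.
    `num`/`den` are dicts of accumulated statistics:
      "gamma" : scalar occupancy,
      "o"     : sum of gamma(t) * o(t)      (per dimension),
      "o2"    : sum of gamma(t) * o(t)**2   (per dimension).
    D is the per-Gaussian smoothing constant.
    """
    denom = num["gamma"] - den["gamma"] + D
    mu_new = (num["o"] - den["o"] + D * mu_old) / denom
    var_new = (num["o2"] - den["o2"] + D * (var_old + mu_old ** 2)) / denom - mu_new ** 2
    return mu_new, var_new

# Toy statistics for a 2-dimensional Gaussian (all numbers invented)
num = {"gamma": 12.0, "o": np.array([6.0, -3.0]), "o2": np.array([14.0, 10.0])}
den = {"gamma": 7.0,  "o": np.array([2.0, -4.0]), "o2": np.array([9.0, 11.0])}
mu_old, var_old = np.array([0.4, -0.1]), np.array([1.0, 1.2])
print(mpe_update_diag(num, den, mu_old, var_old, D=20.0))
```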


52

MPE: Model Update (2/2)

• Two sets of statistics (numerator and denominator) are accumulated respectively

  $\gamma_{m}^{\text{num}} = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \max\bigl(0, \gamma_q^{\text{MPE}}\bigr) \sum_{t=s_q}^{e_q} \gamma_{qm}^{r}(t)$

  $\theta_{m}^{\text{num}}(O) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \max\bigl(0, \gamma_q^{\text{MPE}}\bigr) \sum_{t=s_q}^{e_q} \gamma_{qm}^{r}(t)\, o_r(t)$

  $\theta_{m}^{\text{num}}(O^{2}) = \sum_{r=1}^{R} \sum_{q \in \mathcal{W}_r^{\text{lat}}} \max\bigl(0, \gamma_q^{\text{MPE}}\bigr) \sum_{t=s_q}^{e_q} \gamma_{qm}^{r}(t)\, o_r(t)^{2}$

  – The denominator statistics $\gamma_{m}^{\text{den}}$, $\theta_{m}^{\text{den}}(O)$ and $\theta_{m}^{\text{den}}(O^{2})$ are defined analogously with $\max\bigl(0, -\gamma_q^{\text{MPE}}\bigr)$

53

MPE: Setting Constants (1/2)

• The mean and variance update formulas rely on the proper setting of the smoothing constant $D_{md}$ (or $D_m$)
  – If $D_{md}$ is too large, the step size is small and convergence is slow
  – If $D_{md}$ is too small, the algorithm may become unstable
  – $D_{md}$ also needs to make all updated variances positive

• Substituting the update formulas into the requirement $\hat{\sigma}_{md}^{2} > 0$ gives a quadratic condition in $D_{md}$:

  $\underbrace{\sigma_{md}^{2}}_{A}\, D_{md}^{2} + \underbrace{\bigl[\theta_{md}(O^{2}) + \gamma_{m}\,(\sigma_{md}^{2} + \mu_{md}^{2}) - 2\,\theta_{md}(O)\,\mu_{md}\bigr]}_{B}\, D_{md} + \underbrace{\bigl[\gamma_{m}\,\theta_{md}(O^{2}) - \theta_{md}(O)^{2}\bigr]}_{C} > 0$

  where $\gamma_{m} = \gamma_{m}^{\text{num}} - \gamma_{m}^{\text{den}}$, $\theta_{md}(O) = \theta_{md}^{\text{num}}(O) - \theta_{md}^{\text{den}}(O)$, and $\theta_{md}(O^{2}) = \theta_{md}^{\text{num}}(O^{2}) - \theta_{md}^{\text{den}}(O^{2})$

54

MPE: Setting Constants (2/2)

• Previous work [Povey 2004] used a value of $D_{md}$ that was twice the minimum positive value needed to ensure all variance updates were positive

  – The minimum positive value is the larger root of the quadratic from the previous slide:

  $D_{md}^{\min} = \frac{-B + \sqrt{B^{2} - 4AC}}{2A}$
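A sketch of picking the smoothing constant as twice the larger root of that quadratic; the coefficient values are invented for the example.

```python
import math

def min_positive_D(A, B, C):
    """
    Smallest smoothing constant that keeps the updated variance positive:
    the larger root of A*D^2 + B*D + C = 0 (A is the old variance, A > 0).
    Returns 0 if the quadratic is already positive for all D >= 0.
    """
    disc = B * B - 4.0 * A * C
    if disc <= 0.0:
        return 0.0
    return max(0.0, (-B + math.sqrt(disc)) / (2.0 * A))

# As on the slide, the constant actually used is twice the minimum positive value
A, B, C = 1.0, -3.0, 1.0          # invented example coefficients
D_md = 2.0 * min_positive_D(A, B, C)
print(D_md)
```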


55

MPE: I-Smoothing

• I-smoothing increases the weight of the numerator counts depending on the amount of data available for each Gaussian

• This is done by augmenting the numerator terms ($\gamma_m^{\text{num}}$, $\theta_m^{\text{num}}(O)$, $\theta_m^{\text{num}}(O^{2})$) in the update formulas with $\tau_m$ points of ML statistics:

  $\hat{\gamma}_m^{\text{num}} = \gamma_m^{\text{num}} + \tau_m$

  $\hat{\theta}_m^{\text{num}}(O) = \theta_m^{\text{num}}(O) + \frac{\tau_m}{\gamma_m^{\text{ML}}}\, \theta_m^{\text{ML}}(O)$

  $\hat{\theta}_m^{\text{num}}(O^{2}) = \theta_m^{\text{num}}(O^{2}) + \frac{\tau_m}{\gamma_m^{\text{ML}}}\, \theta_m^{\text{ML}}(O^{2})$

  – The numerator statistics emphasize positive contributions (arcs with higher accuracy)
  – $\tau_m$ can be set empirically (e.g., $\tau_m = 100$)
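A sketch of I-smoothing as reconstructed above (backing the MPE numerator statistics off towards $\tau_m$ points of ML statistics); the statistic values are invented, and the exact form should be checked against [Povey 2004].

```python
import numpy as np

def i_smooth(num, ml, tau):
    """
    I-smoothing sketch: add tau points of ML statistics to the MPE numerator stats.
    Each stats dict has "gamma" (scalar), "o" and "o2" (per-dimension arrays).
    """
    scale = tau / ml["gamma"]
    return {
        "gamma": num["gamma"] + tau,
        "o":     num["o"]  + scale * ml["o"],
        "o2":    num["o2"] + scale * ml["o2"],
    }

num = {"gamma": 12.0, "o": np.array([6.0, -3.0]),  "o2": np.array([14.0, 10.0])}
ml  = {"gamma": 40.0, "o": np.array([18.0, -6.0]), "o2": np.array([52.0, 46.0])}
print(i_smooth(num, ml, tau=100.0))
```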


56

Preliminary Experimental Results (1/3)

  CER (%)         | ML_itr10 | ML_itr10 + MPE_itr10 | ML_itr150 | ML_itr150 + MPE_itr10
  MFCC (CMS)      |  29.48   |        26.52         |   28.72   |         24.44
  MFCC (CN)       |  26.60   |        25.05         |   26.96   |         24.54
  HLDA+MLLT (CN)  |  23.64   |        20.92         |   23.78   |         20.79

• Experiment conducted on the MATBN (TV broadcast news) corpus (field reporters)
  – Training: 34,672 utterances (25.5 hrs)
  – Testing: 292 utterances (1.45 hrs, outside testing)
  – Metric: Chinese character error rate (CER)

  – Relative improvements with MPE: 12.83% and 11.51%


57

Preliminary Experimental Results (2/3)

• Another experiment conducted on the MATBN (TV broadcast news) corpus (field reporters)
  – Discriminative feature: HLDA+MLLT(CN)
  – Discriminative training: MPE
  – Total 34,672 utterances (24.5 hrs), 10-fold cross validation

  CER (%): MFCC_CN ML: 21.43;  HLDA+MLLT_CN ML: 18.43;  HLDA+MLLT_CN MPE: 15.8

• A relative improvement of 26.2% was finally achieved (HLDA+MLLT(CN) + MPE)

58

Preliminary Experimental Results (3/3)

• Corpus segmentation conducted on the TIMIT corpus
  – 50 phone categories
  – Training: 4546 utterances (3.8 hrs)
  – Testing: 1646 utterances (1.4 hrs)

• Minimum Error Length (MEL)

  $F_{\text{MEL}}(\Lambda) = \sum_{r} \sum_{S_i \in \text{Seg}_r} \frac{p_{\Lambda}(O_r \mid S_i)}{\sum_{S_j \in \text{Seg}_r} p_{\Lambda}(O_r \mid S_j)}\, \text{ErrLen}(S_i, S_r)$

  – $\text{Seg}_r$: the set of candidate segmentations $S_i$ of utterance $r$; $S_r$: the reference segmentation; $\text{ErrLen}(S_i, S_r)$: the error length of $S_i$ against $S_r$

[Figure: frame error rate (%) on the training and test sets vs. training iteration (0 = ML baseline through iteration 10); y-axis range 7–17%]

[Figure: phone lattice generated by forced alignment (phone arcs such as sil, w, eh, ih, dh, ey)]


61

Conclusions & Future work

• MPE/MWE (or MMI) based discriminative training approaches have shown their effectiveness in Chinese continuous speech recognition
  – Joint training of feature transformations, acoustic models and language models
  – Unsupervised training
  – More in-depth investigation and analysis is needed
  – Exploration of variant accuracy/error functions

62

References

• D. Povey, P.C. Woodland, M.J.F. Gales, “Discriminative MAP for Acoustic Model Adaptation,” in Proc. ICASSP 2003

• R. Schlüter, W. Macherey, B. Müller, H. Ney, “Comparison of discriminative training criteria and optimization methods for speech recognition,” Speech Communication 34, 2001

• K. Vertanen, “An Overview of Discriminative Training for Speech Recognition”

• V. Goel, S. Kumar, W. Byrne, “Segmental minimum Bayes-risk decoding for automatic speech recognition,” IEEE Transactions on Speech and Audio Processing, May 2004

• J.W. Kuo, “An Initial Study on Minimum Phone Error Discriminative Learning of Acoustic Models for Mandarin Large Vocabulary Continuous Speech Recognition” Master Thesis, NTNU, 2005

• J.W. Kuo, B. Chen, "Minimum Word Error Based Discriminative Training of Language Models," Eurospeech 2005


63

References

• Lidia Mangu, Eric Brill, Andreas Stolcke, “Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks,” Computer Speech and Language 14, 2000

• R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, Second Edition. New York: John Wiley & Sons, 2000

• R. Schlüter , “Investigations on Discriminative Training Criteria,” Ph.D Dissertation, RWTH Aachen - University of Technology, September 2000

• L. R. Bahl, P. F. Brown, P. V. de Souza and R. L. Mercer, “Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition,” in Proc. ICASSP’86

• Wu Chou, “Discriminant-Function-Based Minimum Recognition Error Rate Pattern-Recognition Approach to Speech Recognition,” Proceedings of the IEEE, Vol. 88, No. 8, 2000

• T. Minka. Expectation-Maximization as lower bound maximization. http://research.microsoft.com/~minka/papers/em.html, 1998


Thank You !


65

Appendix: MMI vs. MPE

[Figure: character error rate (%) vs. training iteration (0 to 10) for ML, MMI and MPE training; the ML baseline is 27.88%, MMI reaches 25.62%, and MPE reaches 24.51%]

  CER relative reduction: MMI 8%, MPE 12%