61
Speech Recognition HUNG-YI LEE ๆŽๅฎๆฏ…

Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

SpeechRecognitionHUNG-YI LEE ๆŽๅฎๆฏ…

Page 2: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

LAS: ๅฐฑๆ˜ฏ seq2seq

CTC: decoder ๆ˜ฏ linear classifier ็š„ seq2seq

RNA: ่ผธๅ…ฅไธ€ๅ€‹ๆฑ่ฅฟๅฐฑ่ฆ่ผธๅ‡บไธ€ๅ€‹ๆฑ่ฅฟ็š„ seq2seq

RNN-T: ่ผธๅ…ฅไธ€ๅ€‹ๆฑ่ฅฟๅฏไปฅ่ผธๅ‡บๅคšๅ€‹ๆฑ่ฅฟ็š„ seq2seq

Neural Transducer: ๆฏๆฌก่ผธๅ…ฅไธ€ๅ€‹ window ็š„ RNN-T

MoCha: window ็งปๅ‹•ไผธ็ธฎ่‡ชๅฆ‚็š„ Neural Transducer

Last Time

Page 3: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Two Points of Views

Source of image: ๆŽ็ณๅฑฑ่€ๅธซใ€Šๆ•ธไฝ่ชž้Ÿณ่™•็†ๆฆ‚่ซ–ใ€‹

Seq-to-seq HMM

Page 4: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Hidden Markov Model (HMM)

SpeechRecognition

speech text

X Y

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘ƒ Y|๐‘‹

= ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘ƒ ๐‘‹|Y ๐‘ƒ Y

๐‘ƒ ๐‘‹

= ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘ƒ ๐‘‹|Y ๐‘ƒ Y

๐‘ƒ ๐‘‹|Y : Acoustic Model

๐‘ƒ Y : Language Model

HMM

Decode

Page 5: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM

hh w aa t d uw y uw th ih ng k

what do you think

t-d+uw1 t-d+uw2 t-d+uw3

โ€ฆโ€ฆ t-d+uw d-uw+y uw-y+uw y-uw+th โ€ฆโ€ฆ

d-uw+y1 d-uw+y2 d-uw+y3

Phoneme:

Tri-phone:

State:

๐‘ƒ ๐‘‹|Y ๐‘ƒ ๐‘‹|๐‘†

A token sequence Y corresponds to a sequence of states S

Page 6: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM๐‘ƒ ๐‘‹|Y ๐‘ƒ ๐‘‹|๐‘†

A sentence Y corresponds to a sequence of states S

๐‘Ž ๐‘ ๐‘

Start End

๐‘† ๐‘‹

Page 7: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM๐‘ƒ ๐‘‹|Y ๐‘ƒ ๐‘‹|๐‘†

A sentence Y corresponds to a sequence of states S

Transition Probability

๐‘ ๐‘|๐‘Ž๐‘Ž ๐‘

Emission Probability

t-d+uw1

d-uw+y3

P(x|โ€t-d+uw1โ€)

P(x|โ€d-uw+y3โ€)

Gaussian Mixture Model (GMM)

Probability from one state to another

Page 8: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM โ€“ Emission Probability

โ€ข Too many states โ€ฆโ€ฆ

P(x|โ€t-d+uw3โ€)P(x|โ€d-uw+y3โ€)

Tied-state

Same Addresspointer pointer

็ต‚ๆฅตๅž‹ๆ…‹: Subspace GMM [Povey, et al., ICASSPโ€™10]

(Geoffrey Hinton also published deep learning for ASR in the same conference)

[Mohamed , et al., ICASSPโ€™10]

Page 9: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

P๐œƒ ๐‘‹|๐‘† =?

transition

๐‘ ๐‘|๐‘Ž

๐‘Ž ๐‘emission(GMM)

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘›(๐‘†)

๐‘ƒ ๐‘‹|โ„Ž

alignment

which state generates which vector

โ„Ž1 = ๐‘Ž๐‘Ž๐‘๐‘๐‘๐‘

๐‘Ž ๐‘ ๐‘

โ„Ž2 = ๐‘Ž๐‘๐‘๐‘๐‘๐‘

โ„Ž = ๐‘Ž๐‘๐‘๐‘๐‘๐‘

๐‘ƒ ๐‘‹|โ„Ž1

๐‘ƒ ๐‘‹|โ„Ž2

โ„Ž = ๐‘Ž๐‘๐‘๐‘๐‘

๐‘Ž ๐‘Ž ๐‘ ๐‘ ๐‘ ๐‘๐‘ƒ ๐‘Ž|๐‘Ž ๐‘ƒ ๐‘|๐‘Ž ๐‘ƒ ๐‘|๐‘ ๐‘ƒ ๐‘|๐‘ ๐‘ƒ ๐‘|๐‘

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐‘ƒ ๐‘ฅ1|๐‘Ž ๐‘ƒ ๐‘ฅ2|๐‘Ž ๐‘ƒ ๐‘ฅ3|๐‘ ๐‘ƒ ๐‘ฅ4|๐‘ ๐‘ƒ ๐‘ฅ5|๐‘ ๐‘ƒ ๐‘ฅ6|๐‘

Page 10: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

How to use Deep Learning?

Page 11: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Method 1: Tandem

โ€ฆโ€ฆ โ€ฆโ€ฆ

๐‘ฅ๐‘–

Size of output layer= No. of states DNN

โ€ฆโ€ฆ

โ€ฆโ€ฆ

Last hidden layer or bottleneck layer are also possible.

New acoustic feature for HMM

๐‘(๐‘Ž|๐‘ฅ๐‘–) ๐‘(๐‘|๐‘ฅ๐‘–) ๐‘(๐‘|๐‘ฅ๐‘–)

State classifier

Page 12: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Method 2: DNN-HMM Hybrid

DNN

๐‘ƒ ๐‘Ž|๐‘ฅโ€ฆโ€ฆ

๐‘ฅ

๐‘ƒ ๐‘ฅ|๐‘Ž

๐‘ƒ ๐‘ฅ|๐‘Ž =๐‘ƒ ๐‘ฅ, ๐‘Ž

๐‘ƒ ๐‘Ž=๐‘ƒ ๐‘Ž|๐‘ฅ ๐‘ƒ ๐‘ฅ

๐‘ƒ ๐‘ŽCount from training data

CNN, LSTM โ€ฆ

DNN output

Page 13: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

How to train a state classifier?

Train HMM-GMM model

state sequence: { a , b , c }

Acoustic features:

Utterance + Label(without alignment)

Utterance + Label(aligned)

state sequence:

Acoustic features:

a a a b b c c

Page 14: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

How to train a state classifier?

Train HMM-GMM model

Utterance + Label(without alignment)

Utterance + Label(aligned)

DNN1

Page 15: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

How to train a state classifier?

Utterance + Label(without alignment)

Utterance + Label(aligned)

DNN2DNN1realignment

Page 16: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Human Parity!

โ€ข ๅพฎ่ปŸ่ชž้Ÿณ่พจ่ญ˜ๆŠ€่ก“็ช็ ด้‡ๅคง้‡Œ็จ‹็ข‘๏ผšๅฐ่ฉฑ่พจ่ญ˜่ƒฝๅŠ›้”ไบบ้กžๆฐดๆบ–๏ผ(2016.10)

โ€ข https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216

โ€ข IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)

โ€ข http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/

Machine 5.9% v.s. Human 5.9%

Machine 5.5% v.s. Human 5.1%

[Yu, et al., INTERSPEECHโ€™16]

[Saon, et al., INTERSPEECHโ€™17]

Page 17: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Very Deep

[Yu, et al., INTERSPEECHโ€™16]

Page 18: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Back to End-to-end

Page 19: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

LAS

๐‘ƒ ๐‘Œ|X =?

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

๐‘ฅ4๐‘ฅ3๐‘ฅ2๐‘ฅ1

โ€ข LAS directly computes ๐‘ƒ ๐‘Œ|X

๐‘Ž ๐‘

๐‘ƒ ๐‘Œ|X = ๐‘ ๐‘Ž|๐‘‹ ๐‘ ๐‘|๐‘Ž, ๐‘‹ โ€ฆ๐‘ง0

๐‘0

๐‘ง1

Size V

๐‘1

๐‘ง1 ๐‘ง2

a

๐‘ ๐‘Ž ๐‘ ๐‘

๐‘2

๐‘ง3

b

๐‘ ๐ธ๐‘‚๐‘†

Beam Search

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Decoding:

Training:

Page 20: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

CTC, RNN-T

๐‘ƒ ๐‘Œ|X =?

๐‘ฅ4๐‘ฅ3๐‘ฅ2๐‘ฅ1

โ€ข LAS directly computes ๐‘ƒ ๐‘Œ|X

๐‘Ž ๐‘

๐‘ƒ ๐‘Œ|X = ๐‘ ๐‘Ž|๐‘‹ ๐‘ ๐‘|๐‘Ž, ๐‘‹ โ€ฆ

โ€ข CTC and RNN-T need alignment

Encoder

โ„Ž4โ„Ž3โ„Ž2โ„Ž1

โ„Ž = ๐‘Ž ๐œ™ ๐‘ ๐œ™๐‘ƒ โ„Ž|๐‘‹

๐‘ ๐‘Ž ๐‘ ๐‘๐‘ ๐œ™ ๐‘ ๐œ™

P Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

โ†’ ๐‘Ž ๐‘

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹Beam Search

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Decoding:

Training:

Page 21: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM, CTC, RNN-T

P๐œƒ ๐‘‹|๐‘† =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘†

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

P๐œƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

1. Enumerate all the possible alignments

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

Page 22: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM, CTC, RNN-T

๐‘ƒ ๐‘‹|๐‘† =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘†

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

๐‘ƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

1. Enumerate all the possible alignments

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

Page 23: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

All the alignments

HMM

CTC

RNN-T

LAS

ไฝ ๅ€‘ๅœจๅฟ™ไป€้บผโ˜บ

c a t c c a a a t c a a a a t

c ๐œ™ a a t t

add ๐œ™

duplicate to length ๐‘‡

๐œ™ c a ๐œ™ t ๐œ™

add ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™

๐‘‡ = 6

SpeechRecognition

speech text

c a t

๐‘ = 3

โ€ฆ

duplicatec a t

to length ๐‘‡

โ€ฆ

c a t โ€ฆโ€ฆ

๐‘Œ โ„Ž โˆˆ ๐‘Ž๐‘™๐‘–๐‘”๐‘›(๐‘Œ)

Page 24: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM c a t c c a a a t c a a a a tduplicate to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

nexttoken

duplicate

For n = 1 to ๐‘

output the n-th token ๐‘ก๐‘› times

constraint: ๐‘ก1 + ๐‘ก2 +โ‹ฏ๐‘ก๐‘ = ๐‘‡, ๐‘ก๐‘› > 0Trellis Graph

c c c

a

a

t

aa

t

t t

Page 25: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

HMM c a t c c a a a t c a a a a tduplicate to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

nexttoken

duplicate

For n = 1 to ๐‘

output the n-th token ๐‘ก๐‘› times

constraint: ๐‘ก1 + ๐‘ก2 +โ‹ฏ๐‘ก๐‘ = ๐‘‡, ๐‘ก๐‘› > 0Trellis Graph

c c c c cc

Page 26: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

For n = 1 to ๐‘

output the n-th token ๐‘ก๐‘› times

constraint:

output โ€œ๐œ™โ€ ๐‘๐‘› times

๐‘ก๐‘› > 0 ๐‘๐‘› โ‰ฅ 0

output โ€œ๐œ™โ€ ๐‘0 times

๐‘ก1 + ๐‘ก2 +โ‹ฏ๐‘ก๐‘ +

๐‘0 + ๐‘1 +โ‹ฏ๐‘๐‘ = ๐‘‡

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

Page 27: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

next token

duplicate

insert ๐œ™

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

(๐œ™ can be skipped)

c

Page 28: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

duplicate ๐œ™

next token

cannot skip any token

๐œ™

Page 29: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

next token

duplicate

insert ๐œ™

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

duplicate

next token

Page 30: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

cc

a

t

๐œ™

a

๐œ™

t

๐œ™๐œ™

c

๐œ™

Page 31: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

c

a

t

๐œ™๐œ™

c c ca

t

c

๐œ™

Page 32: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

s

๐œ™

e

๐œ™

e

๐œ™

next token

duplicate

insert ๐œ™

โ€ฆ ee โ€ฆ โ†’ e

Exception: when the next token is the same token

c

๐œ™

Page 33: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

For n = 1 to ๐‘

output the n-th token 1 times

constraint:

output โ€œ๐œ™โ€ ๐‘๐‘› times

output โ€œ๐œ™โ€ ๐‘0 times

๐‘0 + ๐‘1 +โ‹ฏ๐‘๐‘ = ๐‘‡

c a t

Put some ๐œ™ (option)

Put some ๐œ™ (option)

Put some ๐œ™ (option)

Put some ๐œ™

at least once

๐‘๐‘ > 0

๐‘๐‘› โ‰ฅ 0 for n = 1 to ๐‘ โˆ’ 1

Page 34: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

output token

Insert ๐œ™

๐œ™

๐œ™

c

Page 35: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

output token

Insert ๐œ™

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

๐œ™c

๐œ™ ๐œ™๐‘Ž

๐‘ก๐œ™

๐‘ก๐œ™ ๐œ™

c

a

๐œ™ ๐œ™ ๐œ™ ๐œ™ ๐œ™ ๐œ™

Page 36: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

output token

Insert ๐œ™

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

๐œ™

๐œ™ ๐œ™ ๐œ™ ๐œ™ ๐œ™

๐‘ก

c

a

๐œ™

Page 37: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

c a tStart End

c a tStart End

๐œ™ ๐œ™ ๐œ™ ๐œ™

c a tStart End

๐œ™ ๐œ™ ๐œ™ ๐œ™

HMM

CTC

RNN-T

not gen not gen not gen

Page 38: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

1. Enumerate all the possible alignments

HMM, CTC, RNN-T

๐‘ƒ ๐‘‹|Y =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

๐‘ƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

Page 39: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

This part is challenging.

Page 40: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Score Computation

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

output token

Insert ๐œ™

๐œ™c

๐œ™ ๐œ™๐‘Ž

๐œ™๐‘ก๐œ™ ๐œ™

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘ƒ โ„Ž|๐‘‹

= ๐‘ƒ ๐œ™|๐‘‹

ร— ๐‘ƒ ๐‘|๐‘‹, ๐œ™

ร— ๐‘ƒ ๐œ™|๐‘‹, ๐œ™๐‘ โ€ฆโ€ฆ

Page 41: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4 โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘1,0

๐‘

๐‘2,0

๐œ™

๐‘2,1

๐œ™

๐‘3,1

๐‘Ž

๐‘4,1

๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1 ๐‘™2 ๐‘™2 ๐‘™3๐‘™3

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

Page 42: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Score Computation

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐‘1,0 ๐œ™

๐‘2,0 ๐‘

๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™๐‘4,1 ๐‘Ž

๐‘4,2 ๐œ™๐‘5,2 ๐‘ก

๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘ƒ โ„Ž|๐‘‹

Page 43: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Score Computation

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

c<B

OS>

a

๐‘™ 0๐‘™ 1

๐‘™ 2

Because ๐œ™ is not considered!

๐‘4,2 ๐œ™

๐‘4,2 ๐‘Ž๐‘4,2

Page 44: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4 โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘1,0

๐‘

๐‘2,0

๐œ™

๐‘2,1

๐œ™

๐‘3,1

๐‘Ž

๐‘4,1

๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1 ๐‘™2 ๐‘™2 ๐‘™3๐‘™3

Page 45: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐œ™

โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘ ๐œ™ ๐œ™ ๐‘Ž ๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™2 ๐‘™2 ๐‘™3๐‘™3๐œ™ ๐‘๐œ™ ๐œ™ ๐‘Ž

๐œ™๐‘ ๐œ™ ๐œ™๐‘Ž

Page 46: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ผ4,2

๐›ผ4,2 = ๐›ผ4,1๐‘4,1 ๐‘Ž + ๐›ผ3,2๐‘3,2 ๐œ™

๐›ผ4,1

๐›ผ3,2

generate โ€œaโ€

read ๐‘ฅ4

generate โ€œ๐œ™ โ€,

๐›ผ๐‘–,๐‘—:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens

๐›ผ๐‘–,๐‘—

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

Page 47: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ผ4,2 = ๐›ผ4,1๐‘4,1 ๐‘Ž + ๐›ผ3,2๐‘3,2 ๐œ™

๐›ผ๐‘–,๐‘—:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens

You can compute summation of the scores of all the alignments.

Page 48: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

1. Enumerate all the possible alignments

HMM, CTC, RNN-T

P๐œƒ ๐‘‹|Y =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

P๐œƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

Page 49: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Training

๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘1,0 ๐œ™ ๐‘2,0 ๐‘ ๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™ ๐‘4,1 ๐‘Ž ๐‘4,2 ๐œ™ ๐‘5,2 ๐‘ก ๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”๐‘ƒ ๐‘Œ|๐‘‹

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

Page 50: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

๐‘1,0 ๐œ™ ๐‘2,0 ๐‘ ๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™ ๐‘4,1 ๐‘Ž ๐‘4,2 ๐œ™ ๐‘5,2 ๐‘ก ๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

Each arrow is a component in ๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

Page 51: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Training

๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘1,0 ๐œ™ ๐‘2,0 ๐‘ ๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™ ๐‘4,1 ๐‘Ž ๐‘4,2 ๐œ™ ๐‘5,2 ๐‘ก ๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”๐‘ƒ ๐‘Œ|๐‘‹

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

๐œƒ

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

โ€ฆโ€ฆ๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ

Page 52: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐œƒ

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

โ€ฆโ€ฆ๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

Each arrow is a component

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

Page 53: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘1,0 ๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4

๐‘1,0

๐‘2,0 ๐‘

๐‘2,0

๐‘2,1 ๐œ™

๐‘2,1

๐‘3,1 ๐œ™

๐‘3,1

๐‘4,1 ๐‘Ž

๐‘4,1

c<BOS>

๐‘™0 ๐‘™1

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ=?

Backpropagation (through time)

To encoder

Page 54: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹ +

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž๐‘œ๐‘ข๐‘ก ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹

=

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹

๐‘4,1 ๐‘Ž

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ

๐‘4,1 ๐‘Ž ร— ๐‘œ๐‘กโ„Ž๐‘’๐‘Ÿ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž

=1

๐‘4,1 ๐‘Ž

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹

=

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘œ๐‘กโ„Ž๐‘’๐‘Ÿ

Page 55: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ฝ4,2

๐›ฝ4,2 = ๐›ฝ4,3๐‘4,2 ๐‘ก + ๐›ฝ5,2๐‘4,2 ๐œ™

generate โ€œtโ€

read ๐‘ฅ5

generate โ€œ๐œ™โ€

๐›ฝ๐‘–,๐‘—:the summation of the score of all the alignmentsstaring from i-th acoustic features and j-th tokens

๐›ฝ5,2

๐›ฝ4,3

Page 56: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ฝ4,2

๐›ผ4,1

๐‘4,1 ๐‘Ž

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž=

1

๐‘4,1 ๐‘Ž

๐‘Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹ ๐›ผ4,1 ๐‘4,1 ๐‘Ž ๐›ฝ4,2

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ๐›ผ4,1๐›ฝ4,2

Page 57: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

1. Enumerate all the possible alignments

HMM, CTC, RNN-T

P๐œƒ ๐‘‹|Y =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

P๐œƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

Page 58: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”P Y|๐‘‹

= ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

โ‰ˆ ๐‘Ž๐‘Ÿ๐‘”maxY

maxโ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘™๐‘œ๐‘”๐‘ƒ โ„Ž|๐‘‹

maxโ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹็†ๆƒณ

โ„Žโˆ— = ๐‘Ž๐‘Ÿ๐‘” maxโ„Ž

๐‘™๐‘œ๐‘”๐‘ƒ โ„Ž|๐‘‹

Yโˆ— = ๐‘Ž๐‘™๐‘–๐‘”๐‘›โˆ’1 โ„Žโˆ—

Decoding

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™ โ€ฆ

๐‘ƒ โ„Ž|๐‘‹ = ๐‘ƒ โ„Ž1|๐‘‹ ๐‘ƒ โ„Ž2|๐‘‹, โ„Ž1 ๐‘ƒ โ„Ž3|๐‘‹, โ„Ž1, โ„Ž2 โ€ฆ

โ„Ž1 โ„Ž2 โ„Ž3

็พๅฏฆ

Page 59: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

โ„Žโˆ— = ๐‘Ž๐‘Ÿ๐‘” maxโ„Ž

๐‘™๐‘œ๐‘”๐‘ƒ โ„Ž|๐‘‹

๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4 โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘1,0

๐‘

๐‘2,0

๐œ™

๐‘2,1

๐œ™

๐‘3,1

๐‘Ž

๐‘4,1

๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1 ๐‘™2 ๐‘™2 ๐‘™3๐‘™3

max

โ„Žโˆ—

Beam Search for better

approximation

Page 60: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Summary

LAS CTC RNN-T

Decoder dependent independent dependent

Alignment not explicit (soft alignment)

Yes Yes

Training just train it sum over alignment

sum over alignment

On-line No Yes Yes

Page 61: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โˆ—= ๐‘Ÿ๐‘”max Y Y| = ๐‘Ÿ๐‘”max Y |Y Y = ๐‘Ÿ๐‘”max Y |Y Y |Y: Acoustic

Reference

โ€ข [Yu, et al., INTERSPEECHโ€™16] Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke , Guoli Ye, Jinyu Li , Geoffrey Zweig, Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention, INTERSPEECH, 2016

โ€ข [Saon, et al., INTERSPEECHโ€™17] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, BhuvanaRamabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, English Conversational Telephone Speech Recognition by Humans and Machines, INTERSPEECH, 2017

โ€ข [Povey, et al., ICASSPโ€™10] Daniel Povey, Lukas Burget, Mohit Agarwal, Pinar Akyazi, Kai Feng, Arnab Ghoshal, Ondrej Glembek, Nagendra Kumar Goel, Martin Karafiat, Ariya Rastrow, Richard C. Rose, Petr Schwarz, Samuel Thomas, Subspace Gaussian Mixture Models for speech recognition, ICASSP, 2010

โ€ข [Mohamed , et al., ICASSPโ€™10] Abdel-rahman Mohamed and Geoffrey Hinton, Phone recognition using Restricted Boltzmann Machines, ICASSP, 2010