Deep Learning for Speech Recognition
Hung-yi Lee


Page 1

Deep Learning for Speech Recognition

Hung-yi Lee

Page 2

Outline

• Conventional Speech Recognition

• How to use Deep Learning in acoustic modeling?

• Why Deep Learning?

• Speaker Adaptation

• Multi-task Deep Learning

• New acoustic features

• Convolutional Neural Network (CNN)

• Applications in Acoustic Signal Processing

Page 3

Conventional Speech Recognition

Page 4

Machine Learning helps

Speech recognition maps acoustic input X to a word sequence W (e.g., X = an utterance, W = 天氣真好 "The weather is so nice"):

$W^* = \arg\max_W P(W|X) = \arg\max_W \frac{P(X|W)P(W)}{P(X)} = \arg\max_W P(X|W)P(W)$

$P(X|W)$: Acoustic Model, $P(W)$: Language Model

This is a structured learning problem.

Evaluation function: $F(X,W) = P(W|X)$

Inference: $W^* = \arg\max_W F(X,W)$

A toy sketch of this inference step follows.
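To make the decision rule concrete, here is a minimal Python sketch of scoring candidate word sequences; the candidates and their log-probabilities are invented stand-ins for illustration, not numbers from the lecture.

```python
# Toy log-probabilities (hypothetical values for illustration only).
# log P(X|W): acoustic model score, log P(W): language model score.
candidates = {
    "what do you think": {"log_p_x_given_w": -120.5, "log_p_w": -8.2},
    "what do you sink":  {"log_p_x_given_w": -119.8, "log_p_w": -14.6},
    "watt dew yu think": {"log_p_x_given_w": -121.0, "log_p_w": -22.3},
}

# W* = argmax_W P(X|W) P(W), computed in the log domain for stability.
best_w = max(candidates,
             key=lambda w: candidates[w]["log_p_x_given_w"]
                           + candidates[w]["log_p_w"])
print(best_w)  # "what do you think": slightly worse AM score, much better LM score
```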

Page 5

Input Representation

• Audio is represented by a vector sequence X: x1 x2 x3 …… (e.g., one 39-dim MFCC vector per frame)

Page 6

Input Representation - Splice

• To consider some temporal information, each MFCC frame is spliced with its neighboring frames, i.e., several consecutive frames are concatenated into one longer input vector. A sketch follows.
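A minimal numpy sketch of frame splicing; the context width of ±4 frames is an assumption for illustration (the slide does not specify it).

```python
import numpy as np

def splice(frames: np.ndarray, context: int = 4) -> np.ndarray:
    """Concatenate each frame with its +/- `context` neighbors.

    frames:  (T, D) array, e.g., T frames of 39-dim MFCC.
    returns: (T, (2*context+1)*D) array of spliced features.
    """
    # Repeat edge frames so boundary frames still get a full context window.
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + len(frames)] for i in range(2 * context + 1)], axis=1)

mfcc = np.random.randn(100, 39)   # 100 frames of 39-dim MFCC
print(splice(mfcc).shape)          # (100, 351) = 39 * 9
```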

Page 7

Phoneme

• Phoneme: the basic unit of pronunciation. Each word corresponds to a sequence of phonemes, as defined by a lexicon.

what do you think
hh w aa t   d uw   y uw   th ih ng k

Different words can correspond to the same phoneme sequence.

Page 8

State

• Each phoneme corresponds to a sequence of states.

what do you think
Phone:      hh w aa t   d uw   y uw   th ih ng k
Tri-phone:  …… t-d+uw   d-uw+y   uw-y+uw   y-uw+th ……
State:      …… t-d+uw1 t-d+uw2 t-d+uw3   d-uw+y1 d-uw+y2 d-uw+y3 ……

Page 9

State

• Each state has a stationary distribution for acoustic features, e.g., P(x|"t-d+uw1") and P(x|"d-uw+y3"), conventionally modeled with a Gaussian Mixture Model (GMM).

Page 10

State

• Tied states: some states share the same distribution for acoustic features, like two pointers storing the same address, e.g., P(x|"t-d+uw1") and P(x|"d-uw+y3") may be tied to one distribution.

Page 11

Acoustic Model

$W^* = \arg\max_W P(X|W) P(W)$

X: x1 x2 x3 x4 x5 ……
W: what do you think?
S: a b c d e ……        P(X|W) = P(X|S)

Assume we also know the alignment $s_1 \cdots s_T$:

$P(X|S) = \prod_{t=1}^{T} P(s_t|s_{t-1})\, P(x_t|s_t)$

where $P(s_t|s_{t-1})$ is the transition probability and $P(x_t|s_t)$ is the emission probability.

Page 12

Acoustic Model

$W^* = \arg\max_W P(X|W) P(W)$

X: x1 x2 x3 x4 x5 ……
W: what do you think?
S: a b c d e ……        P(X|W) = P(X|S)

Actually, we don't know the alignment:

$P(X|S) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1})\, P(x_t|s_t)$   (Viterbi algorithm)

A minimal sketch of this maximization follows.
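A minimal numpy sketch of the Viterbi approximation above, done in the log domain; the tiny 3-state transition and emission tables are invented for illustration.

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """max over state sequences of prod_t P(s_t|s_{t-1}) P(x_t|s_t).

    log_trans: (S, S) array, log P(s_t = j | s_{t-1} = i)
    log_emit:  (T, S) array, log P(x_t | s_t = j), one row per frame
    returns:   (best log-probability, best state sequence)
    """
    T, S = log_emit.shape
    score = log_emit[0].copy()       # uniform initial state assumed
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (prev state, next state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(score.max()), path[::-1]

log_trans = np.log(np.array([[0.7, 0.3, 0.0],   # toy left-to-right HMM
                             [0.0, 0.7, 0.3],
                             [0.0, 0.0, 1.0]]) + 1e-10)
log_emit = np.log(np.random.dirichlet(np.ones(3), size=6))  # 6 frames
print(viterbi(log_trans, log_emit))
```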

Page 13

How to use Deep Learning?

Page 14

People imagine ……

• a DNN that eats the whole sequence of acoustic features and directly outputs "Hello".

This cannot be true! A plain DNN can only take fixed-length vectors as input, while utterances have variable length.

Page 15

What DNN can do is ……

• DNN input: one acoustic feature vector xi
• DNN output: the probability of each state, P(a|xi), P(b|xi), P(c|xi), ……
• Size of the output layer = number of states

A minimal sketch of such a frame classifier follows.
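A minimal PyTorch sketch of this frame-level state classifier; the layer sizes (351-dim spliced input, two hidden layers of 512 units, 1000 tied states) are assumptions for illustration.

```python
import torch
import torch.nn as nn

NUM_STATES = 1000      # assumed number of tied tri-phone states
FEAT_DIM = 351         # e.g., 39-dim MFCC spliced with +/- 4 frames

# One acoustic frame in -> posterior over all states out.
dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 512), nn.Sigmoid(),
    nn.Linear(512, 512), nn.Sigmoid(),
    nn.Linear(512, NUM_STATES),            # logits, one per state
)

x = torch.randn(8, FEAT_DIM)               # a mini-batch of 8 frames
state_posteriors = dnn(x).softmax(dim=-1)  # P(state | x_i) per frame
print(state_posteriors.shape)              # torch.Size([8, 1000])
```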

Page 16

Low-rank approximation

• W: the weight matrix between the last hidden layer (size N) and the output layer (size M), so W is M × N.
• M, the number of states, can be large if the outputs are the states of tri-phones.

Page 17

Low-rank approximation

• Factorize W (M × N) as W ≈ U V, where U is M × K, V is K × N, and K < M, N.
• In the network, the single output layer W is replaced by a linear layer V followed by an output layer U: M × K + K × N parameters instead of M × N, i.e., fewer parameters. A sketch follows.
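A minimal numpy sketch of obtaining U and V from the truncated SVD of a trained W; the sizes M = 1000, N = 512, K = 64 are assumptions for illustration.

```python
import numpy as np

M, N, K = 1000, 512, 64        # output size, hidden size, rank
W = np.random.randn(M, N)      # stands in for a trained weight matrix

# Truncated SVD: keep only the K largest singular values.
Uf, s, Vt = np.linalg.svd(W, full_matrices=False)
U = Uf[:, :K] * s[:K]          # M x K (singular values folded into U)
V = Vt[:K, :]                  # K x N

print(W.size, U.size + V.size)  # 512000 vs 96768 parameters
err = np.linalg.norm(W - U @ V) / np.linalg.norm(W)
print(f"relative approximation error: {err:.2f}")
```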

Page 18

How we use deep learning

• There are three ways to use DNN for acoustic modeling

• Way 1. Tandem

• Way 2. DNN-HMM hybrid

• Way 3. End-to-end

Efforts for exploiting deep learning

Page 19

How to use Deep Learning? Way 1: Tandem

Page 20

Way 1: Tandem system

• Train a DNN whose output layer size = number of states, giving P(a|xi), P(b|xi), P(c|xi), ……
• Use the DNN output as a new feature x′i, which becomes the input of your original speech recognition system. Taking the last hidden layer or a bottleneck layer instead is also possible.

A sketch of extracting such features follows.
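A minimal PyTorch sketch of tandem feature extraction, using the same toy sizes as the earlier frame-classifier sketch; both the posterior and hidden-layer variants the slide mentions are shown.

```python
import torch
import torch.nn as nn

FEAT_DIM, NUM_STATES = 351, 1000           # assumed sizes, as before
dnn = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.Sigmoid(),
                    nn.Linear(512, 512), nn.Sigmoid(),
                    nn.Linear(512, NUM_STATES))

frames = torch.randn(100, FEAT_DIM)        # one utterance, 100 spliced frames
with torch.no_grad():
    # Option 1: state posteriors as the new features (log-compressed here).
    tandem_feats = dnn(frames).softmax(dim=-1).clamp_min(1e-8).log()
    # Option 2: activations of the last hidden layer instead.
    hidden_feats = dnn[:-1](frames)        # drop the final Linear

# These features now feed the original (e.g., GMM-HMM) recognition system.
print(tandem_feats.shape, hidden_feats.shape)  # (100, 1000), (100, 512)
```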

Page 21

How to use Deep Learning? Way 2: DNN-HMM hybrid

Page 22

Way 2: DNN-HMM Hybrid

$W^* = \arg\max_W P(W|X) = \arg\max_W P(X|W) P(W)$

$P(X|W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1})\, P(x_t|s_t)$

The DNN takes $x_t$ and outputs $P(s_t|x_t)$. Convert it into the emission probability:

$P(x_t|s_t) = \frac{P(x_t, s_t)}{P(s_t)} = \frac{P(s_t|x_t)\, P(x_t)}{P(s_t)}$

$P(s_t)$ is counted from the training data; $P(x_t)$ is the same for every state, so it can be dropped during decoding. A sketch of this conversion follows.
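A minimal numpy sketch of the posterior-to-likelihood conversion in the log domain; the DNN posteriors and state priors here are random stand-ins.

```python
import numpy as np

T, S = 6, 1000   # frames, states (assumed sizes)

# Stand-ins: per-frame DNN posteriors P(s|x_t) and state priors P(s).
posteriors = np.random.dirichlet(np.ones(S), size=T)  # (T, S), rows sum to 1
priors = np.random.dirichlet(np.ones(S))               # counted from alignments

# log P(x_t|s) = log P(s|x_t) - log P(s) + const; P(x_t) dropped (same for all s).
log_scaled_likelihood = np.log(posteriors) - np.log(priors)

# These scores replace the GMM emission log-probabilities in Viterbi decoding.
print(log_scaled_likelihood.shape)   # (6, 1000)
```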

Page 23

Way 2: DNN-HMM Hybrid

$P(X|W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t|s_{t-1})\, \frac{P(s_t|x_t)}{P(s_t)}$

$P(s_t|s_{t-1})$: from the original HMM; $P(s_t|x_t)$: from the DNN; $P(s_t)$: counted from the training data.

This assembled vehicle works ……. Denote the resulting score $P(X|W;\theta)$, where $\theta$ is the DNN parameters.

Page 24

Way 2: DNN-HMM Hybrid

• Sequential Training

Given training data $(X^1, \hat{W}^1), (X^2, \hat{W}^2), \cdots, (X^r, \hat{W}^r), \cdots$

Fine-tune the DNN parameters $\theta$ such that

$P(X^r|\hat{W}^r;\theta)\, P(\hat{W}^r)$ increases, and
$P(X^r|W;\theta)\, P(W)$ decreases ($W$ is any word sequence different from $\hat{W}^r$).

Page 25

How to use Deep Learning? Way 3: End-to-end

Page 26

Way 3: End-to-end - Character

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567v2, 2014.

• Input: acoustic features (spectrograms)
• Output: characters (and space) + null (~)
• No phoneme and no lexicon (no OOV problem)

Page 27

Way 3: End-to-end - Character

Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.

Example frame-level output: the characters of HIS FRIEND'S interleaved with null (~) symbols: ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Page 28

Way 3: End-to-end – Word?

Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition." Interspeech 2014.

• DNN output layer: the ~50k words in the lexicon (apple, bog, cat, ……)
• DNN input layer: size = 200 times the acoustic feature dimension, padding with zeros when the segment is shorter
• Other systems are used to get the word boundaries

Page 29

Why Deep Learning?

Page 30

Deeper is better?

• Word error rate (WER): 1 hidden layer vs. multiple layers (results table omitted). Deeper is better.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

Page 31

Deeper is better?

• Word error rate (WER), same comparison as above: for a fixed number of parameters, a deep model is clearly better than a shallow one.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

Page 32

What does DNN do?

• Speaker normalization is automatically done in the DNN (visualization: input acoustic features (MFCC) vs. 1st hidden layer).

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.

Page 33

What does DNN do?

• Speaker normalization is automatically done in the DNN (visualization: input acoustic features (MFCC) vs. 8th hidden layer).

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP 2012.

Page 34

What does DNN do?

• In conventional acoustic models, all the states are modeled independently, which is not an effective way to model the human voice.
• The sound of a vowel is controlled by only a few factors, e.g., tongue height (high/low) and position (front/back). (Vowel chart: http://www.ipachart.com/)

Page 35

What does DNN do?

• Projecting hidden-layer outputs to two dimensions, the vowels /a/, /i/, /u/, /e/, /o/ lay out along the same high/low and front/back axes as the vowel chart.
• The lower layers detect the manner of articulation, and all the states share the results from the same set of detectors, so the parameters are used effectively.

Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech 2014.

Page 36

Speaker Adaptation

Page 37

Speaker Adaptation

• Speaker adaptation: use different models to recognize the speech of different speakers
  • Collect the audio data of each speaker
  • Train a DNN model for each speaker
• Challenge: limited data for training
  • Not enough data for directly training a DNN model
  • Not even enough data for just fine-tuning a speaker-independent DNN model

Page 38

Categories of Methods

• Conservative training: re-train the whole DNN with some constraints
• Transformation methods: only train the parameters of one layer
• Speaker-aware training: do not really change the DNN parameters

(From top to bottom, the methods need less and less training data.)

Page 39

Conservative Training

• Train a speaker-independent DNN on the audio data of many speakers, and use it as the initialization for fine-tuning on a little data from the target speaker.
• Constraints: keep the adapted network's outputs close to the original outputs, or keep its parameters close to the original parameters. A sketch of the latter follows.
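A minimal PyTorch sketch of conservative training with an L2 penalty that keeps the adapted parameters close to the speaker-independent ones; the penalty weight and network sizes are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

FEAT_DIM, NUM_STATES, PENALTY = 351, 1000, 1e-2   # assumed sizes / weight

si_dnn = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.Sigmoid(),
                       nn.Linear(512, NUM_STATES))  # speaker-independent model
adapted = copy.deepcopy(si_dnn)                     # initialization

opt = torch.optim.SGD(adapted.parameters(), lr=0.01)
x = torch.randn(32, FEAT_DIM)               # a little target-speaker data
y = torch.randint(0, NUM_STATES, (32,))     # its state labels (from alignment)

for _ in range(10):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(adapted(x), y)
    # Conservative term: penalize drifting away from the SI parameters.
    for p, p0 in zip(adapted.parameters(), si_dnn.parameters()):
        loss = loss + PENALTY * (p - p0.detach()).pow(2).sum()
    loss.backward()
    opt.step()
```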

Page 40

Transformation methods

• Add an extra layer Wa between layer i and layer i+1 of the speaker-independent DNN.
• Fix all the other parameters and train only Wa on a little data from the target speaker, as in the sketch below.
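A minimal PyTorch sketch of this transformation method: a new linear layer, initialized to the identity, is inserted into the frozen SI network and is the only thing trained. Its placement after the first hidden layer and the sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

FEAT_DIM, HID, NUM_STATES = 351, 512, 1000   # assumed sizes

si_dnn = nn.Sequential(nn.Linear(FEAT_DIM, HID), nn.Sigmoid(),
                       nn.Linear(HID, NUM_STATES))
for p in si_dnn.parameters():
    p.requires_grad = False                  # freeze all SI parameters

wa = nn.Linear(HID, HID)                     # the extra speaker layer Wa
nn.init.eye_(wa.weight)                      # start as the identity map
nn.init.zeros_(wa.bias)

# Insert Wa between the first hidden layer and the output layer.
adapted = nn.Sequential(si_dnn[0], si_dnn[1], wa, si_dnn[2])

opt = torch.optim.SGD(wa.parameters(), lr=0.01)  # only Wa is trained
x = torch.randn(32, FEAT_DIM)                    # a little adaptation data
y = torch.randint(0, NUM_STATES, (32,))
loss = nn.functional.cross_entropy(adapted(x), y)
loss.backward()
opt.step()
```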

Page 41

Transformation methods

• The extra layer can be added between the input and the first layer.
• With splicing, the same Wa is applied to every frame in the splice (the extra layers share parameters).
• With more adaptation data, a larger Wa can be trained; with less data, use a smaller Wa.

Page 42

Transformation methods

• SVD bottleneck adaptation: with the low-rank factorization of the output layer (linear layer V of size K, then U), insert the extra layer Wa at the K-dimensional bottleneck.
• K is usually small, so Wa (K × K) has few parameters to adapt.

Page 43

Speaker-aware Training

• From lots of (possibly mismatched) data, extract a fixed-length, low-dimensional vector for each speaker (data of speaker 1, speaker 2, speaker 3, ……) as the speaker information.
• Text transcription is not needed for extracting the vectors.
• The same idea gives noise-aware or device-aware training, ……

Page 44

Speaker-aware Training

• Acoustic features are appended with speaker-information features, both for training and for testing.
• All speakers use the same DNN model; different speakers are simply augmented with different features. A sketch follows.
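A minimal PyTorch sketch of speaker-aware training: each frame is concatenated with its speaker's fixed vector before entering the DNN; the 100-dim speaker vector (e.g., an i-vector) is an assumption for illustration.

```python
import torch
import torch.nn as nn

FEAT_DIM, SPK_DIM, NUM_STATES = 351, 100, 1000   # assumed sizes

dnn = nn.Sequential(nn.Linear(FEAT_DIM + SPK_DIM, 512), nn.Sigmoid(),
                    nn.Linear(512, NUM_STATES))

frames = torch.randn(100, FEAT_DIM)   # one utterance from one speaker
spk_vec = torch.randn(SPK_DIM)        # that speaker's fixed vector

# Append the same speaker vector to every frame of the utterance.
augmented = torch.cat([frames, spk_vec.expand(len(frames), -1)], dim=1)
state_logits = dnn(augmented)         # the same DNN serves all speakers
print(state_logits.shape)             # torch.Size([100, 1000])
```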

Page 45

Multi-task Learning

Page 46

Multitask Learning

• The multi-layer structure makes DNN suitable for multitask learning. Two typical configurations:
  • Tasks A and B share the lower layers on a common input feature, then branch into task-specific output layers.
  • Tasks A and B take separate input features and lower layers, then share the upper layers.

A sketch of the shared-lower-layers variant follows.
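A minimal PyTorch sketch of the shared-lower-layers configuration; the sizes and the choice of tasks (states + phonemes, as on the "different units" slide further below) are assumptions for illustration.

```python
import torch
import torch.nn as nn

FEAT_DIM, NUM_STATES, NUM_PHONES = 351, 1000, 40   # assumed sizes

shared = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.Sigmoid(),
                       nn.Linear(512, 512), nn.Sigmoid())
head_a = nn.Linear(512, NUM_STATES)   # task A: tri-phone states
head_b = nn.Linear(512, NUM_PHONES)   # task B: phonemes

x = torch.randn(32, FEAT_DIM)
y_state = torch.randint(0, NUM_STATES, (32,))
y_phone = torch.randint(0, NUM_PHONES, (32,))

h = shared(x)                          # both tasks share these layers
loss = (nn.functional.cross_entropy(head_a(h), y_state)
        + nn.functional.cross_entropy(head_b(h), y_phone))
loss.backward()                        # gradients flow into the shared layers
```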

Page 47

Multitask Learning - Multilingual

• One network on acoustic features with shared hidden layers, branching into separate output layers for the states of French, German, Spanish, Italian, and Mandarin.
• Human languages share some common characteristics.

Page 48

Multitask Learning - Multilingual

Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." ICASSP 2013.

[Figure: character error rate vs. hours of training data for Mandarin (1 to 1000, log scale), comparing "Mandarin only" against "With European Language".]

Page 49

Multitask Learning - Different units

• The two tasks predict different units from the same acoustic features:
  • A = state, B = phoneme
  • A = state, B = gender
  • A = state, B = grapheme (character)

Page 50

Deep Learning for Acoustic Modeling

New acoustic features

Page 51

MFCC

Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN

Page 52

Filter-bank Output

Waveform → DFT → spectrogram → filter bank → log → input of DNN

Kind of standard now. A sketch of computing these features follows.
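A minimal sketch of computing log filter-bank features with torchaudio (the tool choice is an assumption, the lecture does not prescribe one); the 25 ms window, 10 ms hop, and 40 mel bins are typical but assumed values.

```python
import torch
import torchaudio

# Assumed front-end settings: 25 ms windows, 10 ms hop, 40 mel filters.
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40)

waveform = torch.randn(1, 16000)   # stand-in for 1 s of 16 kHz audio
fbank = melspec(waveform).clamp_min(1e-10).log()  # log filter-bank output
print(fbank.shape)                 # (1, 40, frames) -> input of DNN
```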

Page 53

Spectrogram

Waveform → DFT → spectrogram → input of DNN

Common today; 5% relative improvement over filter-bank output.

Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU 2013.

Page 54

Spectrogram

Page 55

Waveform?

Waveform → input of DNN

People have tried, but it is not better than the spectrogram yet.

Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH 2014.

(So we still need to take Signals & Systems …… If this succeeds, no more Signals & Systems.)

Page 56

Waveform?

Page 57

Convolutional Neural Network (CNN)

Page 58

CNN

• Speech can be treated as images: the spectrogram, with time on one axis and frequency on the other.

Page 59

CNN

• Replace the DNN with a CNN: the CNN takes the spectrogram as an image and outputs the probabilities of states. A sketch follows.
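A minimal PyTorch sketch of a CNN acoustic model over spectrogram patches; the patch size (40 mel bins × 11 frames), channel counts, and pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn

NUM_STATES = 1000   # assumed number of states

cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # pool over frequency and time
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 10 * 2, NUM_STATES),    # 40x11 input -> 10x2 after pooling
)

patch = torch.randn(8, 1, 40, 11)          # batch of spectrogram patches
print(cnn(patch).shape)                    # torch.Size([8, 1000])
```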

Page 60

CNN

• Max pooling over pairs of filter outputs (a1, a2 → max; b1, b2 → max); pooling across feature maps in this way corresponds to the maxout activation. A sketch follows.
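A minimal PyTorch sketch of a maxout unit: a linear layer produces k candidate activations per output unit and only the maximum is kept. k = 2 matches the pairwise max in the slide; everything else is assumed.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Linear layer with k pieces per unit; output = max over the pieces."""
    def __init__(self, d_in: int, d_out: int, k: int = 2):
        super().__init__()
        self.k, self.d_out = k, d_out
        self.linear = nn.Linear(d_in, d_out * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)                          # (batch, d_out * k)
        return z.view(-1, self.d_out, self.k).max(dim=-1).values

layer = Maxout(351, 512, k=2)                       # e.g., a1/a2 -> max(a1, a2)
print(layer(torch.randn(8, 351)).shape)             # torch.Size([8, 512])
```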

Page 61

CNN

Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech 2014.

Page 62

Applications in Acoustic Signal Processing

Page 63

DNN for Speech Enhancement

• A DNN maps noisy speech to clean speech, for mobile communication or speech recognition.

Demo: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html

Page 64

DNN for Voice Conversion

• A DNN maps a male voice to a female voice.

Demo: http://research.microsoft.com/en-us/projects/vcnn/default.aspx

Page 65

Concluding Remarks

Page 66

Concluding Remarks

• Conventional Speech Recognition

• How to use Deep Learning in acoustic modeling?

• Why Deep Learning?

• Speaker Adaptation

• Multi-task Deep Learning

• New acoustic features

• Convolutional Neural Network (CNN)

• Applications in Acoustic Signal Processing

Page 67

Thank you for your attention!

Page 68

More Research Related to Speech

Speech recognition (e.g., audio → 你好 "Hello") is the core technique behind:

• Speech summarization: lecture recordings → summary
• Spoken content retrieval: find the lectures related to "deep learning"
• Computer Assisted Language Learning (e.g., "Hi", "Hello")
• Dialogue: "I would like to leave Taipei on November 2nd"
• Information extraction