Fundamentals of Speech Signal Processing
Page 1: Fundamentals of Speech Signal Processing. 1.0 Speech Signals

Fundamentals of Speech Signal Processing

Page 2

1.0 Speech Signals

Page 3

Waveform plots of typical vowel sounds - Voiced (濁音)

(Figure: vowel waveforms for tone 1, tone 2, and tone 4, plotted against time t)

Page 4

Speech Production and Source Model

• Human vocal mechanism • Speech Source Model

(Figure: the excitation u(t) passes through the Vocal tract to produce the speech signal x(t))

Page 5

Voiced and Unvoiced Speech

(Figure: excitation u(t) and output x(t); the voiced case shows periodic waveforms with a visible pitch period, the unvoiced case noise-like waveforms)

Page 6

Waveform plots of typical consonant sounds

Unvoiced (清音) Voiced (濁音)

Page 7

Waveform plot of a sentence

Page 8

Frequency domain spectra of speech signals

Voiced Unvoiced

Page 9

Frequency Domain

Voiced 

formant frequencies

Page 10

Unvoiced

formant frequencies

Frequency Domain

Page 11

Spectrogram

Page 12

Spectrogram

Page 13

Formant Frequencies

Page 14

Formant frequency contours

He will allow a rare lie.

Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang

Page 15

2.0 Speech Signal Processing

Page 16

Speech Signal Processing

• Major Application Areas

1. Speech Coding: Digitization and Compression

Considerations: 1) bit rate (bps) 2) recovered quality 3) computation complexity/feasibility

2. Voice-based Network Access — User Interface, Content Analysis, User-Content Interaction

(Block diagram: x(t) → LPF → sampling → x[n] → Processing Algorithms → xk → 110101… → Storage/transmission → Inverse Processing → x̂[n])

• Speech Signals

– Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.

– Double Levels of Information: Acoustic Signal Level / Symbolic or Linguistic Level

– Processing and Interaction of the Double-level Information

Page 17

Sampling of Signals

(Figure: continuous waveform x(t) over time t, and its sampled version x[n] over sample index n)
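The sampling step can be sketched in a few lines of NumPy. The 8 kHz rate and the 200 Hz test tone below are illustrative choices (8 kHz is typical for telephone-band speech), not values from the slides:

```python
import numpy as np

fs = 8000                    # sampling rate in Hz; common for telephone-band speech
f0 = 200                     # a 200 Hz tone standing in for the continuous signal x(t)
n = np.arange(80)            # sample indices covering 10 ms of signal
x = np.sin(2 * np.pi * f0 * n / fs)   # x[n] = x(t) evaluated at t = n / fs

print(len(x))                # 80
```

One period of the 200 Hz tone spans fs/f0 = 40 samples, so x[0] and x[40] land on the same phase of the waveform.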

Page 18

Double Levels of Information

字 (Character)

詞 (Word)

句 (Sentence)

Example: 人用電腦 ("people use computers") — characters 人, 用, 電, 腦; words 人, 用, 電腦

Page 19

Speech Signal Processing – Processing of Double-Level Information

• Speech Signal • Sampling • Processing

• Linguistic Structure

• Linguistic Knowledge: Lexicon, Grammar

Example: the character string 今 天 的 天 氣 非 常 好 is grouped into the words 今天 的 天氣 非常 好 ("the weather today is very good")

• Algorithm

• Chips or Computers

Page 20

Voice-based Network Access

(Diagram: User Interface, Content Analysis, and User-Content Interaction, all connected to the Internet)

User Interface — when keyboards/mice are inadequate

Content Analysis — helps in browsing/retrieval of multimedia content

User-Content Interaction — all text-based interaction can be accomplished by spoken language

Page 21

User Interface — Wireless Communications Technologies are Creating a Whole Variety of User Terminals at Any Time, from Anywhere

Smart phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…

Small in Size, Light in Weight, Ubiquitous, Invisible… Post-PC Era

Keyboard/Mouse Most Convenient for PCs, not Convenient any longer — human fingers never shrink, and the application environment has changed

Service Requirements Growing Exponentially

Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere, and to the point in one utterance

Speech Processing is the only less mature part in the Technology Chain

Internet Networks

Text Content

Multimedia Content

Page 22

Content Analysis—Multimedia Technologies are Creating a New World of Multimedia Content

• Most Attractive Form of the Network Content will be in Multimedia, which usually Includes Speech Information (but Probably not Text)

• Multimedia Content Difficult to be Summarized and Shown on the Screen, thus Difficult to Browse

• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval

• Multimedia Content Analysis based on Speech Information

Future Integrated Networks

Real-time Information – weather, traffic – flight schedule – stock price – sports scores

Special Services – Google – Facebook – YouTube – Amazon

Knowledge Archives – digital libraries – virtual museums

Intelligent Working Environment – e-mail processors – intelligent agents – teleconferencing – distance learning – electronic commerce

Private Services – personal notebook – business databases – home appliances – network entertainments

Page 23

User-Content Interaction — Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing

(Diagram: voice input/output between the user and the Internet; voice and text information link the user to the Multimedia Content)

• Network Access is Primarily Text-based today, but almost all Roles of Text can be Accomplished by Speech

• User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues

• Hand-held Devices with Multimedia Functionalities are Commonly used Today

• Using Speech Instructions to Access Multimedia Content whose Key Concepts are Specified by Speech Information

Multimedia Content Analysis

Text Information Retrieval

Text Content

Voice-based Information Retrieval

Text-to-Speech Synthesis

Spoken and Multi-modal Dialogue

Page 24

3.0 Speech Coding

Page 25

Waveform-based Approaches

• Pulse-Coded Modulation (PCM)

– binary representation for each sample x[n] by quantization

• Differential PCM (DPCM)

– encoding the differences

d[n] = x[n] − x[n−1]

d[n] = x[n] − Σ_{k=1}^{P} a_k x[n−k]

• Adaptive DPCM (ADPCM)

– with adaptive algorithms

Ref: Haykin, “Communication Systems”, 4th Ed., Sections 3.7, 3.13, 3.14, 3.15
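The first-order DPCM idea above can be sketched in plain Python. Note this toy version transmits d[n] exactly; a real DPCM coder quantizes d[n] before transmission, which is where the bit-rate saving and the coding error come from:

```python
def dpcm_encode(x):
    # transmit the differences d[n] = x[n] - x[n-1], taking x[-1] = 0
    d, prev = [], 0
    for s in x:
        d.append(s - prev)
        prev = s
    return d

def dpcm_decode(d):
    # reconstruct x[n] = x[n-1] + d[n]; lossless here because d[n] is unquantized
    x, prev = [], 0
    for e in d:
        prev += e
        x.append(prev)
    return x

samples = [3, 5, 4, 4, 6]
diffs = dpcm_encode(samples)     # [3, 2, -1, 0, 2] -- smaller magnitudes than the raw samples
print(dpcm_decode(diffs))        # [3, 5, 4, 4, 6]
```

Because adjacent speech samples are strongly correlated, the differences have smaller magnitude than the samples themselves and can be coded with fewer bits.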

Page 26

Speech Source Model and Source Coding

•Speech Source Model

– digitization and transmission of the parameters will be adequate

– at the receiver the parameters can reproduce x[n] with the model

– much fewer parameters with much slower variation in time lead to many fewer bits required

– the key to low bit rate speech coding

(Diagram: parameters drive an Excitation Generator producing u[n], which excites a Vocal Tract Model G(ω), G(z), g[n] to produce x[n])

x[n] = u[n] ∗ g[n],  X(ω) = U(ω)G(ω),  X(z) = U(z)G(z)

Page 27

Speech Source Model

(Figure: speech waveform x(t) over time t, and its slowly varying model parameters a[n] over frame index n)

Page 28

Speech Source Model and Source Coding

Analysis and Synthesis

– High computation requirements are the price for low bit rate

Page 29

Simplified Speech Source Model

– Excitation parameters: v/u (voiced/unvoiced), N (pitch period for voiced), G (signal gain), excitation signal u[n]

(Diagram: a random sequence generator (unvoiced) or a periodic pulse train generator (voiced, period N) is selected by the v/u switch and scaled by the gain G to produce the excitation u[n]; u[n] drives the Vocal Tract Model G(z), G(ω), g[n] to produce x[n])

G(z) = 1 / (1 − Σ_{k=1}^{P} a_k z^{−k})

– Vocal Tract parameters {ak}: LPC coefficients, describing the formant structure of speech signals

– A good approximation, though not precise enough

Reference: 3.3.1-3.3.6 of Rabiner and Juang, or 6.3 of Huang
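The all-pole filter G(z) = 1 / (1 − Σ a_k z^{−k}) corresponds to the difference equation x[n] = u[n] + Σ_{k=1}^{P} a_k x[n−k], which a short plain-Python loop can apply directly (the single coefficient a1 = 0.5 below is an arbitrary illustration, not a value from the slides):

```python
def lpc_synthesize(u, a):
    """All-pole synthesis: x[n] = u[n] + sum_{k=1}^{P} a[k-1] * x[n-k],
    i.e. filtering the excitation u[n] through G(z) = 1 / (1 - sum_k a_k z^-k)."""
    x = []
    for n, un in enumerate(u):
        xn = un
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                xn += a[k - 1] * x[n - k]
        x.append(xn)
    return x

# impulse excitation through a one-pole vocal tract model with a1 = 0.5
print(lpc_synthesize([1, 0, 0, 0], [0.5]))   # [1, 0.5, 0.25, 0.125]
```

An impulse excitation exposes g[n] itself: each output sample is half the previous one, the impulse response of the one-pole model.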

Page 30

LPC Vocoder (Voice Coder)

N by pitch detection, v/u by voicing detection

{ak} can be non-uniformly or vector quantized to reduce bit rate further

Ref : 3.3 ( 3.3.1 up to 3.3.9 ) of Rabiner and Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993

Page 31

Multipulse LPC

Poor modeling of u[n] is the main source of quality degradation in the LPC vocoder, so u[n] is replaced by a sequence of pulses

u[n] = Σ_k b_k δ[n − n_k]

– roughly 8 pulses per pitch period

– u[n] close to periodic for voiced

– u[n] close to random for unvoiced

Estimating (b_k, n_k) is a difficult problem
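Constructing the pulse-sequence excitation is straightforward once the (n_k, b_k) pairs are known; the positions and amplitudes below are invented for illustration:

```python
def multipulse_excitation(length, pulses):
    # u[n] = sum_k b_k * delta[n - n_k]; pulses is a list of (n_k, b_k) pairs
    u = [0.0] * length
    for n_k, b_k in pulses:
        u[n_k] += b_k
    return u

u = multipulse_excitation(8, [(1, 0.9), (5, -0.4)])
print(u)   # [0.0, 0.9, 0.0, 0.0, 0.0, -0.4, 0.0, 0.0]
```

The hard part, as the slide notes, is not building u[n] but choosing the (b_k, n_k) that best reproduce the original speech.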

Page 32

Multipulse LPC

Estimating (b_k, n_k) is a difficult problem – analysis by synthesis

– large amount of computation is the price paid for better speech quality

Page 33

Multipulse LPC

Perceptual Weighting

W(z) = (1 − Σ_{k=1}^{P} a_k z^{−k}) / (1 − Σ_{k=1}^{P} a_k c^k z^{−k}),  0 < c < 1

for perceptual sensitivity

W(z) = 1, if c = 1

W(z) = 1 − Σ_{k=1}^{P} a_k z^{−k}, if c = 0

practically c ≈ 0.8

Error Evaluation

E = ∫_0^{2π} |X(ω) − X̂(ω)|² W(ω) dω
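The two limiting cases of W(z) can be checked numerically. The sketch below builds the numerator and denominator coefficient lists of W(z) (constant term first, powers of z^{−1}); the {a_k} values are arbitrary illustrations:

```python
def weighting_coeffs(a, c):
    # W(z) = (1 - sum_k a_k z^-k) / (1 - sum_k a_k c^k z^-k)
    # returned as two polynomial coefficient lists in z^-1, constant term first
    num = [1.0] + [-ak for ak in a]
    den = [1.0] + [-ak * c ** (k + 1) for k, ak in enumerate(a)]
    return num, den

a = [0.8, -0.3]
num1, den1 = weighting_coeffs(a, 1.0)   # c = 1: numerator == denominator, so W(z) = 1
num0, den0 = weighting_coeffs(a, 0.0)   # c = 0: denominator becomes 1, so W(z) = A(z)
print(num1 == den1)                     # True
```

Intermediate values of c (practically around 0.8) de-emphasize the error near the formant peaks, where the ear is less sensitive to coding noise.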

Page 34

Multipulse LPC

Error Minimization and Pulse search

u[n] = Σ_k b_k δ[n − n_k]

x̂[n] = Σ_k b_k g[n − n_k]

E = E(b₁, n₁, b₂, n₂, ……)

– sub-optimal solution: finding 1 pulse at a time

Page 35

Code-Excited Linear Prediction (CELP)

Use of VQ to Construct a Codebook of Excitation Sequences

– a sequence consists of roughly 40 samples

– a codebook of 512 ~ 1024 patterns is constructed with VQ

– roughly 512 ~ 1024 excitation patterns are perceptually adequate

Excitation Search –analysis by synthesis

– 9 ~ 10 bits are needed for the excitation of 40 samples, while the {ak} parameters in G(z) are also vector quantized
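The analysis-by-synthesis excitation search can be sketched as an exhaustive loop over a tiny codebook. Everything below is a toy stand-in: the codebook has 3 entries rather than the 512 ~ 1024 mentioned above, and the LPC filter has a single coefficient:

```python
def synthesize(u, a):
    # all-pole LPC synthesis: x[n] = u[n] + sum_k a[k-1] * x[n-k]
    x = []
    for n, un in enumerate(u):
        x.append(un + sum(a[k] * x[n - 1 - k]
                          for k in range(len(a)) if n - 1 - k >= 0))
    return x

def celp_search(target, codebook, a):
    # analysis-by-synthesis: synthesize every candidate excitation through the
    # LPC filter and return the index minimizing the squared error to the target
    errors = []
    for u in codebook:
        x_hat = synthesize(u, a)
        errors.append(sum((t - xh) ** 2 for t, xh in zip(target, x_hat)))
    return errors.index(min(errors))

codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [0.5, 0.5, 0, 0]]
a = [0.5]
target = synthesize([0, 1, 0, 0], a)      # pretend this is the input speech frame
print(celp_search(target, codebook, a))   # 1
```

Since the target was synthesized from codebook entry 1, that entry gives zero error and wins the search; only its 9 ~ 10-bit index would be transmitted.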

Page 36

Code-Excited Linear Prediction (CELP)

Receiver

– {ak} codewords can be transmitted less frequently than excitation codewords

Ref : Gold and Morgan, “Speech and Audio Signal Processing”, John Wiley & Sons, 2000, Chap 33

Page 37

4.0 Speech Recognition and Voice-based Network Access

Page 38

Feature Extraction

(Diagram: the unknown speech signal x(t) goes through Feature Extraction to the feature vector sequence X; Pattern Matching against Reference Patterns and Decision Making then produce the output word W; the Reference Patterns are obtained by Feature Extraction on training speech y(t), giving Y)

Speech Recognition as a pattern recognition problem

Page 39

• A Simplified Block Diagram

• Example Input Sentence: this is speech

• Acoustic Models (聲學模型): (th-ih-s-ih-z-s-p-iy-ch)

• Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech

• Language Model (語言模型): (this) – (is) – (speech)

P(this) P(is | this) P(speech | this is)

P(wi|wi-1): bi-gram language model

P(wi|wi-1,wi-2): tri-gram language model, etc.

Basic Approach for Large Vocabulary Speech Recognition

Front-end Signal Processing

Acoustic Models, Lexicon

Feature Vectors

Linguistic Decoding and Search Algorithm

Output Sentence

Speech Corpora

Acoustic Model Training

Language Model Construction

Text Corpora

Language Model

Input Speech

Page 40


2.0 Fundamentals of Speech Recognition

Hidden Markov Models (HMM)

˙ Formulation

Ot = [x1, x2, … xD]^T : feature vector for the frame at time t

qt = 1, 2, 3, … N : state number for feature vector Ot

A = [aij], aij = Prob[qt = j | qt−1 = i] : state transition probabilities

B = [bj(o), j = 1, 2, … N] : observation probabilities

bj(o) = Σ_{k=1}^{M} cjk bjk(o),  Σ_{k=1}^{M} cjk = 1

bjk(o): multi-variate Gaussian distribution for the k-th mixture of the j-th state; M: total number of mixtures

π = [π1, π2, … πN] : initial probabilities, πi = Prob[q1 = i]

HMM: λ = (A, B, π)

(Figure: a left-to-right HMM with self-loops a11, a22, transitions a12, a13, a23, a24, and observation distributions b1(o), b2(o), b3(o) on the states; the observation sequence o1 o2 o3 o4 o5 o6 o7 o8… is aligned with the state sequence q1 q2 q3 q4 q5 q6 q7 q8…)
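For a 1-dimensional observation, the mixture observation probability bj(o) = Σ_k cjk bjk(o) can be written out directly in plain Python. The weights, means, and variances below are illustrative, not values from the slides:

```python
import math

def gaussian(o, mu, var):
    # univariate Gaussian density N(o; mu, var)
    return math.exp(-((o - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def b_j(o, c, mu, var):
    # b_j(o) = sum_{k=1}^{M} c_jk * b_jk(o), with the mixture weights summing to 1
    assert abs(sum(c) - 1.0) < 1e-9
    return sum(ck * gaussian(o, mk, vk) for ck, mk, vk in zip(c, mu, var))

# a two-mixture state: weights 0.6/0.4, means 0 and 3, unit variances
p = b_j(0.0, [0.6, 0.4], [0.0, 3.0], [1.0, 1.0])
print(round(p, 4))   # 0.2411
```

In a real recognizer each bjk(o) is a multi-variate Gaussian over the D-dimensional feature vector, but the weighted-sum structure is identical.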

Page 41

Observation Sequences

     

Page 42

1-dim Gaussian Mixtures

State Transition Probabilities

Page 43


Hidden Markov Models (HMM)

˙ Double Layers of Stochastic Processes

- hidden states with random transitions for time warping

- random output given state for random acoustic characteristics

˙ Three Basic Problems

(1) Evaluation Problem:

Given O = (o1, o2, … ot … oT) and λ = (A, B, π),

find Prob[O | λ]

(2) Decoding Problem:

Given O = (o1, o2, … ot … oT) and λ = (A, B, π),

find the best state sequence q = (q1, q2, … qt, … qT)

(3) Learning Problem:

Given O, find the best values for the parameters in λ

such that Prob[O | λ] = max
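The evaluation problem has the classic forward-algorithm solution. A compact sketch for discrete observations follows; the 2-state toy model parameters are invented for illustration:

```python
def forward(A, B, pi, obs):
    # computes Prob[O | lambda] by the forward algorithm
    # A[i][j]: transition prob, B[j][o]: observation prob, pi[i]: initial prob
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]           # initialization
    for o in obs[1:]:                                          # induction over time
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                          # termination

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(forward(A, B, pi, [0, 1]))   # ~0.1915
```

Summing over all state sequences directly would cost O(N^T); the forward recursion brings this down to O(N²T).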

Page 44

Simplified HMM

RGBGGBBGRRR……

Page 45

Peripheral Processing for Human Perception (P.34 of 7.0 )

Page 46


Feature Extraction (Front-end Signal Processing)

˙ Mel Frequency Cepstral Coefficients (MFCC)

- Mel-scale Filter Bank: triangular filters in frequency, overlapped; uniformly spaced below 1 kHz, logarithmically spaced above 1 kHz

˙ Delta Coefficients

- 1st/2nd order differences

(Pipeline: windowed speech samples → Discrete Fourier Transform → Mel-scale Filter Bank → log(|·|²) → Inverse Discrete Fourier Transform → MFCC)
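A condensed NumPy sketch of the pipeline above follows. Several choices are illustrative rather than taken from the slides: the 8 kHz rate, 10 filters, 5 cepstra, the 2595·log10(1 + f/700) mel formula, and spacing all filters uniformly on the mel scale (a common variant of the uniform-below-1-kHz / logarithmic-above arrangement). The final inverse transform is implemented as the DCT it reduces to for a real, even log-spectrum:

```python
import numpy as np

def mel(f):
    # Hz -> mel scale (a common approximation)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=8000, n_filt=10, n_ceps=5):
    # 1) magnitude-squared DFT of one Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # 2) overlapping triangular filters, uniformly spaced on the mel scale
    edges = mel_inv(np.linspace(0.0, mel(fs / 2.0), n_filt + 2))
    energies = np.empty(n_filt)
    for i in range(n_filt):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        falling = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[i] = np.sum(spec * np.minimum(rising, falling))
    # 3) log compression, then 4) DCT of the log energies -> cepstral coefficients
    log_e = np.log(energies + 1e-10)
    k = np.arange(n_filt)
    return np.array([np.sum(log_e * np.cos(np.pi * q * (k + 0.5) / n_filt))
                     for q in range(n_ceps)])

frame = np.sin(2 * np.pi * 500 * np.arange(200) / 8000)  # a 500 Hz test tone
ceps = mfcc(frame)
print(ceps.shape)   # (5,)
```

Delta coefficients would then be first/second-order differences of these vectors across successive frames.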

Page 47

Mel-scale Filter Bank

Page 48


Language Modeling: N-gram

W = (w1, w2, w3,…,wi,…wR) a word sequence

- Evaluation of P(W)

P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, … wi−1)

- Assumption:

P(wi | w1, w2, … wi−1) = P(wi | wi−N+1, wi−N+2, … wi−1)

Occurrence of a word depends on the previous N − 1 words only

N-gram language models

N = 1: unigram P(wi)

N = 2: bigram P(wi | wi−1)

N = 3: tri-gram P(wi | wi−2, wi−1)

N = 4: four-gram P(wi | wi−3, wi−2, wi−1)

probabilities estimated from a training text database

example: tri-gram model

P(W) = P(w1) P(w2 | w1) Π_{i=3}^{R} P(wi | wi−2, wi−1)
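Under the bigram approximation, evaluating P(W) amounts to one table lookup per word. A sketch with made-up probabilities (the toy tables below are invented for illustration):

```python
def sentence_prob(words, unigram, bigram):
    # bigram model: P(W) = P(w1) * prod_{i=2}^{R} P(w_i | w_{i-1})
    p = unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram[(prev, cur)]
    return p

# toy probabilities, invented for illustration
unigram = {"this": 0.1}
bigram = {("this", "is"): 0.2, ("is", "speech"): 0.05}
print(sentence_prob(["this", "is", "speech"], unigram, bigram))   # ~0.001
```

A tri-gram model would key the table on the previous two words instead of one.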

Page 49

N-gram

W1 W2 W3 W4 W5 W6 …… WR

P(W) = P(w1) ∏_{i=2}^{R} P(wi | w1, w2, ⋯, wi−1)

tri-gram:

P(W) = P(w1) P(w2 | w1) ∏_{i=3}^{R} P(wi | wi−2, wi−1)

Page 50


Language Modeling - Evaluation of N-gram model parameters

unigram

P(wi) = N(wi) / Σ_{j=1}^{V} N(wj)

wi : a word in the vocabulary

V : total number of different words in the vocabulary

N( ) : number of counts in the training text database

bigram

P(wj | wk) = N(<wk, wj>) / N(wk)

<wk, wj> : a word pair

trigram

P(wj | wk, wm) = N(<wk, wm, wj>) / N(<wk, wm>)

smoothing: estimation of probabilities of rare events by statistical approaches

Page 51

… this …        50000

… this is …       500

… this is a …       5

Prob[is | this] = 500 / 50000 = 0.01

Prob[a | this is] = 5 / 500 = 0.01
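The two conditional probabilities on this slide follow directly from the counts; spelled out in Python:

```python
# counts observed in the training text, as given on the slide
counts = {
    ("this",): 50000,
    ("this", "is"): 500,
    ("this", "is", "a"): 5,
}

def cond_prob(history, word):
    # P(word | history) = N(history followed by word) / N(history)
    return counts[history + (word,)] / counts[history]

print(cond_prob(("this",), "is"))        # 500 / 50000 = 0.01
print(cond_prob(("this", "is"), "a"))    # 5 / 500 = 0.01
```

For N-grams that never occur in the training text this ratio is zero, which is why the smoothing mentioned above is needed.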

Page 52


Large Vocabulary Continuous Speech Recognition

W = (w1, w2, … wR) : a word sequence

X = (x1, x2, … xT) : feature vectors for a speech utterance

W* = Arg Max_W Prob(W | X)   (MAP principle)

Prob(W | X) = Prob(X | W) P(W) / P(X) = max

⇒ Prob(X | W) P(W) = max

Prob(X | W) by HMM, P(W) by language model

˙ A Search Process Based on Three Knowledge Sources

- Acoustic Models : HMMs for basic voice units (e.g. phonemes)

- Lexicon : a database of all possible words in the vocabulary, each word including its pronunciation in terms of component basic voice units

- Language Models : based on words in the lexicon

(Diagram: X enters the Search block, which outputs W*, drawing on the Acoustic Models, the Lexicon, and the Language Models)
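Since P(X) is the same for every candidate word sequence, the MAP search only compares Prob(X|W)·P(W). A minimal sketch over an explicit candidate list (the sequences and scores below are invented for illustration; a real decoder searches an enormous implicit space rather than a short list):

```python
def map_decode(candidates):
    # candidates: (word_sequence, acoustic_likelihood Prob(X|W), language_model_prob P(W))
    # P(X) is identical for every W, so it cancels out of the argmax
    return max(candidates, key=lambda c: c[1] * c[2])[0]

candidates = [
    ("this is speech", 1e-5, 1e-2),   # acoustic score x LM score = 1e-7
    ("this is peach",  2e-5, 1e-3),   # = 2e-8
]
print(map_decode(candidates))   # this is speech
```

Here the language model overrules the slightly better acoustic match, which is exactly the interaction the MAP formulation is designed to capture.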

Page 53

Text-to-speech Synthesis

(Pipeline: Input Text → Text Analysis and Letter-to-sound Conversion → Prosody Generation → Signal Processing and Concatenation → Output Speech Signal, supported by a Lexicon and Rules, a Prosodic Model, and a Voice Unit Database)

• Transforming any input text into corresponding speech signals

• E-mail/Web page reading

• Prosodic modeling

• Basic voice units/rule-based, non-uniform units/corpus-based, model-based

Page 54

假如今天胡適先生還活著,又會怎樣呢? ("If Mr. Hu Shih were still alive today, what would it be like?")

text analysis

假如 | 今天 | 胡適 | 先生 | 還 | 活 | 著 | , | 又 | 會 | 怎樣 | 呢 | ?
Cbb, Nd, Nb, Na, Dfa, VH, Di, , D, D, D, T, ?

假如 | 今天 胡適 | 先生 還 | 活 | 著 | , 又 | 會 怎樣 | 呢 | ?
Cbb, Nd, Nb, Na, Dfa, VH, Di, , D, D, D, T, ?

break indices: 5 1 3 1 2 1 1 4 1 2 1 5

prosody generation

Energy

Pause

Intonation

Tone

Duration

speech synthesis

· Three Major Steps: text analysis, prosody generation, speech synthesis

Text-to-Speech Synthesis

Page 55

Automatic Prosodic Analysis for an Arbitrary Text Sentence

(Lattice: from start to end over the tag sequence Cbb Nd Nb Na Dfa VH Di, candidate minor-phrase groupings include Cbb+Nd, Cbb+Nd+Nb, Nd+Nb, Nb+Na, Na+Dfa, Dfa+VH, Dfa+VH+Di, VH+Di)

· Predict B1, B2, B3 (B4, B5 determined by punctuation marks)

· minor phrase patterns: (Cbb+Nd, Nb+Na, Dfa+VH+Di, D+D, D+T, etc.)

假如 | 今天 胡適 | 先生 還 | 活 | 著 | , 又 | 會 怎樣 | 呢 | ?
Cbb, Nd, Nb, Na, Dfa, VH, Di, , D, D, D, T, ?

minor phrases → prosodic groups → major phrases → breath groups

words

break indices: 5 1 3 1 2 1 1 4 1 2 1 5

Page 56

Speech Understanding

• Understanding Speaker’s Intention rather than Transcribing into Word Strings

• Limited Domains/Finite Tasks

acoustic models

phrase lexicon

Syllable Recognition

Key Phrase Matching

(Data flow: input utterance → syllable lattice → phrase graph → concept graph → concept set)

phrase/concept language model

Semantic Decoding

understanding results

• An Example

utterance: 請幫我查一下 台灣銀行 的 電話號碼 是幾號? ("Please help me look up the phone number of the Bank of Taiwan")

key phrases: (查一下) - (台灣銀行) - (電話號碼)

concepts: (inquiry) - (target) - (phone number)

Page 57

Speaker Verification

Feature Extraction

Verification

input speech yes/no

• Verifying the speaker as claimed• Applications requiring verification • Text dependent/independent• Integrated with other verification schemes

Speaker Models

Page 58

Voice-based Information Retrieval

• Speech Instructions

• Speech Documents (or Multi-media Documents including Speech Information)

speech instruction

我想找有關新政府組成的新聞? ("I want to find news about the formation of the new government")

text instruction

(Diagram: the instruction is matched against text documents d1, d2, d3 and against speech documents)

總統當選人陳水扁今天早上… ("President-elect Chen Shui-bian this morning…")

Page 59

Spoken Dialogue Systems

• Almost all human-network interactions can be accomplished by spoken dialogue

• Speech understanding, speech synthesis, dialogue management

• System/user/mixed initiatives

• Reliability/efficiency, dialogue modeling/flow control

• Transaction success rate/average number of dialogue turns

Databases

Sentence Generation and Speech Synthesis

Output Speech

Input Speech

DialogueManager

Speech Recognition and Understanding

User’s Intention

Discourse Context

Response to the user

Internet

Networks

Users

Dialogue Server

Page 60

References

“Speech and Language Processing over the Web”, IEEE Signal Processing Magazine, May 2008