Fundamentals of Speech Signal Processing
Page 1: Fundamentals of Speech Signal Processing. 1.0 Speech Signals

Fundamentals of Speech Signal Processing

Page 2

1.0 Speech Signals

Page 3

Waveform plots of typical vowel sounds - Voiced (濁音)

(Figure: vowel waveforms for tone 1, tone 2, and tone 4, plotted against time t)

Page 4

Speech Production and Source Model

• Human vocal mechanism • Speech Source Model

(Figure: the excitation u(t) passes through the Vocal tract to produce the speech signal x(t))

Page 5

Voiced and Unvoiced Speech

(Figure: excitation u(t) and output x(t); the voiced case shows periodic waveforms with a visible pitch period, the unvoiced case noise-like waveforms)

Page 6

Waveform plots of typical consonant sounds

Unvoiced (清音) Voiced (濁音)

Page 7

Waveform plot of a sentence

Page 8

Frequency domain spectra of speech signals

Voiced Unvoiced

Page 9

Frequency Domain

Voiced 

formant frequencies

Page 10

Unvoiced

formant frequencies

Frequency Domain

Page 11

Spectrogram

Page 12

Spectrogram

Page 13

Formant Frequencies

Page 14

Formant frequency contours

He will allow a rare lie.

Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang

Page 15

2.0 Speech Signal Processing

Page 16

Speech Signal Processing

• Major Application Areas

1. Speech Coding: Digitization and Compression

Considerations: 1) bit rate (bps) 2) recovered quality 3) computation complexity/feasibility

2. Voice-based Network Access — User Interface, Content Analysis, User-Content Interaction

(Block diagram: x(t) → LPF → sampling → x[n] → Processing Algorithms → xk → 110101… → Storage/transmission → Inverse Processing → x̂[n])

• Speech Signals

– Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.

– Double Levels of Information: Acoustic Signal Level / Symbolic or Linguistic Level

– Processing and Interaction of the Double-level Information

Page 17

Sampling of Signals

(Figure: continuous waveform x(t) over time t, and its sampled version x[n] over sample index n)
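The sampling step can be sketched in a few lines of NumPy. The 8 kHz rate and the 200 Hz test tone below are illustrative choices (8 kHz is typical for telephone-band speech), not values from the slides:

```python
import numpy as np

fs = 8000                    # sampling rate in Hz; common for telephone-band speech
f0 = 200                     # a 200 Hz tone standing in for the continuous signal x(t)
n = np.arange(80)            # sample indices covering 10 ms of signal
x = np.sin(2 * np.pi * f0 * n / fs)   # x[n] = x(t) evaluated at t = n / fs

print(len(x))                # 80
```

One period of the 200 Hz tone spans fs/f0 = 40 samples, so x[0] and x[40] land on the same phase of the waveform.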

Page 18

Double Levels of Information

字 (Character)

詞 (Word)

句 (Sentence)

Example: 人用電腦 ("people use computers") — characters 人, 用, 電, 腦; words 人, 用, 電腦

Page 19

Speech Signal Processing – Processing of Double-Level Information

• Speech Signal • Sampling • Processing

• Linguistic Structure

• Linguistic Knowledge: Lexicon, Grammar

Example: the character string 今 天 的 天 氣 非 常 好 is grouped into the words 今天 的 天氣 非常 好 ("the weather today is very good")

• Algorithm

• Chips or Computers

Page 20

Voice-based Network Access

(Diagram: User Interface, Content Analysis, and User-Content Interaction, all connected to the Internet)

User Interface — when keyboards/mice are inadequate

Content Analysis — helps in browsing/retrieval of multimedia content

User-Content Interaction — all text-based interaction can be accomplished by spoken language

Page 21

User Interface — Wireless Communications Technologies are Creating a Whole Variety of User Terminals at Any Time, from Anywhere

Smart phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…

Small in Size, Light in Weight, Ubiquitous, Invisible… Post-PC Era

Keyboard/Mouse Most Convenient for PCs, not Convenient any longer — human fingers never shrink, and the application environment has changed

Service Requirements Growing Exponentially

Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere, and to the point in one utterance

Speech Processing is the only less mature part in the Technology Chain

Internet Networks

Text Content

Multimedia Content

Page 22

Content Analysis—Multimedia Technologies are Creating a New World of Multimedia Content

• Most Attractive Form of the Network Content will be in Multimedia, which usually Includes Speech Information (but Probably not Text)

• Multimedia Content Difficult to be Summarized and Shown on the Screen, thus Difficult to Browse

• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval

• Multimedia Content Analysis based on Speech Information

Future Integrated Networks

Real-time Information – weather, traffic – flight schedule – stock price – sports scores

Special Services – Google – Facebook – YouTube – Amazon

Knowledge Archives – digital libraries – virtual museums

Intelligent Working Environment – e-mail processors – intelligent agents – teleconferencing – distance learning – electronic commerce

Private Services – personal notebook – business databases – home appliances – network entertainments

Page 23

User-Content Interaction — Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing

(Diagram: voice input/output between the user and the Internet; voice and text information link the user to the Multimedia Content)

• Network Access is Primarily Text-based today, but almost all Roles of Text can be Accomplished by Speech

• User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues

• Hand-held Devices with Multimedia Functionalities are Commonly used Today

• Using Speech Instructions to Access Multimedia Content whose Key Concepts are Specified by Speech Information

Multimedia Content Analysis

Text Information Retrieval

Text Content

Voice-based Information Retrieval

Text-to-Speech Synthesis

Spoken and Multi-modal Dialogue

Page 24

3.0 Speech Coding

Page 25

Waveform-based Approaches

• Pulse-Coded Modulation (PCM)

– binary representation for each sample x[n] by quantization

• Differential PCM (DPCM)

– encoding the differences

d[n] = x[n] − x[n−1]

d[n] = x[n] − Σ_{k=1}^{P} a_k x[n−k]

• Adaptive DPCM (ADPCM)

– with adaptive algorithms

Ref: Haykin, “Communication Systems”, 4th Ed., Sections 3.7, 3.13, 3.14, 3.15
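The first-order DPCM idea above can be sketched in plain Python. Note this toy version transmits d[n] exactly; a real DPCM coder quantizes d[n] before transmission, which is where the bit-rate saving and the coding error come from:

```python
def dpcm_encode(x):
    # transmit the differences d[n] = x[n] - x[n-1], taking x[-1] = 0
    d, prev = [], 0
    for s in x:
        d.append(s - prev)
        prev = s
    return d

def dpcm_decode(d):
    # reconstruct x[n] = x[n-1] + d[n]; lossless here because d[n] is unquantized
    x, prev = [], 0
    for e in d:
        prev += e
        x.append(prev)
    return x

samples = [3, 5, 4, 4, 6]
diffs = dpcm_encode(samples)     # [3, 2, -1, 0, 2] -- smaller magnitudes than the raw samples
print(dpcm_decode(diffs))        # [3, 5, 4, 4, 6]
```

Because adjacent speech samples are strongly correlated, the differences have smaller magnitude than the samples themselves and can be coded with fewer bits.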

Page 26

Speech Source Model and Source Coding

•Speech Source Model

– digitization and transmission of the parameters will be adequate

– at the receiver the parameters can reproduce x[n] with the model

– much fewer parameters with much slower variation in time lead to many fewer bits required

– the key to low bit rate speech coding

(Diagram: parameters drive an Excitation Generator producing u[n], which excites a Vocal Tract Model G(ω), G(z), g[n] to produce x[n])

x[n] = u[n] ∗ g[n],  X(ω) = U(ω)G(ω),  X(z) = U(z)G(z)

Page 27

Speech Source Model

(Figure: speech waveform x(t) over time t, and its slowly varying model parameters a[n] over frame index n)

Page 28

Speech Source Model and Source Coding

Analysis and Synthesis

– High computation requirements are the price for low bit rate

Page 29

Simplified Speech Source Model

– Excitation parameters: v/u (voiced/unvoiced), N (pitch period for voiced), G (signal gain), excitation signal u[n]

(Diagram: a random sequence generator (unvoiced) or a periodic pulse train generator (voiced, period N) is selected by the v/u switch and scaled by the gain G to produce the excitation u[n]; u[n] drives the Vocal Tract Model G(z), G(ω), g[n] to produce x[n])

G(z) = 1 / (1 − Σ_{k=1}^{P} a_k z^{−k})

– Vocal Tract parameters {ak}: LPC coefficients, describing the formant structure of speech signals

– A good approximation, though not precise enough

Reference: 3.3.1-3.3.6 of Rabiner and Juang, or 6.3 of Huang
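The all-pole filter G(z) = 1 / (1 − Σ a_k z^{−k}) corresponds to the difference equation x[n] = u[n] + Σ_{k=1}^{P} a_k x[n−k], which a short plain-Python loop can apply directly (the single coefficient a1 = 0.5 below is an arbitrary illustration, not a value from the slides):

```python
def lpc_synthesize(u, a):
    """All-pole synthesis: x[n] = u[n] + sum_{k=1}^{P} a[k-1] * x[n-k],
    i.e. filtering the excitation u[n] through G(z) = 1 / (1 - sum_k a_k z^-k)."""
    x = []
    for n, un in enumerate(u):
        xn = un
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                xn += a[k - 1] * x[n - k]
        x.append(xn)
    return x

# impulse excitation through a one-pole vocal tract model with a1 = 0.5
print(lpc_synthesize([1, 0, 0, 0], [0.5]))   # [1, 0.5, 0.25, 0.125]
```

An impulse excitation exposes g[n] itself: each output sample is half the previous one, the impulse response of the one-pole model.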

Page 30

LPC Vocoder (Voice Coder)

N by pitch detection, v/u by voicing detection

{ak} can be non-uniformly or vector quantized to reduce bit rate further

Ref : 3.3 ( 3.3.1 up to 3.3.9 ) of Rabiner and Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993

Page 31

Multipulse LPC

Poor modeling of u[n] is the main source of quality degradation in the LPC vocoder, so u[n] is replaced by a sequence of pulses

u[n] = Σ_k b_k δ[n − n_k]

– roughly 8 pulses per pitch period

– u[n] close to periodic for voiced

– u[n] close to random for unvoiced

Estimating (b_k, n_k) is a difficult problem
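Constructing the pulse-sequence excitation is straightforward once the (n_k, b_k) pairs are known; the positions and amplitudes below are invented for illustration:

```python
def multipulse_excitation(length, pulses):
    # u[n] = sum_k b_k * delta[n - n_k]; pulses is a list of (n_k, b_k) pairs
    u = [0.0] * length
    for n_k, b_k in pulses:
        u[n_k] += b_k
    return u

u = multipulse_excitation(8, [(1, 0.9), (5, -0.4)])
print(u)   # [0.0, 0.9, 0.0, 0.0, 0.0, -0.4, 0.0, 0.0]
```

The hard part, as the slide notes, is not building u[n] but choosing the (b_k, n_k) that best reproduce the original speech.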

Page 32

Multipulse LPC

Estimating (b_k, n_k) is a difficult problem – analysis by synthesis

– large amount of computation is the price paid for better speech quality

Page 33

Multipulse LPC

Perceptual Weighting

W(z) = (1 − Σ_{k=1}^{P} a_k z^{−k}) / (1 − Σ_{k=1}^{P} a_k c^k z^{−k}),  0 < c < 1

for perceptual sensitivity

W(z) = 1, if c = 1

W(z) = 1 − Σ_{k=1}^{P} a_k z^{−k}, if c = 0

practically c ≈ 0.8

Error Evaluation

E = ∫_0^{2π} |X(ω) − X̂(ω)|² W(ω) dω
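The two limiting cases of W(z) can be checked numerically. The sketch below builds the numerator and denominator coefficient lists of W(z) (constant term first, powers of z^{−1}); the {a_k} values are arbitrary illustrations:

```python
def weighting_coeffs(a, c):
    # W(z) = (1 - sum_k a_k z^-k) / (1 - sum_k a_k c^k z^-k)
    # returned as two polynomial coefficient lists in z^-1, constant term first
    num = [1.0] + [-ak for ak in a]
    den = [1.0] + [-ak * c ** (k + 1) for k, ak in enumerate(a)]
    return num, den

a = [0.8, -0.3]
num1, den1 = weighting_coeffs(a, 1.0)   # c = 1: numerator == denominator, so W(z) = 1
num0, den0 = weighting_coeffs(a, 0.0)   # c = 0: denominator becomes 1, so W(z) = A(z)
print(num1 == den1)                     # True
```

Intermediate values of c (practically around 0.8) de-emphasize the error near the formant peaks, where the ear is less sensitive to coding noise.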

Page 34

Multipulse LPC

Error Minimization and Pulse search

u[n] = Σ_k b_k δ[n − n_k]

x̂[n] = Σ_k b_k g[n − n_k]

E = E(b₁, n₁, b₂, n₂, ……)

– sub-optimal solution: finding 1 pulse at a time

Page 35

Code-Excited Linear Prediction (CELP)

Use of VQ to Construct a Codebook of Excitation Sequences

– a sequence consists of roughly 40 samples

– a codebook of 512 ~ 1024 patterns is constructed with VQ

– roughly 512 ~ 1024 excitation patterns are perceptually adequate

Excitation Search –analysis by synthesis

– 9 ~ 10 bits are needed for the excitation of 40 samples, while the {ak} parameters in G(z) are also vector quantized
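The analysis-by-synthesis excitation search can be sketched as an exhaustive loop over a tiny codebook. Everything below is a toy stand-in: the codebook has 3 entries rather than the 512 ~ 1024 mentioned above, and the LPC filter has a single coefficient:

```python
def synthesize(u, a):
    # all-pole LPC synthesis: x[n] = u[n] + sum_k a[k-1] * x[n-k]
    x = []
    for n, un in enumerate(u):
        x.append(un + sum(a[k] * x[n - 1 - k]
                          for k in range(len(a)) if n - 1 - k >= 0))
    return x

def celp_search(target, codebook, a):
    # analysis-by-synthesis: synthesize every candidate excitation through the
    # LPC filter and return the index minimizing the squared error to the target
    errors = []
    for u in codebook:
        x_hat = synthesize(u, a)
        errors.append(sum((t - xh) ** 2 for t, xh in zip(target, x_hat)))
    return errors.index(min(errors))

codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [0.5, 0.5, 0, 0]]
a = [0.5]
target = synthesize([0, 1, 0, 0], a)      # pretend this is the input speech frame
print(celp_search(target, codebook, a))   # 1
```

Since the target was synthesized from codebook entry 1, that entry gives zero error and wins the search; only its 9 ~ 10-bit index would be transmitted.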

Page 36

Code-Excited Linear Prediction (CELP)

Receiver

– {ak} codewords can be transmitted less frequently than excitation codewords

Ref : Gold and Morgan, “Speech and Audio Signal Processing”, John Wiley & Sons, 2000, Chap 33

Page 37

4.0 Speech Recognition and Voice-based Network Access

Page 38

Feature Extraction

(Diagram: the unknown speech signal x(t) goes through Feature Extraction to the feature vector sequence X; Pattern Matching against Reference Patterns and Decision Making then produce the output word W; the Reference Patterns are obtained by Feature Extraction on training speech y(t), giving Y)

Speech Recognition as a pattern recognition problem

Page 39

• A Simplified Block Diagram

• Example Input Sentence: this is speech

• Acoustic Models (聲學模型): (th-ih-s-ih-z-s-p-iy-ch)

• Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech

• Language Model (語言模型): (this) – (is) – (speech)

P(this) P(is | this) P(speech | this is)

P(wi|wi-1): bi-gram language model

P(wi|wi-1,wi-2): tri-gram language model, etc.

Basic Approach for Large Vocabulary Speech Recognition

Front-end Signal Processing

Acoustic Models, Lexicon

Feature Vectors

Linguistic Decoding and Search Algorithm

Output Sentence

Speech Corpora

Acoustic Model Training

Language Model Construction

Text Corpora

Language Model

Input Speech

Page 40


2.0 Fundamentals of Speech Recognition

Hidden Markov Models (HMM)

˙ Formulation

Ot = [x1, x2, … xD]^T : feature vector for the frame at time t

qt = 1, 2, 3, … N : state number for feature vector Ot

A = [aij], aij = Prob[qt = j | qt−1 = i] : state transition probabilities

B = [bj(o), j = 1, 2, … N] : observation probabilities

bj(o) = Σ_{k=1}^{M} cjk bjk(o),  Σ_{k=1}^{M} cjk = 1

bjk(o): multi-variate Gaussian distribution for the k-th mixture of the j-th state; M: total number of mixtures

π = [π1, π2, … πN] : initial probabilities, πi = Prob[q1 = i]

HMM: λ = (A, B, π)

(Figure: a left-to-right HMM with self-loops a11, a22, transitions a12, a13, a23, a24, and observation distributions b1(o), b2(o), b3(o) on the states; the observation sequence o1 o2 o3 o4 o5 o6 o7 o8… is aligned with the state sequence q1 q2 q3 q4 q5 q6 q7 q8…)
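For a 1-dimensional observation, the mixture observation probability bj(o) = Σ_k cjk bjk(o) can be written out directly in plain Python. The weights, means, and variances below are illustrative, not values from the slides:

```python
import math

def gaussian(o, mu, var):
    # univariate Gaussian density N(o; mu, var)
    return math.exp(-((o - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def b_j(o, c, mu, var):
    # b_j(o) = sum_{k=1}^{M} c_jk * b_jk(o), with the mixture weights summing to 1
    assert abs(sum(c) - 1.0) < 1e-9
    return sum(ck * gaussian(o, mk, vk) for ck, mk, vk in zip(c, mu, var))

# a two-mixture state: weights 0.6/0.4, means 0 and 3, unit variances
p = b_j(0.0, [0.6, 0.4], [0.0, 3.0], [1.0, 1.0])
print(round(p, 4))   # 0.2411
```

In a real recognizer each bjk(o) is a multi-variate Gaussian over the D-dimensional feature vector, but the weighted-sum structure is identical.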

Page 41

Observation Sequences

     

Page 42

1-dim Gaussian Mixtures

State Transition Probabilities

Page 43


Hidden Markov Models (HMM)

˙ Double Layers of Stochastic Processes

- hidden states with random transitions for time warping

- random output given state for random acoustic characteristics

˙ Three Basic Problems

(1) Evaluation Problem:

Given O = (o1, o2, … ot … oT) and λ = (A, B, π),

find Prob[O | λ]

(2) Decoding Problem:

Given O = (o1, o2, … ot … oT) and λ = (A, B, π),

find the best state sequence q = (q1, q2, … qt, … qT)

(3) Learning Problem:

Given O, find the best values for the parameters in λ

such that Prob[O | λ] = max
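The evaluation problem has the classic forward-algorithm solution. A compact sketch for discrete observations follows; the 2-state toy model parameters are invented for illustration:

```python
def forward(A, B, pi, obs):
    # computes Prob[O | lambda] by the forward algorithm
    # A[i][j]: transition prob, B[j][o]: observation prob, pi[i]: initial prob
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]           # initialization
    for o in obs[1:]:                                          # induction over time
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                          # termination

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(forward(A, B, pi, [0, 1]))   # ~0.1915
```

Summing over all state sequences directly would cost O(N^T); the forward recursion brings this down to O(N²T).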

Page 44

Simplified HMM

RGBGGBBGRRR……

Page 45

Peripheral Processing for Human Perception (P.34 of 7.0 )

Page 46


Feature Extraction (Front-end Signal Processing)

˙ Mel Frequency Cepstral Coefficients (MFCC)

- Mel-scale Filter Bank: triangular filters in frequency, overlapped; uniformly spaced below 1 kHz, logarithmically spaced above 1 kHz

˙ Delta Coefficients

- 1st/2nd order differences

(Pipeline: windowed speech samples → Discrete Fourier Transform → Mel-scale Filter Bank → log(|·|²) → Inverse Discrete Fourier Transform → MFCC)
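A condensed NumPy sketch of the pipeline above follows. Several choices are illustrative rather than taken from the slides: the 8 kHz rate, 10 filters, 5 cepstra, the 2595·log10(1 + f/700) mel formula, and spacing all filters uniformly on the mel scale (a common variant of the uniform-below-1-kHz / logarithmic-above arrangement). The final inverse transform is implemented as the DCT it reduces to for a real, even log-spectrum:

```python
import numpy as np

def mel(f):
    # Hz -> mel scale (a common approximation)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=8000, n_filt=10, n_ceps=5):
    # 1) magnitude-squared DFT of one Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # 2) overlapping triangular filters, uniformly spaced on the mel scale
    edges = mel_inv(np.linspace(0.0, mel(fs / 2.0), n_filt + 2))
    energies = np.empty(n_filt)
    for i in range(n_filt):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        falling = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[i] = np.sum(spec * np.minimum(rising, falling))
    # 3) log compression, then 4) DCT of the log energies -> cepstral coefficients
    log_e = np.log(energies + 1e-10)
    k = np.arange(n_filt)
    return np.array([np.sum(log_e * np.cos(np.pi * q * (k + 0.5) / n_filt))
                     for q in range(n_ceps)])

frame = np.sin(2 * np.pi * 500 * np.arange(200) / 8000)  # a 500 Hz test tone
ceps = mfcc(frame)
print(ceps.shape)   # (5,)
```

Delta coefficients would then be first/second-order differences of these vectors across successive frames.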

Page 47

Mel-scale Filter Bank

Page 48


Language Modeling: N-gram

W = (w1, w2, w3,…,wi,…wR) a word sequence

- Evaluation of P(W)

P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, … wi−1)

- Assumption:

P(wi | w1, w2, … wi−1) = P(wi | wi−N+1, wi−N+2, … wi−1)

Occurrence of a word depends on the previous N − 1 words only

N-gram language models

N = 1: unigram P(wi)

N = 2: bigram P(wi | wi−1)

N = 3: tri-gram P(wi | wi−2, wi−1)

N = 4: four-gram P(wi | wi−3, wi−2, wi−1)

probabilities estimated from a training text database

example: tri-gram model

P(W) = P(w1) P(w2 | w1) Π_{i=3}^{R} P(wi | wi−2, wi−1)
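Under the bigram approximation, evaluating P(W) amounts to one table lookup per word. A sketch with made-up probabilities (the toy tables below are invented for illustration):

```python
def sentence_prob(words, unigram, bigram):
    # bigram model: P(W) = P(w1) * prod_{i=2}^{R} P(w_i | w_{i-1})
    p = unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram[(prev, cur)]
    return p

# toy probabilities, invented for illustration
unigram = {"this": 0.1}
bigram = {("this", "is"): 0.2, ("is", "speech"): 0.05}
print(sentence_prob(["this", "is", "speech"], unigram, bigram))   # ~0.001
```

A tri-gram model would key the table on the previous two words instead of one.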

Page 49

N-gram

W1 W2 W3 W4 W5 W6 …… WR

P(W) = P(w1) ∏_{i=2}^{R} P(wi | w1, w2, ⋯, wi−1)

tri-gram:

P(W) = P(w1) P(w2 | w1) ∏_{i=3}^{R} P(wi | wi−2, wi−1)

Page 50


Language Modeling - Evaluation of N-gram model parameters

unigram

P(wi) = N(wi) / Σ_{j=1}^{V} N(wj)

wi : a word in the vocabulary

V : total number of different words in the vocabulary

N( ) : number of counts in the training text database

bigram

P(wj | wk) = N(<wk, wj>) / N(wk)

<wk, wj> : a word pair

trigram

P(wj | wk, wm) = N(<wk, wm, wj>) / N(<wk, wm>)

smoothing: estimation of probabilities of rare events by statistical approaches

Page 51

… this …        50000

… this is …       500

… this is a …       5

Prob[is | this] = 500 / 50000 = 0.01

Prob[a | this is] = 5 / 500 = 0.01
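The two conditional probabilities on this slide follow directly from the counts; spelled out in Python:

```python
# counts observed in the training text, as given on the slide
counts = {
    ("this",): 50000,
    ("this", "is"): 500,
    ("this", "is", "a"): 5,
}

def cond_prob(history, word):
    # P(word | history) = N(history followed by word) / N(history)
    return counts[history + (word,)] / counts[history]

print(cond_prob(("this",), "is"))        # 500 / 50000 = 0.01
print(cond_prob(("this", "is"), "a"))    # 5 / 500 = 0.01
```

For N-grams that never occur in the training text this ratio is zero, which is why the smoothing mentioned above is needed.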

Page 52


Large Vocabulary Continuous Speech Recognition

W = (w1, w2, … wR) : a word sequence

X = (x1, x2, … xT) : feature vectors for a speech utterance

W* = Arg Max_W Prob(W | X)   (MAP principle)

Prob(W | X) = Prob(X | W) P(W) / P(X) = max

⇒ Prob(X | W) P(W) = max

Prob(X | W) by HMM, P(W) by language model

˙ A Search Process Based on Three Knowledge Sources

- Acoustic Models : HMMs for basic voice units (e.g. phonemes)

- Lexicon : a database of all possible words in the vocabulary, each word including its pronunciation in terms of component basic voice units

- Language Models : based on words in the lexicon

(Diagram: X enters the Search block, which outputs W*, drawing on the Acoustic Models, the Lexicon, and the Language Models)
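Since P(X) is the same for every candidate word sequence, the MAP search only compares Prob(X|W)·P(W). A minimal sketch over an explicit candidate list (the sequences and scores below are invented for illustration; a real decoder searches an enormous implicit space rather than a short list):

```python
def map_decode(candidates):
    # candidates: (word_sequence, acoustic_likelihood Prob(X|W), language_model_prob P(W))
    # P(X) is identical for every W, so it cancels out of the argmax
    return max(candidates, key=lambda c: c[1] * c[2])[0]

candidates = [
    ("this is speech", 1e-5, 1e-2),   # acoustic score x LM score = 1e-7
    ("this is peach",  2e-5, 1e-3),   # = 2e-8
]
print(map_decode(candidates))   # this is speech
```

Here the language model overrules the slightly better acoustic match, which is exactly the interaction the MAP formulation is designed to capture.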

Page 53

Text-to-speech Synthesis

(Pipeline: Input Text → Text Analysis and Letter-to-sound Conversion → Prosody Generation → Signal Processing and Concatenation → Output Speech Signal, supported by a Lexicon and Rules, a Prosodic Model, and a Voice Unit Database)

• Transforming any input text into corresponding speech signals

• E-mail/Web page reading

• Prosodic modeling

• Basic voice units/rule-based, non-uniform units/corpus-based, model-based

Page 54

假如今天胡適先生還活著,又會怎樣呢? ("If Mr. Hu Shih were still alive today, what would it be like?")

text analysis

假如 | 今天 | 胡適 | 先生 | 還 | 活 | 著 | , | 又 | 會 | 怎樣 | 呢 | ?
Cbb, Nd, Nb, Na, Dfa, VH, Di, , D, D, D, T, ?

假如 | 今天 胡適 | 先生 還 | 活 | 著 | , 又 | 會 怎樣 | 呢 | ?
Cbb, Nd, Nb, Na, Dfa, VH, Di, , D, D, D, T, ?

break indices: 5 1 3 1 2 1 1 4 1 2 1 5

prosody generation

Energy

Pause

Intonation

Tone

Duration

speech synthesis

· Three Major Steps: text analysis, prosody generation, speech synthesis

Text-to-Speech Synthesis

Page 55

Automatic Prosodic Analysis for an Arbitrary Text Sentence

(Lattice: from start to end over the tag sequence Cbb Nd Nb Na Dfa VH Di, candidate minor-phrase groupings include Cbb+Nd, Cbb+Nd+Nb, Nd+Nb, Nb+Na, Na+Dfa, Dfa+VH, Dfa+VH+Di, VH+Di)

· Predict B1, B2, B3 (B4, B5 determined by punctuation marks)

· minor phrase patterns: (Cbb+Nd, Nb+Na, Dfa+VH+Di, D+D, D+T, etc.)

假如 | 今天 胡適 | 先生 還 | 活 | 著 | , 又 | 會 怎樣 | 呢 | ?
Cbb, Nd, Nb, Na, Dfa, VH, Di, , D, D, D, T, ?

minor phrases → prosodic groups → major phrases → breath groups

words

break indices: 5 1 3 1 2 1 1 4 1 2 1 5

Page 56

Speech Understanding

• Understanding Speaker’s Intention rather than Transcribing into Word Strings

• Limited Domains/Finite Tasks

acoustic models

phrase lexicon

Syllable Recognition

Key Phrase Matching

(Data flow: input utterance → syllable lattice → phrase graph → concept graph → concept set)

phrase/concept language model

Semantic Decoding

understanding results

• An Example

utterance: 請幫我查一下 台灣銀行 的 電話號碼 是幾號? ("Please help me look up the phone number of the Bank of Taiwan")

key phrases: (查一下) - (台灣銀行) - (電話號碼)

concepts: (inquiry) - (target) - (phone number)

Page 57

Speaker Verification

Feature Extraction

Verification

input speech yes/no

• Verifying the speaker as claimed• Applications requiring verification • Text dependent/independent• Integrated with other verification schemes

Speaker Models

Page 58

Voice-based Information Retrieval

• Speech Instructions

• Speech Documents (or Multi-media Documents including Speech Information)

speech instruction

我想找有關新政府組成的新聞? ("I want to find news about the formation of the new government")

text instruction

(Diagram: the instruction is matched against text documents d1, d2, d3 and against speech documents)

總統當選人陳水扁今天早上… ("President-elect Chen Shui-bian this morning…")

Page 59

Spoken Dialogue Systems

• Almost all human-network interactions can be accomplished by spoken dialogue

• Speech understanding, speech synthesis, dialogue management

• System/user/mixed initiatives

• Reliability/efficiency, dialogue modeling/flow control

• Transaction success rate/average number of dialogue turns

Databases

Sentence Generation and Speech Synthesis

Output Speech

Input Speech

DialogueManager

Speech Recognition and Understanding

User’s Intention

Discourse Context

Response to the user

Internet

Networks

Users

Dialogue Server

Page 60

References

“Speech and Language Processing over the Web”, IEEE Signal Processing Magazine, May 2008