Upload
vanliem
View
243
Download
1
Embed Size (px)
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
May 10, 2012
Signal Processing
For Speech Applications
- Part 1
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
References:
• Huang et al., Chapter on DSP
• Classical paper: Schafer/Rabiner in Waibel/Lee (on the
web)
• Nahin: "Dr. Euler's Fabulous Formula" – excellent
explanation of Fourier sums and the Fourier Transform,
written for Engineering students
Note: many slides of this lecture are from Rich Stern
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
Signal Processing For Speech Applications
• Major speech technologies:
– Speech coding
– Speech synthesis
– Speech recognition
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
Goals Of Speech Representations
• Capture important phonetic information in speech
• Computational efficiency
• Efficiency in storage requirements
• Optimize generalization
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
5
Representation of Speech
Definition: Digital representation of speech
Represent speech as a sequences of numbers
(as a prerequisite for automatic processing using computers)
1) Direct representation of speech waveform:
represent speech waveform as accurate as possible so that an
acoustic signal can be reconstructed
2) Parametric representation
Represent a set of properties/parameters with regard to a certain model
Decide the targeted application first:
• Speech coding
• Speech synthesis
• Speech recognition
Classical paper: Schafer/Rabiner in Waibel/Lee (paper online)
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 6
Speech Coding
Objectives of Speech Coding:
– Quality versus bit rate
– Quantization Noise
– High measured intelligibility
– Low bit rate (b/s of speech)
– Low computational requirement
– Robustness to transmission errors
– Robustness to successive
encode/decode cycles
Objectives for real-time:
– Low coding/decoding delay
– Work with non-speech
signals (e.g. touch tone)
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 7
Speech Synthesis
The standard approach:
Speech Synthesis = Text-to-Speech conversion (don't confuse
that with Speech Recognition)
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 8
Speech Recognition
• Major functional components:
– Signal processing to extract relevant features from speech
waveforms
– Comparison of features to pre-stored templates
• Important design choices:
– Choice of features (Optimize Generalization)
– Specific method of comparing features to stored templates
Feature
extraction
Decision
making
procedure A/D
Hypotheses
(phonemes)
Speech
features Speech
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 9
Goals Of This Lecture
How do we accomplish feature extraction for speech recognition?
Some specific topics:
– Quantization (A/D Conversion)
– Sampling
– Filter Bank Coefficients
– Linear predictive coding (LPC)
– LPC-derived cepstral coefficients (LPCC)
– Mel-frequency cepstral coefficients (MFCC)
Some of the underlying mathematics
– Continuous-time Fourier transform (CTFT)
– Discrete-time Fourier transform (DTFT)
– Z-transforms
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
0
Overview (I)
• Introduction
– Why Perform Signal Processing?
• Different Kinds of Signals
– Quantization
• Quantization of Signals
• Quantization of Speech Signals
– Sampling of continuous-time signals
• How Frequently Should we Sample? – Aliasing
• The Sampling Theorem
• Acoustic Features in the Sampled Signal
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
1
Overview (II)
• Frequency Representations in Continuous and Discrete Time (I)
– Why Perform Signal Processing In The Frequency Domain?
– The Source-Filter Model For Speech
– Frequency Representation
– Fourier Sums
– Continuous-time Fourier Transform (CTFT)
• Features of the Fourier Transform
– Discrete Time Fourier Transform (DTFT)
– Convolution
• Convolution of a Function with an Impulse Train
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
2
Overview (III)
• Frequency Representations in Continuous and Discrete Time (II)
– The Fourier Transform – Example
– Examples of CTFTs and DTFTs
– Correspondence Between Frequency In Discrete And Continuous Time
– Short-Term Spectral Analysis
– Short-time Fourier Analysis
– Windowing
• Two Popular Window Shapes
• The Effect of Applying Windows
• Breaking Up Incoming Speech Into Frames Using Hamming Windows
• Time and Frequency Response of Windows
• Speech Visualisation with Spectrograms
• Effect of Window Duration
• Using Filterbanks
– The Mel Scale
– The Bark Scale
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
3 • Introduction
– Why Perform Signal Processing?
• Different Kinds of Signals
– Quantization
• Quantization of Signals
• Quantization of Speech Signals
– Sampling of continuous-time signals
• How Frequently Should we Sample? – Aliasing
• The Sampling Theorem
• Acoustic Features in the Sampled Signal
Overview
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
4
Why Perform Signal Processing?
• A look at the time-domain waveform of “six”:
How do we get this into a computer?
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
5
Sampling & Quantization
Goal: Given a signal that is continuous in time and amplitude,
find a discrete representation.
Two steps are necessary: sampling and quantization.
• Quantization corresponds to a discretization of the y-axis
• Sampling corresponds to a discretization of the x-axis
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
6
• Given a discrete signal f[i] to be quantized into q[i]
• Assume that f is between fmin and fmax
• Partition y-axis into a fixed number n of (equally sized) intervals
• Usually n=2b , in ASR typically b=16 n=65536 (16-bit quantization)
• q[i] can only have values that are centers of the intervals
• Quantization: assign q[i] the center of the interval in which lies f[i]
• Quantization makes errors, i.e. adds noise to the signal f[i]=q[i]+e[i]
• The average quantization error e[i] is (fmax-fmin)/(2n)
• Define signal to noise ratio SNR[dB] = power(f[i]) / power(e[i])
Quantization of Signals
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
7 • Choice of quantization depth:
Speech signals are usually in the range between 50 dB and 60 dB
The lower the SNR, the lower the speech recognition performance
To get a reasonable SNR, b should be at least 10 to 12
Each bit contributes to about 6db of SNR
Typically in ASR the samples are quantized with 16 bits
• Speech signals' amplitudes are not equally distributed
Speech amplitudes are exponentially distributed around their mean
So the information-theoretic optimum is not to partition the y-axis
into equal intervals
therefore, often used: µ-law encoding
f(µ)[n] = fmax · sgn(f[n]) · log(1+µ|f[n]|/fmax) / log(1+µ) , µ=100,...,500
µ-law encoding makes speech amplitudes equally distributed before
quantization
Quantization of Speech Signals
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
8
Sampling Continuous-time Signals
Sampling converts the analog signal (in the time domain) into a digital
representation (also in the time domain!)
Time domain:
• x-axis = time
• y-axis = amplitude
Questions:
• Do we lose information during sampling?
• How can we prevent it?
0 20 40 60 80 100 120-0.5
0
0.5
1
1.5
Original speech signal
0 20 40 60 80 100 120-0.5
0
0.5
1
1.5
Sampled version of signal
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 1
9
How Frequently Should we Sample? – Aliasing
• The number of samples per unit of time (usually seconds) taken from
a continuous signal to make a discrete signal is called sampling rate
• The unit for sampling rate is hertz (inverse seconds, 1/s)
• Example: Correct sampling (above) /
undersampling (below)
– After undersampling we can not
(uniquely) recover the original frequency!
– This effect is called aliasing.
– How can we prevent aliasing?
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
1 • How can we prevent aliasing?
In practical applications, the signals are band-limited
Those band-limited signals do not contain any arbitrary frequency.
But there is a maximum frequency.
The band limitation can be ensured already during the recording process by
an appropriate A/D converter (with analog frequency filters)
When the sampling rate is too low, the samples can contain "incorrect"
frequencies.
This frequence 2fl is also called Nyquist rate.
The maximal frequency fl, which can be returned correctly is called
Nyquist frequency.
The Aliasing Effect
Nyquist or sampling theorem:
When a fl-band-limited signal is sampled with a sampling rate of at
least 2fl then the signal can be exactly reproduced from the samples.
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
2
Why Perform Signal Processing?
• A look at the time-domain waveform of “six”:
It’s hard to infer much from the time-domain waveform
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
3 • Amplitude (1) (not very useful)
The envelope of the amplitude is correlated to the power
(= integral of the squared signal)
Power is useful for detecting speech vs. silence,
also syllables, phrase boundaries, prosody
• Peak to peak (2) (correlated to amplitude)
• Root mean square (RMS) amplitude (3): Square root of the mean over time
(proportional to average power)
• Zero crossing rate can help distinguish some
weak sounds from silence
• Voiced/unvoiced homogeneous periodic signal vs. white noise in fricatives
Acoustic Features in the Sampled Signal
Source: Wikipedia
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
4 • By now we have considered the signal in the time domain,
i.e. we have shown the amplitude (signal strength) as a function of time
• Now we consider the signal in the frequency domain,
i.e. we want to know which frequency components occur in the signal
• Example for frequencies: A pure oscillation (left) correponds to a peak at
a specific frequency (right). If we mix
frequencies (bottom), the frequency
spectrum gets more complex
Sum of 2 wave functions Left: time domain Right: frequency domain (spectrum) Source: Ivica Rogina's lecture
The Frequency Domain
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
5 • A very common transform from
time domain to frequency domain:
Fourier Transform
• The frequency spectrum of a
time-domain signal is a
representation of that signal
in the frequency domain.
Sum of 2 wave functions Left: time domain Right: frequency domain
The Frequency Domain
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
6 • Frequency Representations in Continuous and Discrete Time (I)
– Why Perform Signal Processing In The Frequency Domain?
– The Source-Filter Model For Speech
– Frequency Representation
– Fourier Sums
– Continuous-time Fourier Transform (CTFT)
• Features of the Fourier Transform
– Discrete Time Fourier Transform (DTFT)
– Convolution
• Convolution of a Function with an Impulse Train
Overview
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
7
Why Perform Signal Processing In The Frequency Domain?
• Use of frequency analysis often simplifies signal
processing
• Use of frequency analysis often facilitates
understanding
• Human hearing is based on frequency analysis
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
8
Frequency Representation
• Frequency representations in continuous and discrete time
– Continuous-time Fourier transform (CTFT)
– Discrete-time Fourier transform (DTFT)
– Short-time Fourier analysis
– Effects of window size and shape
– Z-transforms
• Feature extraction for speech recognition
• Signal processing for robust speech recognition
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 2
9
The Source-Filter Model For Speech
• Sounds are produced either by
- vibrating the vocal cords (voiced sounds) or
- random noise resulting from friction of the airflow (unvoiced sounds)
- voiced fricatives need a mixed excitation model
• Signal un is modulated by the vocal tract, whose impulse response is vn
• Resulting signal is modulated by the lips and nostrils' radiation response rn
• Eventually the resulting signal fn is emitted.
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
0 • A useful model for representing the generation
of speech sounds:
Pitch Pulse train
source
Noise source
Vocal tract model
Amplitude
p[n]
The Source-Filter Model For Speech
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
2 • For a continuous function of time f(t), we define its Fourier
Transform by
• is the complex oscillation (with the
frequency ω): For t between 0 and 2π/ω, the point eiωt
moves around the unit circle in the
complex plane. i is the imaginary unit.
Continuous-Time Fourier Transform
dtetfF ti )()(
tite ti sincos
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
3 • One can interpret F(ω) as the content of frequency ω in the
function f(t).
• The Fourier transform is invertible:
• This means that no information is lost by the transformation!
Continuous-Time Fourier Transform
deFtf ti)(2
1)(
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
4
We can express the Fourier transform F of a function f in polar coordinates:
Then
is called the spectrum,
and
is the amplitude spectrum
and
is the phase spectrum
and
is the power spectrum
• Often, when we say "spectrum" we actually mean "power spectrum".
• It is general consensus that the phase spectrum in not important for
speech recognition.
Continuous-Time Fourier Transform (CTFT)
)sin()cos( tite ti
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
5 • The Fourier transform does have some nice features:
Often, the effect of a channel on a signal can be described as a
convolution in the time domain.
Then the corresponding effect in the frequency domain becomes a
multiplication.
And in the log-frequency domain we get a simple summation.
Features of the Fourier Transform
Time shift
Frequency Shift
Convolution
Linearity
Differentiation
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
6 • We can transfer the idea of the Fourier transform to a discrete
signal. Then we get the Discrete Time Fourier Transform of a
sequence x[k]:
• DTFT is periodic with period 2 !
Thus we use only an interval of this length of for the inverse
transformation:
• Note: DTFT transforms a discrete signal into a continuous
function.
Discrete Time Fourier Transform (DTFT)
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
7
The Dirac distribution is defined
as a single impulse with the area 1.0.
Its Fourier transform is 1.0
A discrete function s[n]
can be expressed
as the weighted sum of
Dirac distributions:
Consequently, we can
define the
discrete Fourier
transform as a sum:
Discrete Time Fourier Transform (DTFT)
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
8 • The convolution of two functions f(t) * g(t) is defined as:
• or in the discrete case:
• Convolution is a commutative operation!
Convolution
dgtftgf )()())((
i
kgkifkgf ][][])[(
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 3
9 • Computing the Fourier Transform of
on both sides, we get
i.e. the convolution becomes a multiplication in frequency domain!
(Why?)
Convolution
dgtftgfth )()())(()(
)()()( GFH
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
0 • The convolution of a function with an impulse does not change the
function
• The convolution of a function with an impulse train causes identical
copies of the function, if the impulses are located far apart.
Convolution of a Function with an Impulse Train
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
1
Overview
• Frequency Representations in Continuous and Discrete Time (II)
– The Fourier Transform – Example
– Examples of CTFTs and DTFTs
– Correspondence Between Frequency In Discrete And Continuous Time
– Short-Term Spectral Analysis
– Short-time Fourier Analysis
– Windowing
• Two Popular Window Shapes
• The Effect of Applying Windows
• Breaking Up Incoming Speech Into Frames Using Hamming Windows
• Time and Frequency Response of Windows
• Speech Visualisation with Spectrograms
• Effect of Window Duration
• Using Filterbanks
– The Mel Scale
– The Bark Scale
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
2
The Fourier Transform – Example
• Example for the Fourier Transform:
Left: audio signal, right: frequency spectrum
The audio signal has been samples with 8 kHz
Thus the spectrum has a fundamental frequency of 4 KHz
The vertical axis in the frequency spectrum is logarithmically.
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
3
Examples of CTFTs and DTFTs
• Continuous-time decaying exponential:
0 10 20 30 40 50 60 70 80 90 1000
0.5
1
Time, seconds
0 5 10 15 20 25 30 35 40 45 500
0.5
1
Frequency, Hz
0 5 10 15 20 25 30 35 40 45 50-2
-1.5
-1
-0.5
0
Frequency, Hz
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
4 • Discrete-time decaying exponential:
0 5 10 15 20 25 30 35 40 45 500
0.5
1
Time, seconds
0 0.5 1 1.5 2 2.5 3 3.5 40.5
1
1.5
2
Frequency, Radians/pi
0 0.5 1 1.5 2 2.5 3 3.5 4-1
-0.5
0
0.5
1
Frequency, Radians/pi
Examples of CTFTs and DTFTs
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
5
Correspondence Between Frequency In Discrete And Continuous Time
• Suppose that a continuous-time signal is sampled at a rate
greater than the Nyquist rate, with a time between samples of T
• Let W represent continuous-time frequency in radians/sec
• Let represent discrete-time frequency in radians
• Then …
• Comment:
The maximum discrete-time frequency, w = p, corresponds to the
Nyquist frequency in continuous time, half the sampling rate
WT
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
6
Facts:
• The frequency distribution
over an entire utterance does
not help much for recognition.
• Most acoustic events (e.g.
phonemes) have durations in
the range of 10 to 100 ms.
• Many acoustic events are
not static (diphtongs) and
need more detailed analysis.
Solution:
• Partition the entire recording
in a sequence of short segments
• The segments may overlap each
other
Short-Term Spectral Analysis
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
7
Short-time Fourier Analysis
• Problem: Conventional Fourier analysis does not capture time-varying nature of speech signals
• Solution: Multiply signals by finite-duration window function, then compute DTFT:
• Side effect: Windowing causes spectral blurring
X(ej
) x[n]e jn
n
X[n, ] x[m]w[nm]e jm
m0
N1
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
8
Windows Shapes
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 4
9
Two Popular Window Shapes
• Rectangular window:
• Hamming window:
• Comment: Hamming window is frequently preferred because of its frequency response
w[n]1, 0 n N
0, otherwise
w[n].54 .46cos(2n / N), 0 n N
0,otherwise
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
0
• When we compute the Fourier transform of a wn-windowed signal
s(m)[n], we get:
• Now let W be the Fourier transform of the window,
then we can express the short time spectrum in terms of a convolution:
• So the application of a window to the speech signal before computing the
short time spectrum has the same effect as if we computed the
convolution of the original short time spectrum with the window's fourier
transform.
The Effect of Applying Windows
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
1
• A window has the effect of "smoothing" the resulting spectrum
• The stepsize (when to compute a short time spectrum) and the window size do not have
to be the same. (Often: window size = 2 · stepsize)
• Most often used window shape is Hamming window
Signal Fourier Transform
of the Window
Fourier Transform of
the Windowed Signal
Rectangle
Window
Hamming
Window
The Effect of Applying Windows
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
2
Breaking Up Incoming Speech Into Frames Using
Hamming Windows
20 40 60 80 100 120-0.01
0
0.01
0.02
20 40 60 80 100 120-0.01
0
0.01
0.02
20 40 60 80 100 120-0.01
0
0.01
0.02
20 40 60 80 100 120-0.01
0
0.01
0.02
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
3
Time and Frequency Response of Windows
0 10 20 30 400
0.2
0.4
0.6
0.8
1
n
Rectangular Window
0 0.2 0.4 0.6 0.8 1-30
-20
-10
0
10
20
30
40
/
0 10 20 30 400
0.2
0.4
0.6
0.8
1
n
Hamming Window
0 0.2 0.4 0.6 0.8 1-80
-60
-40
-20
0
20
40
/
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
4
Speech Visualisation with Spectrograms
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
5
Effect of Window Duration
Short-duration window: Long-Duration window:
Time
Fre
qu
en
cy
0 0.2 0.4 0.6 0.8 1 1.2
0
1000
2000
3000
4000
5000
Time
Fre
qu
en
cy
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
6
• All Fourier coefficients reflect too much of the signals microstructure
• The microstructure contains redundancies but also a lot of "misleading"
information
• Solution – Filterbanks: The human ear also works with "filterbanks"
• Different approaches to computing filterbank coefficients:
Fixed width
filters:
Variable
width:
Overlapping
filters:
Using Filterbanks
• Typical filterbanks: mel or bark scales
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
7 • Mel: Perceptual scale of pitches (MELody)
Proposed by Stevens Volkman and Newman 1937
Let people listen to tones and judge what pitch they perceive
Find pitches to be perceived equal in distance from one another
Compare it with the actual pitch and define a function
Result = melscale: 1 mel = 1127 log (1+ f/700)
Above 500Hz larger and larger
intervals are judged to be equal
pitch increments
(4 hertz octaves correspond to
two mel octaves)
Then: make filterbanks with equal
bandwidths in mels
The Mel Scale
Source: Wikipedia
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
8
• Experiment (by Barkhausen):
Let person listen to two frequencies
If the frequencies differ significantly, then person does notice this
If the frequencies are close, the person can hear only one frequency
The minimal frequency-distance (critical bandwidth) is called 1 bark
The scale ranges from 1 to 24 and corresponds to the first 24
critical bands of hearing.
The subsequent band edges are (in Hz) 20, 100, 200, 300, 400,
510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700,
3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500.
The Bark Scale
Sig
nal P
rocessin
g f
or
Speech A
pplic
ations
- 5
9
Thanks for your interest!