Signal Processing For Speech Applications - Part 1 · Signal Processing For Speech Applications ... Schafer/Rabiner in Waibel/Lee ... Speech signals are usually in the range between

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

May 10, 2012

Signal Processing

For Speech Applications

- Part 1

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

References:

• Huang et al., Chapter on DSP

• Classical paper: Schafer/Rabiner in Waibel/Lee (on the

web)

• Nahin: "Dr. Euler's Fabulous Formula" – excellent

explanation of Fourier sums and the Fourier Transform,

written for Engineering students

Note: many slides of this lecture are from Rich Stern

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

Signal Processing For Speech Applications

• Major speech technologies:

– Speech coding

– Speech synthesis

– Speech recognition

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

Goals Of Speech Representations

• Capture important phonetic information in speech

• Computational efficiency

• Efficiency in storage requirements

• Optimize generalization

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

5

Representation of Speech

Definition: Digital representation of speech

Represent speech as a sequences of numbers

(as a prerequisite for automatic processing using computers)

1) Direct representation of speech waveform:

represent speech waveform as accurate as possible so that an

acoustic signal can be reconstructed

2) Parametric representation

Represent a set of properties/parameters with regard to a certain model

Decide the targeted application first:

• Speech coding

• Speech synthesis

• Speech recognition

Classical paper: Schafer/Rabiner in Waibel/Lee (paper online)

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 6

Speech Coding

Objectives of Speech Coding:

– Quality versus bit rate

– Quantization Noise

– High measured intelligibility

– Low bit rate (b/s of speech)

– Low computational requirement

– Robustness to transmission errors

– Robustness to successive

encode/decode cycles

Objectives for real-time:

– Low coding/decoding delay

– Work with non-speech

signals (e.g. touch tone)

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 7

Speech Synthesis

The standard approach:

Speech Synthesis = Text-to-Speech conversion (don't confuse

that with Speech Recognition)

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 8

Speech Recognition

• Major functional components:

– Signal processing to extract relevant features from speech

waveforms

– Comparison of features to pre-stored templates

• Important design choices:

– Choice of features (Optimize Generalization)

– Specific method of comparing features to stored templates

Feature

extraction

Decision

making

procedure A/D

Hypotheses

(phonemes)

Speech

features Speech

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 9

Goals Of This Lecture

How do we accomplish feature extraction for speech recognition?

Some specific topics:

– Quantization (A/D Conversion)

– Sampling

– Filter Bank Coefficients

– Linear predictive coding (LPC)

– LPC-derived cepstral coefficients (LPCC)

– Mel-frequency cepstral coefficients (MFCC)

Some of the underlying mathematics

– Continuous-time Fourier transform (CTFT)

– Discrete-time Fourier transform (DTFT)

– Z-transforms

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

0

Overview (I)

• Introduction

– Why Perform Signal Processing?

• Different Kinds of Signals

– Quantization

• Quantization of Signals

• Quantization of Speech Signals

– Sampling of continuous-time signals

• How Frequently Should we Sample? – Aliasing

• The Sampling Theorem

• Acoustic Features in the Sampled Signal

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

1

Overview (II)

• Frequency Representations in Continuous and Discrete Time (I)

– Why Perform Signal Processing In The Frequency Domain?

– The Source-Filter Model For Speech

– Frequency Representation

– Fourier Sums

– Continuous-time Fourier Transform (CTFT)

• Features of the Fourier Transform

– Discrete Time Fourier Transform (DTFT)

– Convolution

• Convolution of a Function with an Impulse Train

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

2

Overview (III)

• Frequency Representations in Continuous and Discrete Time (II)

– The Fourier Transform – Example

– Examples of CTFTs and DTFTs

– Correspondence Between Frequency In Discrete And Continuous Time

– Short-Term Spectral Analysis

– Short-time Fourier Analysis

– Windowing

• Two Popular Window Shapes

• The Effect of Applying Windows

• Breaking Up Incoming Speech Into Frames Using Hamming Windows

• Time and Frequency Response of Windows

• Speech Visualisation with Spectrograms

• Effect of Window Duration

• Using Filterbanks

– The Mel Scale

– The Bark Scale

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

3 • Introduction

– Why Perform Signal Processing?

• Different Kinds of Signals

– Quantization

• Quantization of Signals

• Quantization of Speech Signals

– Sampling of continuous-time signals

• How Frequently Should we Sample? – Aliasing

• The Sampling Theorem

• Acoustic Features in the Sampled Signal

Overview

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

4

Why Perform Signal Processing?

• A look at the time-domain waveform of “six”:

How do we get this into a computer?

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

5

Sampling & Quantization

Goal: Given a signal that is continuous in time and amplitude,

find a discrete representation.

Two steps are necessary: sampling and quantization.

• Quantization corresponds to a discretization of the y-axis

• Sampling corresponds to a discretization of the x-axis

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

6

• Given a discrete signal f[i] to be quantized into q[i]

• Assume that f is between fmin and fmax

• Partition y-axis into a fixed number n of (equally sized) intervals

• Usually n=2b , in ASR typically b=16 n=65536 (16-bit quantization)

• q[i] can only have values that are centers of the intervals

• Quantization: assign q[i] the center of the interval in which lies f[i]

• Quantization makes errors, i.e. adds noise to the signal f[i]=q[i]+e[i]

• The average quantization error e[i] is (fmax-fmin)/(2n)

• Define signal to noise ratio SNR[dB] = power(f[i]) / power(e[i])

Quantization of Signals

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

7 • Choice of quantization depth:

Speech signals are usually in the range between 50 dB and 60 dB

The lower the SNR, the lower the speech recognition performance

To get a reasonable SNR, b should be at least 10 to 12

Each bit contributes to about 6db of SNR

Typically in ASR the samples are quantized with 16 bits

• Speech signals' amplitudes are not equally distributed

Speech amplitudes are exponentially distributed around their mean

So the information-theoretic optimum is not to partition the y-axis

into equal intervals

therefore, often used: µ-law encoding

f(µ)[n] = fmax · sgn(f[n]) · log(1+µ|f[n]|/fmax) / log(1+µ) , µ=100,...,500

µ-law encoding makes speech amplitudes equally distributed before

quantization

Quantization of Speech Signals

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

8

Sampling Continuous-time Signals

Sampling converts the analog signal (in the time domain) into a digital

representation (also in the time domain!)

Time domain:

• x-axis = time

• y-axis = amplitude

Questions:

• Do we lose information during sampling?

• How can we prevent it?

0 20 40 60 80 100 120-0.5

0

0.5

1

1.5

Original speech signal

0 20 40 60 80 100 120-0.5

0

0.5

1

1.5

Sampled version of signal

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 1

9

How Frequently Should we Sample? – Aliasing

• The number of samples per unit of time (usually seconds) taken from

a continuous signal to make a discrete signal is called sampling rate

• The unit for sampling rate is hertz (inverse seconds, 1/s)

• Example: Correct sampling (above) /

undersampling (below)

– After undersampling we can not

(uniquely) recover the original frequency!

– This effect is called aliasing.

– How can we prevent aliasing?

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

1 • How can we prevent aliasing?

In practical applications, the signals are band-limited

Those band-limited signals do not contain any arbitrary frequency.

But there is a maximum frequency.

The band limitation can be ensured already during the recording process by

an appropriate A/D converter (with analog frequency filters)

When the sampling rate is too low, the samples can contain "incorrect"

frequencies.

This frequence 2fl is also called Nyquist rate.

The maximal frequency fl, which can be returned correctly is called

Nyquist frequency.

The Aliasing Effect

Nyquist or sampling theorem:

When a fl-band-limited signal is sampled with a sampling rate of at

least 2fl then the signal can be exactly reproduced from the samples.

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

2

Why Perform Signal Processing?

• A look at the time-domain waveform of “six”:

It’s hard to infer much from the time-domain waveform

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

3 • Amplitude (1) (not very useful)

The envelope of the amplitude is correlated to the power

(= integral of the squared signal)

Power is useful for detecting speech vs. silence,

also syllables, phrase boundaries, prosody

• Peak to peak (2) (correlated to amplitude)

• Root mean square (RMS) amplitude (3): Square root of the mean over time

(proportional to average power)

• Zero crossing rate can help distinguish some

weak sounds from silence

• Voiced/unvoiced homogeneous periodic signal vs. white noise in fricatives

Acoustic Features in the Sampled Signal

Source: Wikipedia

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

4 • By now we have considered the signal in the time domain,

i.e. we have shown the amplitude (signal strength) as a function of time

• Now we consider the signal in the frequency domain,

i.e. we want to know which frequency components occur in the signal

• Example for frequencies: A pure oscillation (left) correponds to a peak at

a specific frequency (right). If we mix

frequencies (bottom), the frequency

spectrum gets more complex

Sum of 2 wave functions Left: time domain Right: frequency domain (spectrum) Source: Ivica Rogina's lecture

The Frequency Domain

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

5 • A very common transform from

time domain to frequency domain:

Fourier Transform

• The frequency spectrum of a

time-domain signal is a

representation of that signal

in the frequency domain.

Sum of 2 wave functions Left: time domain Right: frequency domain

The Frequency Domain

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

6 • Frequency Representations in Continuous and Discrete Time (I)

– Why Perform Signal Processing In The Frequency Domain?

– The Source-Filter Model For Speech

– Frequency Representation

– Fourier Sums

– Continuous-time Fourier Transform (CTFT)

• Features of the Fourier Transform

– Discrete Time Fourier Transform (DTFT)

– Convolution

• Convolution of a Function with an Impulse Train

Overview

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

7

Why Perform Signal Processing In The Frequency Domain?

• Use of frequency analysis often simplifies signal

processing

• Use of frequency analysis often facilitates

understanding

• Human hearing is based on frequency analysis

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

8

Frequency Representation

• Frequency representations in continuous and discrete time

– Continuous-time Fourier transform (CTFT)

– Discrete-time Fourier transform (DTFT)

– Short-time Fourier analysis

– Effects of window size and shape

– Z-transforms

• Feature extraction for speech recognition

• Signal processing for robust speech recognition

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 2

9

The Source-Filter Model For Speech

• Sounds are produced either by

- vibrating the vocal cords (voiced sounds) or

- random noise resulting from friction of the airflow (unvoiced sounds)

- voiced fricatives need a mixed excitation model

• Signal un is modulated by the vocal tract, whose impulse response is vn

• Resulting signal is modulated by the lips and nostrils' radiation response rn

• Eventually the resulting signal fn is emitted.

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

0 • A useful model for representing the generation

of speech sounds:

Pitch Pulse train

source

Noise source

Vocal tract model

Amplitude

p[n]

The Source-Filter Model For Speech

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

2 • For a continuous function of time f(t), we define its Fourier

Transform by

• is the complex oscillation (with the

frequency ω): For t between 0 and 2π/ω, the point eiωt

moves around the unit circle in the

complex plane. i is the imaginary unit.

Continuous-Time Fourier Transform

dtetfF ti )()(

tite ti sincos

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

3 • One can interpret F(ω) as the content of frequency ω in the

function f(t).

• The Fourier transform is invertible:

• This means that no information is lost by the transformation!

Continuous-Time Fourier Transform

deFtf ti)(2

1)(

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

4

We can express the Fourier transform F of a function f in polar coordinates:

Then

is called the spectrum,

and

is the amplitude spectrum

and

is the phase spectrum

and

is the power spectrum

• Often, when we say "spectrum" we actually mean "power spectrum".

• It is general consensus that the phase spectrum in not important for

speech recognition.

Continuous-Time Fourier Transform (CTFT)

)sin()cos( tite ti

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

5 • The Fourier transform does have some nice features:

Often, the effect of a channel on a signal can be described as a

convolution in the time domain.

Then the corresponding effect in the frequency domain becomes a

multiplication.

And in the log-frequency domain we get a simple summation.

Features of the Fourier Transform

Time shift

Frequency Shift

Convolution

Linearity

Differentiation

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

6 • We can transfer the idea of the Fourier transform to a discrete

signal. Then we get the Discrete Time Fourier Transform of a

sequence x[k]:

• DTFT is periodic with period 2 !

Thus we use only an interval of this length of for the inverse

transformation:

• Note: DTFT transforms a discrete signal into a continuous

function.

Discrete Time Fourier Transform (DTFT)

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

7

The Dirac distribution is defined

as a single impulse with the area 1.0.

Its Fourier transform is 1.0

A discrete function s[n]

can be expressed

as the weighted sum of

Dirac distributions:

Consequently, we can

define the

discrete Fourier

transform as a sum:

Discrete Time Fourier Transform (DTFT)

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

8 • The convolution of two functions f(t) * g(t) is defined as:

• or in the discrete case:

• Convolution is a commutative operation!

Convolution

dgtftgf )()())((

i

kgkifkgf ][][])[(

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 3

9 • Computing the Fourier Transform of

on both sides, we get

i.e. the convolution becomes a multiplication in frequency domain!

(Why?)

Convolution

dgtftgfth )()())(()(

)()()( GFH

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

0 • The convolution of a function with an impulse does not change the

function

• The convolution of a function with an impulse train causes identical

copies of the function, if the impulses are located far apart.

Convolution of a Function with an Impulse Train

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

1

Overview

• Frequency Representations in Continuous and Discrete Time (II)

– The Fourier Transform – Example

– Examples of CTFTs and DTFTs

– Correspondence Between Frequency In Discrete And Continuous Time

– Short-Term Spectral Analysis

– Short-time Fourier Analysis

– Windowing

• Two Popular Window Shapes

• The Effect of Applying Windows

• Breaking Up Incoming Speech Into Frames Using Hamming Windows

• Time and Frequency Response of Windows

• Speech Visualisation with Spectrograms

• Effect of Window Duration

• Using Filterbanks

– The Mel Scale

– The Bark Scale

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

2

The Fourier Transform – Example

• Example for the Fourier Transform:

Left: audio signal, right: frequency spectrum

The audio signal has been samples with 8 kHz

Thus the spectrum has a fundamental frequency of 4 KHz

The vertical axis in the frequency spectrum is logarithmically.

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

3

Examples of CTFTs and DTFTs

• Continuous-time decaying exponential:

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

Time, seconds

0 5 10 15 20 25 30 35 40 45 500

0.5

1

Frequency, Hz

0 5 10 15 20 25 30 35 40 45 50-2

-1.5

-1

-0.5

0

Frequency, Hz

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

4 • Discrete-time decaying exponential:

0 5 10 15 20 25 30 35 40 45 500

0.5

1

Time, seconds

0 0.5 1 1.5 2 2.5 3 3.5 40.5

1

1.5

2

Frequency, Radians/pi

0 0.5 1 1.5 2 2.5 3 3.5 4-1

-0.5

0

0.5

1

Frequency, Radians/pi

Examples of CTFTs and DTFTs

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

5

Correspondence Between Frequency In Discrete And Continuous Time

• Suppose that a continuous-time signal is sampled at a rate

greater than the Nyquist rate, with a time between samples of T

• Let W represent continuous-time frequency in radians/sec

• Let represent discrete-time frequency in radians

• Then …

• Comment:

The maximum discrete-time frequency, w = p, corresponds to the

Nyquist frequency in continuous time, half the sampling rate

WT

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

6

Facts:

• The frequency distribution

over an entire utterance does

not help much for recognition.

• Most acoustic events (e.g.

phonemes) have durations in

the range of 10 to 100 ms.

• Many acoustic events are

not static (diphtongs) and

need more detailed analysis.

Solution:

• Partition the entire recording

in a sequence of short segments

• The segments may overlap each

other

Short-Term Spectral Analysis

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

7

Short-time Fourier Analysis

• Problem: Conventional Fourier analysis does not capture time-varying nature of speech signals

• Solution: Multiply signals by finite-duration window function, then compute DTFT:

• Side effect: Windowing causes spectral blurring

X(ej

) x[n]e jn

n

X[n, ] x[m]w[nm]e jm

m0

N1

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

8

Windows Shapes

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 4

9

Two Popular Window Shapes

• Rectangular window:

• Hamming window:

• Comment: Hamming window is frequently preferred because of its frequency response

w[n]1, 0 n N

0, otherwise

w[n].54 .46cos(2n / N), 0 n N

0,otherwise

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

0

• When we compute the Fourier transform of a wn-windowed signal

s(m)[n], we get:

• Now let W be the Fourier transform of the window,

then we can express the short time spectrum in terms of a convolution:

• So the application of a window to the speech signal before computing the

short time spectrum has the same effect as if we computed the

convolution of the original short time spectrum with the window's fourier

transform.

The Effect of Applying Windows

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

1

• A window has the effect of "smoothing" the resulting spectrum

• The stepsize (when to compute a short time spectrum) and the window size do not have

to be the same. (Often: window size = 2 · stepsize)

• Most often used window shape is Hamming window

Signal Fourier Transform

of the Window

Fourier Transform of

the Windowed Signal

Rectangle

Window

Hamming

Window

The Effect of Applying Windows

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

2

Breaking Up Incoming Speech Into Frames Using

Hamming Windows

20 40 60 80 100 120-0.01

0

0.01

0.02

20 40 60 80 100 120-0.01

0

0.01

0.02

20 40 60 80 100 120-0.01

0

0.01

0.02

20 40 60 80 100 120-0.01

0

0.01

0.02

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

3

Time and Frequency Response of Windows

0 10 20 30 400

0.2

0.4

0.6

0.8

1

n

Rectangular Window

0 0.2 0.4 0.6 0.8 1-30

-20

-10

0

10

20

30

40

/

0 10 20 30 400

0.2

0.4

0.6

0.8

1

n

Hamming Window

0 0.2 0.4 0.6 0.8 1-80

-60

-40

-20

0

20

40

/

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

4

Speech Visualisation with Spectrograms

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

5

Effect of Window Duration

Short-duration window: Long-Duration window:

Time

Fre

qu

en

cy

0 0.2 0.4 0.6 0.8 1 1.2

0

1000

2000

3000

4000

5000

Time

Fre

qu

en

cy

0 0.2 0.4 0.6 0.8 1 1.20

1000

2000

3000

4000

5000

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

6

• All Fourier coefficients reflect too much of the signals microstructure

• The microstructure contains redundancies but also a lot of "misleading"

information

• Solution – Filterbanks: The human ear also works with "filterbanks"

• Different approaches to computing filterbank coefficients:

Fixed width

filters:

Variable

width:

Overlapping

filters:

Using Filterbanks

• Typical filterbanks: mel or bark scales

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

7 • Mel: Perceptual scale of pitches (MELody)

Proposed by Stevens Volkman and Newman 1937

Let people listen to tones and judge what pitch they perceive

Find pitches to be perceived equal in distance from one another

Compare it with the actual pitch and define a function

Result = melscale: 1 mel = 1127 log (1+ f/700)

Above 500Hz larger and larger

intervals are judged to be equal

pitch increments

(4 hertz octaves correspond to

two mel octaves)

Then: make filterbanks with equal

bandwidths in mels

The Mel Scale

Source: Wikipedia

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

8

• Experiment (by Barkhausen):

Let person listen to two frequencies

If the frequencies differ significantly, then person does notice this

If the frequencies are close, the person can hear only one frequency

The minimal frequency-distance (critical bandwidth) is called 1 bark

The scale ranges from 1 to 24 and corresponds to the first 24

critical bands of hearing.

The subsequent band edges are (in Hz) 20, 100, 200, 300, 400,

510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700,

3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500.

The Bark Scale

Sig

nal P

rocessin

g f

or

Speech A

pplic

ations

- 5

9

Thanks for your interest!

Documents

Signal Processing For Speech Applications - Part 1 · Signal Processing For Speech Applications ... Schafer/Rabiner in Waibel/Lee ... Speech signals are usually in the range between