
Model Based Speech Enhancement and Coding12190/FULLTEXT01.pdf · 2007-06-29 · system provide improved speech quality, particularly for non-stationary noise sources. In Paper C,


Thesis for the degree of Doctor of Philosophy

Model Based Speech Enhancement and

Coding

David Yuheng Zhao

赵羽珩

Sound and Image Processing Laboratory
School of Electrical Engineering

KTH (Royal Institute of Technology)

Stockholm 2007


Zhao, David Yuheng
Model Based Speech Enhancement and Coding

Copyright © 2007 David Y. Zhao except where otherwise stated. All rights reserved.

ISBN 978-91-628-7157-4
TRITA-EE 2007:018
ISSN 1653-5146

Sound and Image Processing Laboratory
School of Electrical Engineering
KTH (Royal Institute of Technology)
SE-100 44 Stockholm, Sweden
Telephone +46 (0)8-790 7790


Abstract

In mobile speech communication, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the received speech and increase the listening effort. This thesis focuses on countermeasures based on statistical signal processing techniques. The main body of the thesis consists of three research articles, targeting two specific problems: speech enhancement for noise reduction and flexible source coder design for unreliable networks.

Papers A and B consider speech enhancement for noise reduction. New schemes based on an extension to the auto-regressive (AR) hidden Markov model (HMM) for speech and noise are proposed. Stochastic models for the speech and noise gains (the excitation variance of an AR model) are integrated into the HMM framework in order to improve the modeling of energy variation. The extended model is referred to as a stochastic-gain hidden Markov model (SG-HMM). The speech gain describes the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain improves the tracking of the time-varying energy of non-stationary noise, e.g., due to movement of the noise source. In Paper A, it is assumed that prior knowledge of the noise environment is available, so that a pre-trained noise model is used. In Paper B, the noise model is adaptive and the model parameters are estimated on-line from the noisy observations using a recursive estimation algorithm. Based on the speech and noise models, a novel Bayesian estimator of the clean speech is developed in Paper A, and an estimator of the noise power spectral density (PSD) in Paper B. It is demonstrated that the proposed schemes achieve more accurate models of speech and noise than traditional techniques and, as part of a speech enhancement system, provide improved speech quality, particularly for non-stationary noise sources.

In Paper C, a flexible entropy-constrained vector quantization scheme based on a Gaussian mixture model (GMM), lattice quantization, and arithmetic coding is proposed. The method allows the average rate to be changed in real time and facilitates adaptation to the currently available bandwidth of the network. A practical solution to the classical issue of indexing and entropy-coding the quantized code vectors is given. The proposed scheme has a computational complexity that is independent of rate and quadratic in the vector dimension. Hence, the scheme can be applied to the quantization of source vectors in a high-dimensional space. The theoretical performance of the scheme is analyzed under a high-rate assumption. It is shown that, at high rate, the scheme approaches the theoretically optimal performance if the mixture components are located far apart. The practical performance of the scheme is confirmed through simulations on both synthetic and speech-derived source vectors.

Keywords: statistical model, Gaussian mixture model (GMM), hidden Markov model (HMM), noise reduction, speech enhancement, vector quantization.



List of Papers

The thesis is based on the following papers:

[A] D. Y. Zhao and W. B. Kleijn, "HMM-based gain-modeling for enhancement of speech in noise," IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 3, pp. 882–892, Mar. 2007.

[B] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, "On-line noise estimation using stochastic-gain HMM for speech enhancement," IEEE Trans. Audio, Speech and Language Processing, submitted, Apr. 2007.

[C] D. Y. Zhao, J. Samuelsson, and M. Nilsson, "Entropy-constrained vector quantization using Gaussian mixture models," IEEE Trans. Communications, submitted, Apr. 2007.



In addition to papers A-C, the following papers have also been produced in part by the author of the thesis:

[1] D. Y. Zhao, J. Samuelsson, and M. Nilsson, "GMM-based entropy-constrained vector quantization," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Apr. 2007, pp. 1097–1100.

[2] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn, "Low complexity, non-intrusive speech quality assessment," IEEE Trans. Audio, Speech and Language Processing, vol. 14, pp. 1948–1956, Nov. 2006.

[3] ——, "Non-intrusive speech quality assessment with low computational complexity," in Interspeech - ICSLP, Sep. 2006.

[4] D. Y. Zhao and W. B. Kleijn, "HMM-based speech enhancement using explicit gain modeling," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, May 2006, pp. 161–164.

[5] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, "Method and apparatus for improved estimation of non-stationary noise for speech enhancement," 2005, provisional patent application filed by GN ReSound.

[6] D. Y. Zhao and W. B. Kleijn, "On noise gain estimation for HMM-based speech enhancement," in Proc. Interspeech, Sep. 2005, pp. 2113–2116.

[7] ——, "Multiple-description vector quantization using translated lattices with local optimization," in Proc. IEEE Global Telecommunications Conference, vol. 1, 2004, pp. 41–45.

[8] J. Plasberg, D. Y. Zhao, and W. B. Kleijn, "The sensitivity matrix for a spectro-temporal auditory model," in Proc. 12th European Signal Processing Conf. (EUSIPCO), 2004, pp. 1673–1676.



Acknowledgements

Approaching the end of my Ph.D. studies, I would like to thank my supervisor, Prof. Bastiaan Kleijn, for introducing me to the world of academic research. Your creativity, dedication, professionalism and hard-working nature have greatly inspired and influenced me. I also appreciate your patience with me, and that you spent many hours of your valuable time correcting my English mistakes.

I am grateful to all my current and past colleagues at the Sound and Image Processing lab. Unfortunately, the list of names is too long to be given here. Thank you for creating a pleasant and stimulating work environment. I particularly enjoyed the cultural diversity and the feeling of an international "family" atmosphere. I am thankful to my co-authors Dr. Jonas Samuelsson, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Jan Plasberg, Dr. Jonas Lindblom, Dr. Alexander Ypma, and Dr. Bert de Vries. Thank you for our fruitful collaborations. Furthermore, I would like to express my gratitude to Prof. Bastiaan Kleijn, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Tiago Falk and Anders Ekman for proofreading the introduction of the thesis.

During my Ph.D. studies, I was involved in projects financially supported by GN ReSound and the Foundation for Strategic Research. I would like to thank Dr. Alexander Ypma, Dr. Bert de Vries, Prof. Mikael Skoglund, and Prof. Gunnar Karlsson for your encouragement and friendly discussions during the projects.

Thanks to my Chinese fellow students and friends at KTH, particularly Xi, Jing, Lei, and Jinfeng. I feel very lucky to have gotten to know you. The fun we have had together is unforgettable.

I would also like to express my deep gratitude to my family: my wife Lili, for your encouragement, understanding and patience; my beloved kids, Andrew and the one yet to be named, for the joy and hope you bring. I thank my parents, for your sacrifice and your endless support; my brother, for all the adventure and joy we have together; my mother-in-law, for taking care of Andrew. Last but not least, I am grateful to our host family, Harley and Eva, for your kind hearts and unconditional friendship.

David Zhao, Stockholm, May 2007



Contents

Abstract

List of Papers

Acknowledgements

Contents

Acronyms

I Introduction
1 Statistical models
    1.1 Short-term models
    1.2 Long-term models
    1.3 Model-parameter estimation
2 Speech enhancement for noise reduction
    2.1 Overview
    2.2 Short-term model based approach
    2.3 Long-term model based approach
3 Flexible source coding
    3.1 Source coding
    3.2 High-rate theory
    3.3 High-rate quantizer design
4 Summary of contributions
References

II Included papers

A HMM-based Gain-Modeling for Enhancement of Speech in Noise
1 Introduction



2 Signal model
    2.1 Speech model
    2.2 Noise model
    2.3 Noisy signal model
3 Speech estimation
4 Off-line parameter estimation
5 On-line parameter estimation
6 Experiments and results
    6.1 System implementation
    6.2 Experimental setup
    6.3 Evaluation of the modeling accuracy
    6.4 Objective evaluation of MMSE waveform estimators
    6.5 Objective evaluation of sample spectrum estimators
    6.6 Perceptual quality evaluation
7 Conclusions
Appendix I. EM based solution to (12)
Appendix II. Approximation of f_s(g'_n | x_n)
References

B On-line Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement
1 Introduction
2 Noise model parameter estimation
    2.1 Signal model
    2.2 On-line parameter estimation
    2.3 Safety-net state strategy
    2.4 Discussion
3 Noise power spectrum estimation
4 Experiments and results
    4.1 System implementation
    4.2 Experimental setup
    4.3 Evaluation of noise model accuracy
    4.4 Evaluation of noise PSD estimate
    4.5 Evaluation of speech enhancement performance
    4.6 Evaluation of safety-net state strategy
5 Conclusions
Appendix I. Derivation of (13-14)
References

C Entropy-Constrained Vector Quantization Using Gaussian Mixture Models
1 Introduction
2 Preliminaries
    2.1 Problem formulation
    2.2 High-rate ECVQ
    2.3 Gaussian mixture modeling
3 Quantizer design
4 Theoretical analysis
    4.1 Distortion analysis
    4.2 Rate analysis
    4.3 Performance bounds
    4.4 Rate adjustment
5 Implementation
    5.1 Arithmetic coding
    5.2 Encoding for the Z-lattice
    5.3 Decoding for the Z-lattice
    5.4 Encoding for arbitrary lattices
    5.5 Decoding for arbitrary lattices
    5.6 Lattice truncation
6 The algorithm
7 Complexity
8 Experiments and results
    8.1 Artificial source
    8.2 Speech derived source
9 Summary
References


Acronyms

3GPP The Third Generation Partnership Project

AMR Adaptive Multi-Rate

AR Auto-Regressive

AR-HMM Auto-Regressive Hidden Markov Model

CCR Comparison Category Rating

CDF Cumulative Distribution Function

DFT Discrete Fourier Transform

ECVQ Entropy-Constrained Vector Quantizer

EM Expectation Maximization

EVRC Enhanced Variable Rate Codec

FFT Fast Fourier Transform

GMM Gaussian Mixture Model

HMM Hidden Markov Model

IEEE Institute of Electrical and Electronics Engineers

I.I.D. Independent and Identically-Distributed

IP Internet Protocol

KLT Karhunen-Loève Transform

LL Log-Likelihood

LP Linear Prediction

LPC Linear Prediction Coefficient

LSD Log-Spectral Distortion



LSF Line Spectral Frequency

MAP Maximum A-Posteriori

ML Maximum Likelihood

MMSE Minimum Mean Square Error

MOS Mean Opinion Score

MSE Mean Square Error

MVU Minimum Variance Unbiased

PCM Pulse-Code Modulation

PDF Probability Density Function

PESQ Perceptual Evaluation of Speech Quality

PMF Probability Mass Function

PSD Power Spectral Density

RCVQ Resolution-Constrained Vector Quantizer

REM Recursive Expectation Maximization

SG-HMM Stochastic-Gain Hidden Markov Model

SMV Selectable Mode Vocoder

SNR Signal-to-Noise Ratio

SSNR Segmental Signal-to-Noise Ratio

SS Spectral Subtraction

STSA Short-Time Spectral Amplitude

VAD Voice Activity Detector

VoIP Voice over IP

VQ Vector Quantizer

WF Wiener Filtering

WGN White Gaussian Noise



Part I

Introduction


Introduction

The advance of modern information technology is revolutionizing the way we communicate. Global deployment of mobile communication systems has enabled speech communication from and to nearly every corner of the world. Speech communication systems based on the Internet protocol (IP), so-called voice over IP (VoIP), are rapidly emerging. The new technologies allow for communication over the wide range of transport media and protocols that is associated with the increasingly heterogeneous communication network environment. In such a complex communication infrastructure, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the perceived speech and increase the required listening effort. The demand for better quality in such situations has given rise to new problems and challenges. In this thesis, two specific problems are considered: enhancement of speech in noisy environments and source coder design for unreliable network conditions.

[Figure 1: block diagram. The speaker's signal, with added noise, passes through enhancement, source encoding and channel encoding into the network; at the receiver, channel decoding and source decoding deliver the signal to the listener, with a feedback path from the network.]

Figure 1: A one-directional example of speech communication in a noisy environment.

Figure 1 illustrates an adverse speech communication scenario that is


typical of mobile communications. The speech signal is recorded in an environment with a high level of ambient noise, such as on a street, in a restaurant, or on a train. In such a scenario, the recorded speech signal is contaminated by additive noise and has degraded quality and intelligibility compared to the clean speech. Speech quality is a subjective measure of how pleasant the signal sounds to listeners. It can be related to the amount of effort required by listeners to understand the speech. Speech intelligibility can be measured objectively as the percentage of sounds that are correctly understood by listeners. Naturally, both speech quality and intelligibility are important factors in the subjective rating of a speech signal by a listener. The level of background noise may be suppressed by a speech enhancement system. However, suppression of the background noise may lead to reduced speech intelligibility, due to the speech distortion associated with excessive noise suppression. Therefore, most practical speech enhancement systems for noise reduction are designed to improve the speech quality while preserving the speech intelligibility [6, 60, 61]. Besides speech communication, speech enhancement for noise reduction may be applied to a wide range of other speech processing applications in adverse situations, including speech recognition, hearing aids and speaker identification.

Another limiting factor for the quality of speech communication is the network condition. This is a particularly relevant issue for wireless communication, as the network capacity may vary depending on interference from the physical environment. The same issue occurs in heterogeneous networks with a large variety of communication links. Traditional source coder design is often optimized for a particular network scenario, and may not perform well in another network. To allow communication with the best possible quality in a new network, it is necessary to adapt the coding strategy, e.g., the coding rate, to the network condition.

This thesis concerns statistical signal processing techniques that facilitate better solutions to the aforementioned problems. Formulated in a statistical framework, speech enhancement is an estimation problem. Source coding can be seen as a constrained optimization problem that optimizes the coder under certain rate constraints. Statistical modeling of speech (and noise) is a key element in solving both problems.

The techniques proposed in this thesis are based on a hidden Markov model (HMM) or a Gaussian mixture model (GMM). Both are generic models capable of modeling complex data sources such as speech. In Papers A and B, we propose an extension to the HMM that allows for more accurate modeling of speech and noise. We show that the improved models also lead to improved speech enhancement performance. In Paper C, we propose a flexible source coder design for entropy-constrained vector quantization. The proposed coder design is based on Gaussian mixture modeling of the source.

The introduction is organized as follows. In section 1, an overview of


[Figure 2: histogram plot; x-axis: time-domain speech amplitude (−1 to 1), y-axis: number of samples (log scale, 10^0 to 10^6).]

Figure 2: Histogram of time-domain speech waveform samples. Generated using 100 utterances from the TIMIT database, down-sampled to 8 kHz, normalized between −1 and 1, and 2048 bins.

statistical modeling for speech is given. We discuss short-term models for modeling the local statistics of a speech frame, and long-term models (such as an HMM or a GMM) that describe the long-term statistics spanning multiple frames. Parameter estimation using the expectation-maximization (EM) algorithm is discussed. In section 2, we introduce speech enhancement techniques for noise reduction, with our focus on statistical model-based methods. Different approaches using short-term and long-term models are reviewed and discussed. Finally, in section 3, an overview of theory and practice for flexible source coder design is given. We focus on the high-rate theory and coders derived from the theory.

1 Statistical models

Statistical modeling of speech is a relevant topic for nearly all speech processing applications, including coding, enhancement, and recognition. A statistical model can be seen as a parameterized set of probability density functions (PDFs). Once the parameters are determined, the model provides a PDF of the speech signal that describes the stochastic behavior of the signal. A PDF is useful for designing optimal quantizers [64, 91, 128], estimators [35, 37–39, 90], and recognizers [11, 105, 108] in a statistical framework.

One can make statistical models of speech in different representations. The simplest applies directly to the digital speech samples in the waveform domain (analog speech is outside the scope of this thesis). For instance, a histogram of speech waveform samples, as shown in Figure 2, provides


the long-term averaged distribution of the amplitudes of speech samples. A similar figure is shown in [135]. The distribution is useful for designing an entropy code for a pulse-code modulation (PCM) system. However, in such a model, statistical dependencies among the speech samples are not explicitly modeled.
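To illustrate how such an amplitude histogram can be used for entropy-code design, the sketch below estimates the empirical entropy (a lower bound on the average codeword length in bits per sample) from a histogram. The Laplacian source is an illustrative stand-in for speech amplitudes, not the TIMIT data behind Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for speech samples: Laplacian amplitudes clipped to [-1, 1]
# (illustrative assumption; Figure 2 was generated from TIMIT utterances)
x = np.clip(rng.laplace(scale=0.05, size=200_000), -1.0, 1.0)

# Histogram over 2048 bins, as in Figure 2
counts, _ = np.histogram(x, bins=2048, range=(-1.0, 1.0))
pmf = counts / counts.sum()

# Empirical entropy in bits per sample: a lower bound on the average
# rate of an entropy code for the quantized PCM samples
nonzero = pmf > 0
entropy_bits = -np.sum(pmf[nonzero] * np.log2(pmf[nonzero]))
```

A peaked (supergaussian-like) amplitude distribution yields an entropy far below the 11 bits/sample that a uniform 2048-level quantizer would need, which is precisely what an entropy code exploits.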

It is more common to model the speech signal as a stochastic process that is locally stationary within 20-30 ms frames. The speech signal is then processed on a frame-by-frame basis. Statistical models in speech processing can be roughly divided into two classes, short-term and long-term signal models, depending on their scope of operation. A short-term model describes the statistics of the vector for a particular frame. As the statistics change over the frames, the parameters of a short-term model have to be determined for each frame. On the other hand, a model that describes the statistics over multiple signal frames is referred to as a long-term model.

1.1 Short-term models

It is essential to model the sample dependencies within a speech frame of 20-30 ms. This short-term dependency determines the formants of the speech frame, which are essential for recognizing the associated phoneme. Modeling of this dependency is therefore of interest to many speech applications.

Auto-regressive model

An efficient tool for modeling the short-term dependencies in the time domain is the auto-regressive (AR) model. Let x[i] denote the discrete-time speech signal; the AR model of x[i] is defined as

x[i] = −∑_{j=1}^{p} α[j] x[i−j] + e[i],    (1)

where the α[j] are the AR model parameters, also called the linear prediction coefficients, p is the model order, and e[i] is the excitation signal, modeled as white Gaussian noise (WGN). An AR process is then modeled as the output of filtering WGN through an all-pole AR model filter.
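As a concrete illustration of (1), the following sketch generates one frame of an AR process by filtering white Gaussian noise through the all-pole filter 1/A(z). The AR(2) coefficients, excitation variance, and frame length are illustrative assumptions, not values from the thesis.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

# Hypothetical AR(2) coefficients alpha[1..p] (stable poles) and excitation
# standard deviation sigma; illustrative values only.
alpha = np.array([-1.5, 0.7])    # x[i] = 1.5 x[i-1] - 0.7 x[i-2] + e[i]
sigma = 1.0
K = 160                          # one 20 ms frame at 8 kHz

e = sigma * rng.standard_normal(K)   # white Gaussian excitation e[i]
# All-pole filter 1/A(z) with A(z) = 1 + alpha[1] z^-1 + ... + alpha[p] z^-p
x = lfilter([1.0], np.concatenate(([1.0], alpha)), e)
```

The output x satisfies the recursion (1) sample by sample (with zero initial conditions), which is exactly the source-filter view used throughout this section.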

For a signal frame of K samples, the vector x = [x[1], . . . , x[K]] can be seen as a linear transformation of a WGN vector e = [e[1], . . . , e[K]], by considering the filtering process as a convolution formulated as a matrix transform. Consequently, the PDF of x can be modeled as a multi-variate Gaussian PDF, with zero mean and a covariance matrix parameterized using the AR model parameters. The AR Gaussian density function is defined as

f(x) = 1/(2πσ²)^{K/2} · exp(−(1/2) xᵀ D⁻¹ x),    (2)


[Figure 3: block diagram; a pulse-train generator (driven by the pitch period) and a noise generator each pass through a gain and feed a time-varying filter that produces the speech signal.]

Figure 3: Schematic diagram of a source-filter speech production model.

with the covariance matrix D = σ²(AᴴA)⁻¹, where A is a K × K lower triangular Toeplitz matrix whose first column has the AR coefficients [α[0], α[1], α[2], . . . , α[p]]ᵀ, α[0] = 1, as its first p + 1 elements and zeros elsewhere, ᴴ denotes the Hermitian transpose, and σ² denotes the excitation variance. Due to the structure of A, the covariance matrix D has the determinant σ^{2K}.
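The construction of A and D can be checked numerically. The small sketch below (with illustrative AR(2) coefficients, not values from the thesis) builds the lower triangular Toeplitz matrix A and verifies that det(D) = σ^{2K}, as stated above.

```python
import numpy as np
from scipy.linalg import toeplitz

alpha = np.array([-1.5, 0.7])   # illustrative AR(2) coefficients alpha[1..p]
sigma2 = 2.0                    # excitation variance sigma^2 (illustrative)
K = 32                          # small frame length for the check
p = len(alpha)

# Lower triangular Toeplitz A: first column [1, alpha[1..p], 0, ..., 0],
# first row [1, 0, ..., 0]
col = np.zeros(K)
col[0] = 1.0
col[1:p + 1] = alpha
A = toeplitz(col, np.zeros(K))

# Covariance of the AR Gaussian model, D = sigma^2 (A^T A)^-1 (real signal)
D = sigma2 * np.linalg.inv(A.T @ A)

# det(A) = 1 (unit diagonal), so det(D) = sigma2**K, i.e. sigma^(2K)
sign, logdet = np.linalg.slogdet(D)
```

Because A has a unit diagonal, its determinant is one regardless of the AR coefficients, which is why the determinant of D depends only on the excitation variance.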

Applying the AR model to speech processing is often motivated and supported by the physical properties of speech production [43]. A commonly used model of speech production is the source-filter model shown in Figure 3. The excitation signal, modeled using a pulse train for a voiced sound or Gaussian noise for an unvoiced sound, is filtered through a time-varying all-pole filter that models the vocal tract. Due to the absence of a pitch model, the AR Gaussian model is more accurate for unvoiced sounds.

Frequency-domain models

The spectral properties of a speech signal can be analyzed in the frequency domain through the short-time discrete Fourier transform (DFT). On a frame-by-frame basis, each speech segment is transformed into the frequency domain by applying a sliding window to the samples and a DFT to the windowed signal.

A commonly used model is the complex Gaussian model, motivated by the asymptotic properties of the DFT coefficients. Assuming that the DFT length approaches infinity and that the span of correlation of the signal frame is short compared to the DFT length, the DFT coefficients may be considered as a weighted sum of weakly dependent random samples [51, 94, 99, 100]. Using the central limit theorem for weakly dependent variables [12, 130], the DFT coefficients across frequency can be considered as independent


zero-mean Gaussian random variables. For k ∉ {1, K/2 + 1}, the PDF of the DFT coefficient of the k'th frequency bin, X[k], is given by

f(X[k]) = 1/(πλ²[k]) · exp(−|X[k]|² / λ²[k]),    (3)

where λ²[k] = E[|X[k]|²] is the power spectrum. For k ∈ {1, K/2 + 1}, X[k] is real and has a Gaussian distribution with zero mean and variance E[|X[k]|²]. Due to the conjugate symmetry of the DFT for real signals, only half of the DFT coefficients need to be considered.

One consequence of the model is that each frequency bin may be analyzed and processed independently. Although, for typical speech processing DFT lengths, the independence assumption does not always hold, it is widely used, since it significantly simplifies the algorithm design and reduces the associated computational complexity. For instance, the complex Gaussian model has been successfully applied in speech enhancement [38, 39, 92].
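A minimal sketch of such per-bin processing is a textbook Wiener gain applied independently to each DFT bin. The speech and noise PSDs below are synthetic illustrative shapes, not estimates from any real enhancement system, and the noisy frame is drawn from the complex Gaussian model of (3).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 256  # DFT length

# Hypothetical per-bin PSDs (illustrative shapes, not measured data)
speech_psd = 1.0 / (1.0 + (np.arange(K) / 40.0) ** 2)   # low-pass-like speech spectrum
noise_psd = np.full(K, 0.1)                             # flat noise floor

# One noisy frame in the DFT domain: independent zero-mean complex Gaussian
# bins with total variance speech_psd + noise_psd (real/imag parts each half)
X = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) \
    * np.sqrt((speech_psd + noise_psd) / 2.0)

# Under the per-bin independence assumption, the Wiener gain is applied bin by bin
H = speech_psd / (speech_psd + noise_psd)
S_hat = H * X
```

Since 0 < H[k] < 1 for every bin, the estimator only attenuates; bins where the assumed noise floor dominates are suppressed most, which is the essence of spectral-domain noise reduction.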

The frequency domain model of an AR process can be derived under the asymptotic assumption. Assuming that the frame length approaches infinity, A is well approximated by a circulant matrix (neglecting the frame boundary) and is approximately diagonalized by the discrete Fourier transform. Therefore, the frequency domain AR model is of the form of (3), and the power spectrum λ²[k] is given by

λ²_AR[k] = σ² / |∑_{n=0}^{p} α[n] e^{−j2πnk/K}|².    (4)
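Equation (4) can be evaluated by zero-padding the coefficient vector [α[0], . . . , α[p]] to the DFT length and taking its DFT; a sketch with illustrative coefficients (not values from the thesis):

```python
import numpy as np

alpha = np.array([1.0, -1.5, 0.7])   # [alpha[0], ..., alpha[p]], alpha[0] = 1 (illustrative)
sigma2 = 1.0                         # excitation variance sigma^2
K = 256                              # DFT length

# lambda2_AR[k] = sigma^2 / |sum_{n=0}^{p} alpha[n] e^{-j 2 pi n k / K}|^2
A_f = np.fft.fft(alpha, n=K)         # zero-padded DFT of the AR polynomial
lambda2 = sigma2 / np.abs(A_f) ** 2
```

For real AR coefficients the resulting power spectrum is conjugate-symmetric, λ²[k] = λ²[K−k], so only the first K/2 + 1 bins carry distinct information, consistent with the remark after (3).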

It has been argued [89, 103, 118] that supergaussian density functions, such as the Laplace density and the two-sided Gamma density, fit speech DFT coefficients better. However, the experimental studies were based on the long-term averaged behavior of speech, and may not necessarily hold for short-time DFT coefficients. Speech enhancement methods based on supergaussian density functions have been proposed in [19, 85, 86, 89, 90]. The comprehensive evaluations in [90] suggest that supergaussian methods achieve consistent but small improvements in segmental SNR relative to the Wiener filtering approach (which is based on the Gaussian model). However, in terms of perceptual quality, the Ephraim-Malah short-time spectral amplitude estimators [38, 39], based on the Gaussian model, were found to provide more pleasant residual noise in the processed speech [90].

1.2 Long-term models

By a long-term model, we mean a model that describes the long-term statistics spanning multiple signal frames of 20-30 ms. The short-term models in section 1.1 apply to one signal frame only, and the model


parameters obtained for a frame may not generalize to another frame. A long-term model, on the other hand, is generic and flexible enough to allow statistical variations in different signal frames. To model different short-term statistics in different frames, a long-term model comprises multiple states, each containing a sub-model describing the short-term statistics of a frame. Applied to speech modeling, each sub-model represents a particular class of speech frames that are considered statistically similar. Some of the most well-known long-term models for speech processing are the hidden Markov model (HMM) and the Gaussian mixture model (GMM).

Hidden Markov model

The hidden Markov model (HMM) was originally proposed by Baum and Petrie [13] as probabilistic functions of finite state Markov chains. Applications of HMMs in speech processing can be found in automatic speech recognition [11, 65, 105] and speech enhancement [35, 36, 111, 147].

An HMM models the statistics of an observation sequence, x1, . . . , xN. The model consists of a finite number of states that are not observable, and the transition from one state to another is controlled by a first-order Markov chain. The observation variables are statistically dependent on the underlying states, and may be considered as the states observed through a noisy channel. Through the Markov dependence of the hidden states, the observations become statistically dependent on each other. For a realization of the state sequence, let s = [s1, . . . , sN], where sn denotes the state of frame n, let a_{s_{n−1} s_n} denote the transition probability from state s_{n−1} to s_n, with a_{s0 s1} being the initial state probability, and let f_{s_n}(xn) denote the output probability of xn for a given state sn. The PDF of the observation sequence x1, . . . , xN of an HMM is then given by

f(x1, . . . ,xN ) =∑

s∈S

N∏

n=1

asn−1snfsn

(xn), (5)

where the summation is over the set of all possible state sequences S. The joint probability of a state sequence s and the observation sequence x1, . . . , xN consists of the product of the transition probabilities and the output probabilities. Finally, the PDF of the observation sequence is the sum of the joint probabilities over all possible state sequences.
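In practice, eq. (5) is never evaluated by enumerating all |S|^N state sequences; the forward recursion computes the same quantity in O(N·S²) operations. The following sketch (my own toy example, with the output probabilities f_{s_n}(xn) assumed pre-evaluated per frame) verifies this against the brute-force sum:

```python
import itertools
import numpy as np

def hmm_likelihood(init, trans, out_prob):
    """Forward recursion evaluating eq. (5).

    init     : initial state probabilities a_{s0 s1}, shape (S,)
    trans    : transition matrix a_{s_{n-1} s_n}, shape (S, S)
    out_prob : pre-evaluated output probabilities f_s(x_n), shape (N, S)
    """
    alpha = init * out_prob[0]                 # forward variable, frame 1
    for n in range(1, len(out_prob)):
        alpha = (alpha @ trans) * out_prob[n]  # propagate, then weight
    return float(alpha.sum())                  # sum over final states

# Two-state toy model (all numbers made up for illustration).
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
outp = np.array([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]])
lik = hmm_likelihood(init, trans, outp)

# Brute-force eq. (5): explicit sum over all 2^3 state sequences.
brute = 0.0
for s in itertools.product(range(2), repeat=3):
    p = init[s[0]] * outp[0, s[0]]
    for n in range(1, 3):
        p *= trans[s[n - 1], s[n]] * outp[n, s[n]]
    brute += p
```

The two results agree to machine precision, which is exactly the statement that the forward recursion factorizes the sum in eq. (5).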

Applied to speech modeling, each sub-model of a state represents the short-term statistics of a frame. Using the frequency domain Gaussian model (3), each state consists of a power spectral density representing a particular class of speech sounds (with similar spectral properties). The Markov chain of states models the temporal evolution of short-term speech power spectra.


Figure 4: Examples of a one-dimensional GMM (left, showing f(x2) and the conditional f(x2 | x1 = 0.5)) and a two-dimensional GMM (right, showing the joint PDF f(x1, x2)).

Gaussian mixture model

Gaussian mixture modeling is a convenient tool useful in a wide range of applications. An early application of the GMM was found in adaptive block quantization applied to image coding [131]. More recent examples of applications include clustering analysis [150], image coding [126], image retrieval [55], multiple description coding [115], speaker identification [108], speech coding [80, 112, 114], speech quality estimation [42], speech recognition [33, 34], and vector quantization [58, 127, 128, 148].

A Gaussian mixture model (GMM) is defined as a weighted sum of Gaussian densities,

f(x) = Σ_{m=1}^{M} ρ_m f_m(x), (6)

where m denotes the component index, ρ_m denotes the m'th mixture weight, and f_m(x) are the component Gaussian PDFs. Examples of a one-dimensional GMM and a two-dimensional GMM are shown in Figure 4.

A GMM can be interpreted in two conceptually different ways. It can be seen as a generic PDF that can approximate an unknown PDF by adjusting the model parameters over a database of observations. Another interpretation of a GMM is as a special case of an HMM, obtained by assuming independent and identically distributed (i.i.d.) random variables, such that f(x1, . . . , xN) = f(x1) . . . f(xN). The model can then be formulated as an HMM by assigning the initial state probabilities according to the mixture component weights, and letting each row of the state transition probability matrix equal the vector of mixture weights, so that the state sequence is i.i.d. In fact, to generate random


observations from a GMM, one can first generate an i.i.d. state sequence according to the mixture component weights, and then generate the observation sequence according to the component PDF of each state sample. The generated observations will be distributed according to the GMM. Applied to speech modeling, each component may represent a class of speech sounds, the weights correspond to the probability of their appearance, and each component PDF describes the power spectrum of the sound class.
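The two-step sampling procedure just described can be sketched as follows (a toy one-dimensional example; all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component 1-D GMM: weights ρ_m, means and standard deviations.
rho = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sd = np.array([0.5, 1.0])

# Step 1: draw an i.i.d. "state" sequence from the mixture weights
# (the degenerate-HMM view described above).
states = rng.choice(2, size=50_000, p=rho)

# Step 2: draw each observation from the Gaussian of its state.
x = rng.normal(mu[states], sd[states])

# The sample mean converges to the mixture mean Σ_m ρ_m μ_m.
mix_mean = float(rho @ mu)
```

With 50 000 samples, the empirical mean of `x` matches the mixture mean (0.1 here) to within a few hundredths.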

The main difference between a GMM and an HMM with Gaussian sub-models is that, in a GMM, consecutive frames are considered independent, while in an HMM, the Markov assumption on the underlying state sequence models their temporal dependency. An HMM is therefore a more general and powerful model than a GMM. Nevertheless, GMMs are popular models in speech processing due to their simplicity in analysis and good modeling capabilities. In fact, GMMs are often used as sub-models of an HMM in practical implementations of HMM based methods.

1.3 Model-parameter estimation

Using an HMM or a GMM, the joint PDF of a sequence x1, . . . , xN, denoted f(x1, . . . , xN), is parameterized by unknown parameters, denoted by θ. Each value of θ corresponds to one particular PDF of x1, . . . , xN. Since the true PDF is unknown, the value of θ is to be determined from data. Let θ̂ denote the estimated parameters, as a function of x1, . . . , xN. A commonly used criterion for an optimal estimator is that 1) the estimator is unbiased, i.e.,

E[θ̂] = θ, (7)

and 2) the estimator gives an estimate that has the minimum variance. Such an estimator is referred to as the minimum variance unbiased (MVU) estimator [67]. Unfortunately, the MVU estimator is difficult to determine in many practical applications. An alternative estimator is the maximum likelihood (ML) estimator, which maximizes the likelihood function or, equivalently, the logarithm of the likelihood function (the log-likelihood function),

θ̂ = arg max_θ f(x1, . . . , xN | θ) (8)
  = arg max_θ log f(x1, . . . , xN | θ), (9)

where f(x1, . . . , xN | θ)¹ denotes the likelihood function, viewed as a function of θ for the given data sequence x1, . . . , xN. The ML approach typically leads to practical estimators that can be implemented for complex estimation tasks. Also, it can be shown [67] that the ML estimator is asymptotically optimal, and approaches the performance of the MVU estimator when the data set is large.

¹The likelihood function can also be seen as the PDF of x1, . . . , xN conditioned on the parameter vector θ.

For a generic PDF based on a GMM or an HMM, directly solving (8) or (9) leads to intractable analytical expressions. A more practical and commonly used method is the expectation-maximization (EM) algorithm [31].

Expectation-maximization (EM) algorithm

The expectation-maximization (EM) algorithm originated from the early work of Dempster et al. [31]. The EM algorithm is used in Papers A and B for parameter estimation of an extended HMM model of speech. The EM algorithm applies to a given batch of data, and is suitable for off-line optimization of the model. For other model parameters, e.g., in a noise model, the observation data may not be available off-line, and on-line estimation using a recursive algorithm is necessary. The recursive formulation of the EM algorithm was proposed in [132], and applied to on-line HMM parameter estimation in [72]. The recursive EM algorithm is used in Paper A for estimation of the gain model parameters, and in Paper B for estimation of noise model parameters. Both algorithms are based on the same principle of incomplete data. Herein, an introduction to the theory behind the EM algorithm is given.

The EM algorithm is an iterative algorithm that ensures non-decreasing likelihood scores over the iteration steps. Since the likelihood function is upper-bounded, the algorithm guarantees convergence to a locally optimal solution. The EM algorithm is designed for the estimation scenario where the observation sequence is incomplete or partially missing. The missing observations can be either real or artificial. In either case, the EM algorithm is applicable when the ML estimator leads to intractable analytical formulas but is significantly simplified by assuming the existence of additional unknown values.

For notational convenience, let X_1^N = {x1, . . . , xN} denote the observation data and Z_1^N = {z1, . . . , zN} denote the missing observation data. The complete data is then given by {X_1^N, Z_1^N}, and log f(X_1^N, Z_1^N | θ) is the complete log-likelihood function. Let θ̂^(j) denote the estimated parameter vector from the j'th EM iteration. The EM algorithm consists of two sequential sub-steps within each iteration. The E-step (expectation step) formulates the auxiliary function, denoted Q(θ | θ̂^(j)), which is the expected value of the complete data log-likelihood with respect to the missing data, given the observation data and the parameter vector estimate of the j'th iteration,

Q(θ | θ̂^(j)) = ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) log f(Z_1^N, X_1^N | θ) dZ_1^N. (10)

The maximization step in the EM algorithm gives a new estimate of the parameter vector by maximizing the Q(θ | θ̂^(j)) function of the current step,

θ̂^(j+1) = arg max_θ Q(θ | θ̂^(j)). (11)

It can be noted that the expectation in (10) is evaluated with respect to f(Z_1^N | X_1^N, θ̂^(j)) using the j'th parameter estimate, and the maximization in (11) is over θ in the complete log-likelihood function, log f(Z_1^N, X_1^N | θ). Therefore, it is essential to select Z_1^N such that differentiating (11) with respect to θ and setting the resulting expression to zero leads to tractable solutions for the update equations.

Estimation of both GMM and HMM parameters can be derived within the EM framework. For a GMM, the missing data consists of the component indices from which the observation data is generated. For an HMM, the missing data is the state sequence. For a detailed derivation of the EM algorithm for GMMs and HMMs, we refer to [17]. The EM algorithm and the recursive EM algorithm are extensively used in Papers A and B to estimate the model parameters of extended HMMs of speech and noise.
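For the GMM case, the E-step and M-step take a particularly simple closed form: the E-step computes the posterior probability of each component index (the missing data) per observation, and the M-step re-estimates weights, means, and variances from those posteriors. The following one-dimensional sketch is my own minimal illustration, not the extended models of Papers A and B:

```python
import numpy as np

def em_gmm_1d(x, rho, mu, var, n_iter=50):
    """EM for a 1-D GMM; the missing data are the component indices."""
    for _ in range(n_iter):
        # E-step: posterior probability gamma[i, m] of component m
        # having generated sample x[i].
        px = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = rho * px
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form re-estimation maximizing Q of eq. (11).
        nk = gamma.sum(axis=0)
        rho = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return rho, mu, var

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3, 1, 1000), rng.normal(3, 1, 1000)])
rho, mu, var = em_gmm_1d(data, np.array([0.5, 0.5]),
                         np.array([-1.0, 1.0]), np.array([4.0, 4.0]))
```

On this well-separated two-cluster data, the estimated means converge close to the true values ±3, illustrating the monotone likelihood improvement proven below.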

Convergence of the EM algorithm

The iterative procedure of the EM algorithm ensures convergence towards a locally optimal solution. A proof of convergence is given in [31], and is outlined in this section. We first show that the log-likelihood function (9) over a large data set is upper-bounded. We then show that the EM iterations improve the log-likelihood function in each iteration step.

Assume the existence of a data set consisting of a large number of sequences X_1^N, where each data sequence X_1^N is distributed according to a "true" distribution function f(X_1^N). Note the difference to f(X_1^N | θ), which is a PDF as a function of an unknown parameter vector θ. The log-likelihood function of this data set is

LL = log Π_{X_1^N} f(X_1^N | θ) = Σ_{X_1^N} log f(X_1^N | θ) ≈ ∫_{X_1^N} f(X_1^N) log f(X_1^N | θ) dX_1^N. (12)

Comparing the log-likelihood function with the expectation of the logarithmic transform of the true PDF, ∫_{X_1^N} f(X_1^N) log f(X_1^N) dX_1^N, we get

LL − ∫_{X_1^N} f(X_1^N) log f(X_1^N) dX_1^N = ∫_{X_1^N} f(X_1^N) log [ f(X_1^N | θ) / f(X_1^N) ] dX_1^N
  ≤ ∫_{X_1^N} f(X_1^N) ( f(X_1^N | θ) / f(X_1^N) − 1 ) dX_1^N
  = 1 − 1 = 0, (13)


where the inequality is due to log x ≤ x − 1. Hence, the log-likelihood function is upper bounded by the expectation of the logarithmic transform of the true PDF, ∫_{X_1^N} f(X_1^N) log f(X_1^N) dX_1^N.

The log-likelihood score of the j’th iteration can be written as

log f(X_1^N | θ̂^(j)) = ( log f(X_1^N | θ̂^(j)) ) ( ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) dZ_1^N ) (14)
  = ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) log f(X_1^N | θ̂^(j)) dZ_1^N (15)
  = ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) log [ f(X_1^N, Z_1^N | θ̂^(j)) / f(Z_1^N | X_1^N, θ̂^(j)) ] dZ_1^N, (16)

where the last step uses f(X_1^N, Z_1^N | θ̂^(j)) = f(Z_1^N | X_1^N, θ̂^(j)) f(X_1^N | θ̂^(j)).

The log-likelihood score of the (j + 1)’th iteration can be written as

log f(X_1^N | θ̂^(j+1)) = log ( ∫_{Z_1^N} f(X_1^N, Z_1^N | θ̂^(j+1)) dZ_1^N ) (17)
  = log ( ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) [ f(X_1^N, Z_1^N | θ̂^(j+1)) / f(Z_1^N | X_1^N, θ̂^(j)) ] dZ_1^N ) (18)
  ≥ ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) log [ f(X_1^N, Z_1^N | θ̂^(j+1)) / f(Z_1^N | X_1^N, θ̂^(j)) ] dZ_1^N, (19)

where the inequality is due to Jensen’s inequality.Comparing the log-likelihood score of the (j + 1)’th iteration to the

previous iteration, we get

log f(X_1^N | θ̂^(j+1)) − log f(X_1^N | θ̂^(j))
  ≥ ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) log ( f(X_1^N, Z_1^N | θ̂^(j+1)) ) dZ_1^N
    − ∫_{Z_1^N} f(Z_1^N | X_1^N, θ̂^(j)) log ( f(X_1^N, Z_1^N | θ̂^(j)) ) dZ_1^N (20)
  = Q(θ̂^(j+1) | θ̂^(j)) − Q(θ̂^(j) | θ̂^(j)) (21)
  ≥ 0, (22)

where the last step follows from the definition of the maximization step (11). Since the log-likelihood score is non-decreasing over the iterations and upper bounded, convergence is proven.

2 Speech enhancement for noise reduction

In Papers A and B, we apply the statistical framework to the enhancement of speech in noisy environments. Noise reduction has become an increasingly


important component in speech communication systems. Based on a statistical framework using models of speech and noise, a noise reduction system can be formulated as an estimation problem, in which the clean speech is to be estimated from the noisy observation signal. In estimation theory, the performance of an estimator is commonly measured as the expected value of a distortion function that quantifies the difference between the clean speech signal and the estimated signal. The success of the estimator that minimizes the expected distortion depends largely on the accuracy of the assumed speech and noise models and the perceptual relevance of the distortion measure. Due to the inherent complexity of natural speech and human perception, the present models and distortion measures are only simplified approximations. A large variety of noise reduction methods has been proposed, and their differences are often due to different assumptions on the models and distortion measures.

The models discussed in section 1 can be applied to modeling the statistics of speech and noise. In the design of a noise reduction system, additional assumptions on the noisy signal are made. Depending on the environment, the interfering noise may be additive, e.g., from a separate noise generator, or convolutive, e.g., due to room reverberation or fading. Depending on the number of available microphones, the observation data consist of one or more noisy signals. The corresponding noise reduction methods are divided into single-channel (one microphone) and multi-channel (microphone array) systems. The focus of this thesis is on single-channel noise reduction in an additive noise environment. The noisy speech signal is assumed to be the sum of the speech and noise signals. Furthermore, the additive noise is often considered to be statistically independent of the speech.

A successful noise reduction system may be used as a preprocessor to a speech communication system or an automatic speech recognition system. In fact, several algorithms have been successfully integrated into standardized speech coders for mobile speech communication. Examples of speech coders with a noise reduction module include the Enhanced Variable Rate Codec (EVRC) [1] and the Selectable Mode Vocoder (SMV) [7]. 3GPP (The Third Generation Partnership Project) has standardized minimum performance requirements and an evaluation procedure for the Adaptive Multi-Rate (AMR) codec [6], and a number of noise reduction algorithms that satisfy the requirements have been developed, e.g., [3-5, 8].

The remainder of this section is organized as follows. An overview of single-channel noise reduction is given in section 2.1. The focus is on frequency domain approaches that make use of a short-time spectral attenuation filter. The Bayesian estimation framework is introduced and noise estimation is discussed. Sections 2.2 and 2.3 provide an overview of classical statistical model based methods using short-term models and long-term models, respectively. Sections 2.2 and 2.3 also serve as an introduction to Papers A and B. For a more extensive tutorial on speech enhancement


algorithms for noise reduction, we refer to [37,78,135].

2.1 Overview

Due to the quasi-stationarity of speech, the noisy signal is often processed on a frame-by-frame basis. For additive noise, the noisy speech signal of the n'th frame is modeled as

yn[i] = xn[i] + wn[i], (23)

where i is the time index within the n'th frame, and y, x, w denote noisy speech, clean speech and noise, respectively. In vector notation, the noisy speech, clean speech and noise frames are denoted by yn, xn and wn, respectively. In the following sections, the speech model parameters are marked with an overbar (e.g., λ̄²) and the noise model parameters with double dots (e.g., λ̈²).

Short-time spectral attenuation

It is common to perform estimation in the spectral domain, e.g., through short-time Fourier analysis and synthesis. An analysis window is applied to each time domain signal segment, and the DFT is then applied. For the n'th frame, the frequency domain signal model is given by

Yn[k] = Xn[k] + Wn[k], (24)

where k denotes the index of the frequency band.

For frequency-domain noise reduction, the noisy spectrum is attenuated in the frequency bins that contain noise. The level of attenuation depends on the signal to noise ratio (SNR) of the frequency bin. The attenuation is performed by means of an adaptive spectral attenuation filter, denoted by Hn[k], that is applied to Yn[k] to obtain an estimate X̂n[k] of Xn[k]:

X̂n[k] = Hn[k] Yn[k]. (25)

The inverse DFT is applied to X̂n = [X̂n[1], . . . , X̂n[K]]ᵀ to obtain the enhanced speech signal frame x̂n in the time domain. To avoid discontinuities at frame boundaries, the overlap and add approach is used in synthesis. A schematic diagram of a frequency domain noise reduction method using a short-time Fourier transform is shown in Figure 5.

Spectral subtraction

One of the classical methods for noise reduction is spectral subtraction [18, 92]. The method is based on a direct estimation of the speech spectral magnitude, obtained by subtracting the noise spectral magnitude from the noisy one.

Figure 5: Noise reduction method using a short-time Fourier transform (DFT, noise estimation, noise reduction, inverse DFT).

The attenuation filter for the spectral subtraction method can be formulated as

Hn[k]^SS = ( max(|Yn[k]|^r − |Ŵn[k]|^r, 0) / |Yn[k]|^r )^{1/r}, (26)

where |Ŵn[k]| is an estimate of the noise spectral amplitude, and r = 1 for spectral magnitude subtraction [18] and r = 2 for power spectral subtraction [92]. The power spectral subtraction method is closely related to the Wiener filtering approach, with the only difference being the power term (see section 2.2). For both magnitude and power subtraction, the phase of the noisy spectrum is used unprocessed in the construction of the enhanced speech. One motivation is that the human auditory system is less sensitive to noise in the spectral phase than in the spectral magnitude [139]; therefore, the noise in the spectral phase is less annoying.
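Eq. (26) is straightforward to implement per frame; the following sketch (function name mine, numbers made up) shows the full-attenuation behavior when the noise estimate exceeds the noisy magnitude:

```python
import numpy as np

def spectral_subtraction_gain(Y, W_hat, r=2):
    """Attenuation filter of eq. (26).

    Y     : noisy DFT spectrum of one frame (complex).
    W_hat : estimated noise magnitude spectrum |W_n[k]|.
    r     : 1 for magnitude subtraction, 2 for power subtraction.
    """
    num = np.maximum(np.abs(Y) ** r - W_hat ** r, 0.0)  # flooring at zero
    return (num / np.abs(Y) ** r) ** (1.0 / r)

Y = np.array([4.0 + 0j, 1.0 + 0j])
H = spectral_subtraction_gain(Y, np.array([2.0, 2.0]), r=2)
# Bin 0: sqrt((16 - 4) / 16) = sqrt(0.75); bin 1 is fully attenuated.
```

The hard zero-flooring in the second bin is exactly the mechanism that, applied bin by bin over time, produces the isolated surviving peaks heard as musical noise.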

Another formulation of the power spectral subtraction [92] is based on the complex Gaussian model (3) of the noisy spectrum. Under the Gaussian assumption, and due to the independence assumption, the noisy power spectrum of the k'th frequency bin is λ²n[k] = λ̄²n[k] + λ̈²n[k], where λ̄²n[k] and λ̈²n[k] are the power spectra of speech and noise, respectively. The PDF of Yn[k] is given by

f(Yn[k]) = 1 / ( π (λ̄²n[k] + λ̈²n[k]) ) exp( − |Yn[k]|² / (λ̄²n[k] + λ̈²n[k]) ). (27)

Assuming that an estimate of the noise power spectrum, ˆλ̈²n[k], exists, the maximum likelihood (ML) estimate of the speech power spectrum maximizes f(Yn[k]) with respect to λ̄²n[k]. Using the additional constraint that the power spectrum is nonnegative, the ML estimate of λ̄²n[k] is given by

ˆλ̄²n[k] = max(|Yn[k]|² − ˆλ̈²n[k], 0). (28)


The implementation of the spectral subtraction method is straightforward. However, the processed speech is plagued by the so-called musical noise phenomenon. Musical noise is residual noise consisting of isolated tones located randomly in time and frequency. These tones correspond to spectral peaks in the original noisy signal that are left over when their surrounding spectral bins in the time-frequency plane have been heavily attenuated. Musical noise is perceptually annoying, and several methods targeting the musical noise problem have been derived from the spectral subtraction method [48, 83, 133, 136].

Bayesian estimation approach

Spectral subtraction was originally an empirically motivated method, and is often used in practice because of its simplicity. Alternatively, speech enhancement for noise reduction can be formulated in a statistical framework, such that prior knowledge of speech and noise is incorporated through prior PDFs. Using, in addition, a relevant cost function, speech can be estimated in the Bayesian estimation framework. The following sections, 2.2 and 2.3, are devoted to methods using the Bayesian estimation framework. In this subsection, a short introduction to Bayesian estimation is given.

At a conceptual level, assuming that speech and noise are random vectors distributed according to the PDFs f(x) and f(w), the joint PDF of x and y can be formulated using the signal model (23). A Bayesian speech estimator minimizes the Bayes risk, which is the expected cost with respect to both x and y, defined as

Bayes risk = ∫∫ C(x, x̂) f(x, y) dx dy, (29)

where C(x, x̂) is a cost function between x and x̂.

A commonly used cost function is the squared error function,

C_MMSE(x, x̂) = ||x − x̂||², (30)

and the optimal estimator minimizing the corresponding Bayes risk is the minimum mean square error (MMSE) estimator. The popularity of MMSE based estimation is due to its mathematical tractability and good performance. It can be shown, e.g., [67], that the MMSE speech estimator is the conditional expectation of the speech given the noisy speech,

x̂ = ∫ x f(x | y) dx. (31)

Noise estimation

One key component of a single-channel noise reduction system is the estimation of the noise statistics. Often, a speech estimator is designed assuming the


actual noise power spectrum is available, and its optimality is no longer ensured when the actual noise power spectrum is replaced by an estimate of the noise spectrum. Therefore, the performance of the speech estimator depends largely on the quality of the noise estimation algorithm.

A large number of noise estimation algorithms have been proposed. They are typically based on the assumptions that 1) speech typically does not occupy the entire time-frequency plane, and 2) noise is "more stationary" than speech. By assuming that noise is stationary over a longer (than a frame) period of time, e.g., up to several seconds, the noise power spectrum can be estimated by averaging the noisy power spectra in the frequency bands with low or zero speech activity. The update of the noise estimate is commonly controlled through a voice-activity detector (VAD), e.g., [1, 92], a speech presence probability estimate [24, 87, 120], or order statistics [59, 88, 125]. For instance, the minimum statistics method [88] is based on the observation that, due to speech absence in some frequency bins, the noisy power spectrum frequently approaches the noise power spectrum level. To obtain the minimum statistics, a buffer is used to store past noisy power spectra of each frequency band. Bias compensation taking into account the fluctuations of the power spectrum estimate is proposed in [88]. In such a method, the size of the buffer is a trade-off. If the buffer is too small, it may not contain any speech-inactive frames, and speech may be falsely detected as noise. If the buffer is too large, it may not react fast enough to non-stationary noise types. In [88], the buffer length is set to 1.5 seconds.
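The buffer-and-minimum idea can be sketched as follows. This is a deliberately simplified toy version of the minimum statistics principle of [88], not the published algorithm: the smoothing constant and buffer length are arbitrary, and the data-dependent bias compensation of [88] is reduced to a single fixed factor.

```python
import numpy as np
from collections import deque

class MinimumStatisticsNoise:
    """Toy minimum-statistics noise tracker: smooth the noisy
    periodograms, buffer them, and take the per-bin minimum."""

    def __init__(self, n_bins, buf_frames=75, smooth=0.8, bias=1.5):
        # 75 frames of 20 ms correspond to the 1.5 s buffer of [88].
        self.buf = deque(maxlen=buf_frames)
        self.p = np.zeros(n_bins)      # recursively smoothed periodogram
        self.smooth, self.bias = smooth, bias

    def update(self, periodogram):
        self.p = self.smooth * self.p + (1 - self.smooth) * periodogram
        self.buf.append(self.p.copy())
        # Minimum over the buffer, with a crude fixed bias compensation.
        return self.bias * np.min(np.stack(self.buf), axis=0)

rng = np.random.default_rng(2)
est = MinimumStatisticsNoise(n_bins=4)
# Noise-only input: periodogram bins of complex Gaussian noise with
# unit power are exponentially distributed with mean 1.
for _ in range(200):
    noise_psd = est.update(rng.exponential(1.0, size=4))
```

Because the minimum of smoothed periodograms systematically underestimates the mean noise power, some bias compensation is always needed; [88] derives it analytically rather than using a fixed constant.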

2.2 Short-term model based approach

This section provides an overview of some Bayesian speech estimators using short-term models as discussed in section 1.1. In particular, we discuss Wiener filtering and spectral amplitude estimation methods.

Wiener filtering

Wiener filtering [141] is a classical signal processing technique that has been applied to speech enhancement. Wiener filters are the optimal linear time-invariant filters under the minimum mean square error (MMSE) criterion, and can be classified as Bayesian estimators under the constraint of linear estimation. The Wiener filter is also the optimal MMSE estimator under a Gaussian assumption.

Assuming linear estimation, the speech estimate in the time domain is of the form

x̂n = Hn yn, (32)


for a linear estimation matrix Hn. The MMSE linear estimator fulfills

Hn^WF = arg min_H ∫∫ ||xn − H yn||² f(xn, yn) dxn dyn. (33)

The optimal estimator can be found by taking the first derivative of (33) with respect to H and setting the obtained expression to zero. Using also the assumption that speech and noise both have zero mean and are statistically independent, the optimal linear estimator is

Hn^WF = D̄n (D̄n + D̈n)^{−1}, (34)

where D̄n and D̈n are the covariance matrices of speech and noise, respectively, of frame n.

An alternative point of view on Wiener filtering is through MMSE estimation under a Gaussian assumption. Assuming that speech and noise have zero-mean Gaussian distributions with covariance matrices D̄n and D̈n, it can be shown [67] that the conditional PDF of xn given yn is also Gaussian, with mean D̄n(D̄n + D̈n)^{−1} yn. Hence, the MMSE estimator for a Gaussian model is linear, and leads to the same result as the linear MMSE estimator for an arbitrary PDF (34).

Due to the correspondence between convolution in the time domain and multiplication in the frequency domain, the Wiener filter in the frequency domain formulation has the simpler form of (25) with

Hn[k]^WF = λ̄²n[k] / ( λ̄²n[k] + λ̈²n[k] ), (35)

for the k’th frequency bin.Practical implementation of the Wiener filter requires estimation of the

second-order statistics (the power spectra) of speech and noise: λ2n[k] and

λ2n[k]. The noise spectrum may be obtained through a noise estimation

algorithm (section 2.1) and the speech spectrum may be estimated using(28). Hence, the actual implementation of the Wiener filtering is closelyrelated to the power spectral subtraction method, and also suffers from themusical noise issue.
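Eq. (35) reduces to a one-line per-bin gain; the sketch below (function name mine, numbers made up) shows its behavior at a few per-bin SNRs:

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Frequency-domain Wiener filter of eq. (35), per bin."""
    return speech_psd / (speech_psd + noise_psd)

# At 0 dB per-bin SNR the gain is 0.5; it approaches 1 as the bin
# becomes speech dominated and 0 as it becomes noise dominated.
g = wiener_gain(np.array([1.0, 10.0, 0.0]), np.array([1.0, 1.0, 1.0]))
```

Plugging the ML speech power estimate (28) into the numerator makes the connection to power spectral subtraction explicit: bins whose noisy power falls below the noise estimate are driven to zero gain, exactly as in eq. (26) with r = 2.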

Speech spectral amplitude estimation

Another well-known and widely applied MMSE estimator for speech enhancement is the MMSE short-time spectral amplitude (STSA) estimator by Ephraim and Malah [38]. The STSA method does not suffer from musical noise in the enhanced speech, and is therefore interesting also from a perceptual point of view.

From the MMSE estimation result (31), the STSA estimator is formulated as the expected value of the speech spectral amplitude conditioned on the noisy spectrum. The STSA estimator depends on two parameters, the a-posteriori SNR R^post_n[k] and the a-priori SNR R^prio_n[k], which need to be evaluated for each frequency bin. The a-posteriori SNR is defined as the ratio of the noisy periodogram and the noise power spectrum estimate²,

R^post_n[k] = |Yn[k]|² / ˆλ̈²n[k], (36)

and the a-priori SNR is defined as the ratio of the speech and noise power spectra [38, 92],

R^prio_n[k] = λ̄²n[k] / λ̈²n[k]. (37)

The optimal attenuation filter for the STSA estimator is shown to be [38]

Hn[k] = (√π / 2) √( (1 / R^post_n[k]) · (R^prio_n[k] / (1 + R^prio_n[k])) ) M( R^post_n[k] · R^prio_n[k] / (1 + R^prio_n[k]) ), (38)

where M is the function

M(θ) = exp(−θ/2) [ (1 + θ) I₀(θ/2) + θ I₁(θ/2) ], (39)

where I₀ and I₁ are the zeroth- and first-order modified Bessel functions, respectively.
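Eqs. (38)-(39) can be sketched directly using SciPy's modified Bessel functions. The function name is mine, and the gain is written here as (√π/2)·(√v/R^post)·M(v) with v = R^post·R^prio/(1 + R^prio), which is algebraically the same as eq. (38):

```python
import numpy as np
from scipy.special import i0, i1  # modified Bessel functions I0, I1

def stsa_gain(r_post, r_prio):
    """STSA attenuation filter of eqs. (38)-(39), scalar or array SNRs."""
    v = r_post * r_prio / (1.0 + r_prio)   # argument of M in eq. (38)
    M = np.exp(-v / 2.0) * ((1.0 + v) * i0(v / 2.0) + v * i1(v / 2.0))
    return (np.sqrt(np.pi) / 2.0) * np.sqrt(v) / r_post * M

# At high SNR the STSA gain approaches the Wiener gain
# R^prio / (1 + R^prio); at moderate SNR it attenuates less aggressively.
g_high = stsa_gain(101.0, 100.0)
g_mid = stsa_gain(1.0, 1.0)
```

The high-SNR limit is a useful numerical check: for R^prio = 100 the gain is close to the Wiener value 100/101.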

In practice, the a-priori SNR is not available and needs to be estimated. Ephraim and Malah proposed a solution based on the decision-directed approach, in which the a-priori SNR is estimated as [38]

R̂^prio_n[k] = α |Hn−1[k] Yn−1[k]|² / ˆλ̈²n[k] + (1 − α) max(R^post_n[k] − 1, 0), (40)

where 0.9 ≤ α < 1 is a tuning parameter.

The STSA method has demonstrated superior perceptual quality compared to classic methods such as spectral subtraction and Wiener filtering. In particular, the enhanced speech does not suffer from the musical noise phenomenon. It has been argued [22] that the perceptual relevance of the method is mainly due to the decision-directed approach for estimating

²This definition differs from the traditional definition of signal-to-noise ratio by including both speech and noise in the numerator.


the a-priori SNR. The decision-directed approach has an attractive temporal smoothing behavior on the SNR estimate. It was shown [22] that for low R^post, i.e., in the absence of speech, R^prio is an averaged SNR over a large number of successive frames; for high R^post, R^prio follows R^post with a one-frame delay. This smoothing behavior has a significant effect on the perceptual quality of the enhanced speech, and the same strategy is applicable to other enhancement methods, such as Wiener filtering and spectral subtraction, to reduce the musical noise in the enhanced speech.
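The decision-directed update of eq. (40) is a one-line recursion per frequency bin; the sketch below (function name mine, α fixed to an in-range example value) shows the smoothing at a speech onset, where the previous clean-speech estimate is zero:

```python
import numpy as np

def decision_directed_prio(Y, noise_psd, gain_prev, Y_prev, alpha=0.98):
    """A-priori SNR estimate of eq. (40), per frequency bin.

    alpha must satisfy 0.9 <= alpha < 1, as stated above.
    """
    r_post = np.abs(Y) ** 2 / noise_psd               # eq. (36)
    prev_amp2 = np.abs(gain_prev * Y_prev) ** 2       # |X_hat_{n-1}[k]|^2
    return (alpha * prev_amp2 / noise_psd
            + (1.0 - alpha) * np.maximum(r_post - 1.0, 0.0))

# Frame after silence: the previous estimate is zero, so the estimate
# reacts only through the heavily down-weighted (1 - alpha) term.
prio = decision_directed_prio(np.array([3.0 + 0j]), np.ones(1),
                              np.zeros(1), np.zeros(1))
```

Here R^post = 9, yet R̂^prio is only 0.02·(9 − 1) = 0.16, which illustrates the strong smoothing at low-energy transitions described above.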

Research challenges

The methods we have discussed in section 2.2 require an estimate of the noise power spectrum. Most noise estimation algorithms are based on the assumption that the noise is stationary or mildly non-stationary. Modern mobile devices are required to operate in acoustical environments with large diversity and variability in interfering noise. The non-stationarity may be due to the noise source itself, i.e., the noise contains impulsive or transient components or periodic patterns, or due to the movement of the noise source and/or the recording device. The latter scenario occurs commonly on a street, where noise sources (e.g., cars) move rapidly around.

Due to the increased degree of non-stationarity in such an environment, the performance of most noise estimation algorithms is non-ideal and may significantly affect the performance of the speech estimation algorithm. The design of a noise reduction system that is capable of dealing with non-stationary noise is of great interest and a challenging task for the research community.

From the Bayesian estimation point of view, the performance of the speech estimator is determined by the accuracy of the speech and noise models and the perceptual relevance of the distortion measure. Unfortunately, the true statistics of speech and noise are not explicitly available. Moreover, speech quality is a subjective measure that requires human listening experiments and is expensive to assess. Objective measures that approximate the "true" perceptual quality have been proposed, e.g., for the evaluation of speech codecs [2, 14, 49, 50, 104, 110, 137, 138, 140], but most of them are complex and mathematically intractable for deriving a speech estimator. Nevertheless, a number of algorithms based on distortion measures that capture some properties of human perception have been proposed [57, 62, 79, 133, 136, 143]. For instance, the speech estimator in Paper A optimizes a criterion that allows for an adjustable level of residual noise in the enhanced speech, in order to avoid unnecessary speech artifacts.

Incorporating a higher degree of prior information about speech and noise can also improve the performance of the speech estimator. With short-term models, the assumed prior information is mainly the type of PDF, of which the parameters of both the speech and noise PDFs (their respective power spectra for Gaussian PDFs) need to be estimated. To incorporate stronger prior information about speech and noise, more refined statistical models should be used. For instance, a more precise PDF of speech may be obtained by using a speech database and optimizing a long-term model through a training algorithm. In the next section, we discuss noise reduction methods using long-term models such as an HMM.

2.3 Long-term model based approach

This section provides an overview of noise reduction methods using the long-term models from section 1.2. Long-term model based noise reduction systems include HMM based methods [35, 111], GMM based methods [32, 149] and codebook based methods [73, 121]. In this section, we focus mainly on methods based on hidden Markov models and their derivatives.

HMM-based speech estimation

Due to the inter-frame dependency of an HMM, the HMM based MMSE estimator (31) of speech can be formulated using the current and past frames of the noisy speech,

$$\hat{x}_n = \int x_n\, f(x_n \mid Y_1^n)\, dx_n, \qquad (41)$$

where $Y_1^n = \{y_1, \ldots, y_n\}$ denotes the noisy observation sequence from frame 1 to $n$. Due to the Markov chain assumption on the underlying state sequence, both the current and past noisy frames are utilized in the speech estimator of the current frame.

Assuming that both speech and noise are modeled using an HMM with Gaussian sub-models, it can be shown that the noisy model is a composite HMM with Gaussian sub-models, due to the statistical independence of speech and noise. If the speech model has $\bar{M}$ states and the noise model $\tilde{M}$ states, the composite HMM has $\bar{M}\tilde{M}$ states, one for each combination of a speech state and a noise state. Under the Gaussian assumption, the noisy PDF of a given composite state $s$ is a Gaussian PDF with zero mean and covariance matrix $D_s = \bar{D}_s + \tilde{D}_s$, where $\bar{D}_s$ and $\tilde{D}_s$ denote the speech and noise covariance matrices of the state. The conditional PDF of speech given the noisy observation sequence can be formulated as

$$f(x_n \mid Y_1^n) = \sum_{s_n} f(s_n, x_n \mid Y_1^n) = \sum_{s_n} f(s_n \mid Y_1^n)\, f(x_n \mid Y_1^n, s_n) = \sum_{s_n} f(s_n \mid Y_1^n)\, f(x_n \mid y_n, s_n), \qquad (42)$$

where the last step is due to the conditional independence property of an HMM, which states that the observation for a given state is independent of the other states and observations. Hence, the MMSE speech estimator using an HMM (41) can be formulated as

$$\hat{x}_n = \sum_{s_n} f(s_n \mid Y_1^n) \int x_n\, f(x_n \mid y_n, s_n)\, dx_n = \sum_{s_n} f(s_n \mid Y_1^n)\; \bar{D}_{s_n}\big(\bar{D}_{s_n} + \tilde{D}_{s_n}\big)^{-1} y_n, \qquad (43)$$

and each of the per-state conditional expectations is a Wiener filter (34). The HMM based MMSE estimator is then a weighted sum of Wiener filters, where the weighting factors are the posterior state probabilities given the noisy observations. The weighting factors can be calculated efficiently using the forward algorithm [40, 105].
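The weighted-sum-of-Wiener-filters structure of (43), with forward-algorithm posteriors, can be sketched as follows. This is a toy illustration assuming diagonal (power-spectrum) covariances and a simplified zero-mean Gaussian observation model; the function name and interface are hypothetical, not the thesis implementation.

```python
import numpy as np

def hmm_mmse_enhance(noisy_frames, A, pi, speech_psd, noise_psd):
    """HMM based MMSE estimate (43): a weighted sum of per-state Wiener
    filters, weighted by forward-algorithm state posteriors.

    noisy_frames: (T, K) noisy spectral frames y_n.
    A: (S, S) state transition matrix; pi: (S,) initial probabilities.
    speech_psd, noise_psd: (S, K) per-state speech/noise power spectra.
    """
    T, K = noisy_frames.shape
    noisy_psd = speech_psd + noise_psd            # composite-state (diagonal) covariance
    estimates = np.zeros_like(noisy_frames)
    alpha = pi.copy()                             # forward probabilities
    for n in range(T):
        # State-conditional log-likelihood of y_n under a zero-mean Gaussian
        # with diagonal covariance noisy_psd[s] (log-domain for stability).
        loglik = -0.5 * np.sum(np.log(2 * np.pi * noisy_psd)
                               + noisy_frames[n] ** 2 / noisy_psd, axis=1)
        if n > 0:
            alpha = alpha @ A                     # propagate through transitions
        alpha = alpha * np.exp(loglik - loglik.max())
        alpha /= alpha.sum()                      # posterior f(s_n | Y_1^n)
        wiener = speech_psd / noisy_psd           # per-state Wiener gains (34)
        estimates[n] = (alpha[:, None] * wiener * noisy_frames[n]).sum(axis=0)
    return estimates
```

With a single state the estimator reduces to the plain Wiener filter, which is a convenient sanity check.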

A significant difference between the HMM based estimator and the Wiener filter (34) is that the Wiener filter requires on-line estimation of both speech and noise statistics, while the per-state estimators of the HMM approach use speech model parameters obtained through off-line training (see section 1.2). Therefore, additional speech knowledge is incorporated in an HMM based method.

Noise reduction using an HMM was originally proposed by Ephraim in [35, 36], using an HMM with auto-regressive (AR) Gaussian sub-sources for speech modeling. The noise statistics are modeled using a single Gaussian PDF, and the noise model parameters are obtained from the first few frames of the noisy speech, which are considered to be noise-only. Alternatively, an additional noise estimation algorithm may be used to obtain the noise statistics. In [111], the HMM based methods are extended to allow HMM based noise models. The method of [111] uses several noise models, obtained off-line for different noise environments, and a classifier is used to determine which noise model must be used. By utilizing off-line trained models for both speech and noise, strong prior information about speech and noise is incorporated. By having more than one noise state, each representing a distinct noise spectrum, the method can handle rapid changes of the noise spectrum within a specific noise environment, and can provide significant improvement for non-stationary noise environments.

Stochastic gain modeling

A key issue of using long-term speech/noise models obtained through off-line training is the energy mismatch between the model and the observations. Since the model is obtained off-line using training data, the energy level of the training data may differ from that of the speech in the noisy observation, causing a mismatch. Hence, strategies for gain adaptation are needed when long-term models are used.


When using an AR-HMM for speech modeling [35], the signal is modeled as an AR process in each state. The excitation variance of the AR model is a parameter that is obtained from the training data. Consequently, the different energy levels of a speech sound, typically due to differences in pronunciation and/or different vocalizations of individual speakers, are not efficiently modeled. In fact, an AR-HMM implicitly assumes that speech consists of a finite number of frame energy levels. Another energy mismatch between the speech model and the observation is due to different recording conditions during off-line training and on-line estimation, and may be considered a global energy mismatch. In the MMSE speech estimator of [35], a global gain factor is used to reduce this mismatch.

A similar problem appears in noise modeling, since the noise energy level in the noisy environment is inherently unknown, time-varying, and in most natural cases different from the noise energy level in the training. In [111], a heuristic noise gain adaptation using a voice activity detector has been proposed, where the adaptation is performed in speech pauses longer than 100 ms. In [146], continuous noise gain model estimation techniques were proposed.

In Paper A, we propose a new HMM based gain-modeling technique that extends the HMM based methods [35, 111] by introducing explicit modeling of both speech and noise gains in a unified framework. The probability density function of $x_n$ for a given state $s$ is the integral over all possible speech gains. Modeling the speech energy in the logarithmic domain, we then have

$$f_s(x_n) = \int_{-\infty}^{\infty} f_s(g'_n)\, f_s(x_n \mid g'_n)\, dg'_n, \qquad (44)$$

where $g'_n = \log g_n$ and $g_n$ denotes the speech gain in the linear domain.

The extension of Paper A over the traditional AR-HMM is the stochastic modeling of the speech gain $g_n$, where $g_n$ is considered a stochastic process. The PDF of $g_n$ is modeled using a state-dependent log-normal distribution, motivated by the simplicity of the Gaussian PDF and the appropriateness of the logarithmic scale for sound pressure level. In the logarithmic domain, we have

$$f_s(g'_n) = \frac{1}{\sqrt{2\pi\sigma_s^2}} \exp\!\left(-\frac{1}{2\sigma_s^2}\big(g'_n - \mu_s - q_n\big)^2\right), \qquad (45)$$

with mean $\mu_s + q_n$ and variance $\sigma_s^2$. The time-varying parameter $q_n$ denotes the speech-gain bias, which is a global parameter compensating for the long-term energy level of speech, e.g., due to a change of the physical location of the recording device. The parameters $\{\mu_s, \sigma_s^2\}$ are modeled as time-invariant, and can be obtained off-line using training data, together with the other speech HMM parameters.
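The state-dependent log-gain density (45) is a plain Gaussian in the log domain and is simple to evaluate; the sketch below assumes scalar parameters and illustrative names.

```python
import numpy as np

def log_gain_pdf(g_log, mu_s, sigma2_s, q_n=0.0):
    """State-dependent log-normal gain density (45): the log-gain g'_n is
    Gaussian with mean mu_s + q_n and variance sigma2_s. The bias q_n
    compensates for the long-term speech energy level."""
    return (np.exp(-(g_log - mu_s - q_n) ** 2 / (2.0 * sigma2_s))
            / np.sqrt(2.0 * np.pi * sigma2_s))
```

Evaluated on a grid, the density integrates to one and peaks at $\mu_s + q_n$, so shifting $q_n$ shifts the whole gain distribution, which is exactly the global energy compensation described above.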


The model proposed in Paper A is referred to as a stochastic-gain HMM (SG-HMM), and may be applied to both speech and noise. To obtain the time-invariant parameters of the model, an off-line training algorithm is proposed based on the EM technique. For time-varying parameters, such as $q_n$, an on-line estimation algorithm is proposed based on the recursive EM technique. In Paper A, we demonstrate through objective and subjective evaluations that the new model improves the modeling of speech and noise, and also the noise reduction performance.

The proposed HMM based gain modeling is related to the codebook based methods [73, 74, 122, 124]. In the codebook methods, the prior speech and noise models are represented as codebooks of spectral shapes, represented by linear prediction (LP) coefficients. For each frame, and for each combination of speech and noise code vectors, the excitation variances are estimated instantaneously. In [73, 123], the most likely combination of code vectors is used to form the Wiener filter. In [74, 124], a weighted sum of Wiener filters is proposed, where the weighting is determined by the posterior probabilities, similarly to (43). The Bayesian formulation in [124] is based on the assumption that both the code vectors and the excitation variances are uniformly distributed.

Adaptation of the noise HMM

It is widely accepted that the characteristics of speech can be learned from a (sufficiently rich) database of speech. A successful example is the speech coding application, where, e.g., the linear prediction coefficients (LPC) can be represented with transparent quality using a finite number of bits [97]. However, noise may vary to a large extent in real-world situations. It is unrealistic to assume that one can capture all of this variation in an initial learning stage, and this implies that on-line adaptation to changing noise characteristics is necessary.

Several methods have been proposed to allow noise model adaptation based on different assumptions, e.g., [149] for a GMM and [84, 93] for an AR-HMM. The algorithms in [84, 149] only apply to a batch of noisy data, e.g., one utterance, and are not directly applicable to on-line estimation. The noise model in [149] is limited to stationary Gaussian noise (white or colored). In [84, 93], a noise HMM is estimated from the observed noisy speech using a voice-activity detector (VAD), such that only noise-only frames are used in the on-line learning.

In Paper B, we propose an adaptive noise model based on the SG-HMM. The results from Paper A demonstrate good performance for modeling speech and noise. The work is extended in Paper B with an on-line noise estimation algorithm for enhancement in unknown noise environments. The model parameters are estimated on-line from the noisy observations without requiring a VAD. Therefore, the on-line learning of the noise model is continuous and facilitates adaptation and correction to changing noise characteristics. Estimation of the noise model parameters is based on maximizing the likelihood of the noisy model, and the proposed implementation is based on the recursive expectation maximization (EM) framework [72, 132].

3 Flexible source coding

The second part of this thesis focuses on flexible source coding for unreliable networks. In Paper C, we propose a quantizer design based on a Gaussian mixture model for describing the source statistics. The proposed quantizer design allows for rate adaptation in real time, depending on the current network condition. This section provides an overview of the theory and practice that form the basis for the quantizer design.

The section is organized as follows. Basic definitions and design criteria are formulated in section 3.1. High-rate theory is introduced in section 3.2. In section 3.3, some practical coder designs motivated by high-rate theory are discussed.

3.1 Source coding

Information theory, founded by Shannon in his legendary 1948 paper [116], forms the theoretical foundation for modern digital communication systems. As shown in [116], the communication chain can roughly be divided into two separate parts: source coding for data compression and channel coding for data protection. The optimal performance of source and channel coding can be predicted using the source and channel coding theorems [116]. An implication of the theorems is that optimal transmission of a source may be achieved by separately performing source coding followed by channel coding, a result commonly referred to as the source-channel separation theorem. The source and channel coding theorems motivate the design of essentially all modern communication systems. The optimality of separate source and channel coding is based on the assumption of coding in infinite dimensions, and therefore requires an infinite delay. While optimality is not guaranteed in any practical coder with finite delay, the design is used in practical applications due to its simplicity.

Source coding consists of lossless coding [63, 75-77, 109, 142], which achieves compression without introducing errors, and lossy coding [16, 117], which achieves a higher compression level but introduces distortion. Modern source coders utilize lossy coding and remove both irrelevancy and redundancy. They often incorporate lossless coding methods to remove redundancy from parameters that require perfect reconstruction, such as quantization indices. Such a system typically consists of a quantizer (lossy coding) followed by an entropy coder (lossless coding).

Figure 6: Rate-distortion function for a unit-variance Gaussian source and the mean square error distortion measure.

The rate-distortion performance of the optimal coder design can be predicted by rate-distortion theory [116]. The theory provides a theoretical bound on the achievable rate without exceeding a given distortion. The rate-distortion bound demonstrates an interesting and intuitive behavior, namely that the achievable rate is a monotonically decreasing convex function of the distortion. An example of a rate-distortion function for a unit-variance Gaussian source is shown in Figure 6. While the theory is derived under the assumption of infinite dimensions, it is expected that the behavior is also relevant in lower dimensions. A flexible source coder design should allow adjustment of the rate-distortion trade-off, e.g., multiple operating points on the rate-distortion curve, such that the coder can operate optimally over a large range of rates. For an unreliable network, it is then possible to adapt the rate of the coder in real time to the current network condition. By operating at a rate adapted to the network condition, the coder performs with the best quality (lowest distortion) for the available bandwidth.
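For a Gaussian source under MSE, the curve of Figure 6 has the closed form $R(D) = \max(0, \tfrac{1}{2}\log_2(\sigma^2/D))$; the helper below is our own illustrative sketch of that function, not code from the thesis.

```python
import numpy as np

def gaussian_rate(distortion, variance=1.0):
    """Rate-distortion function of a memoryless Gaussian source under MSE:
    R(D) = max(0, 0.5*log2(variance/D)) bits per sample -- the convex,
    monotonically decreasing curve shown in Figure 6."""
    d = np.asarray(distortion, dtype=float)
    return np.maximum(0.0, 0.5 * np.log2(variance / d))
```

For example, allowing a distortion of one quarter of the variance costs exactly 1 bit, and at $D = \sigma^2$ no bits are needed (simply reconstruct the mean).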

Quantization

Let $x$ denote a source vector in the $K$-dimensional Euclidean space $\mathbb{R}^K$, distributed according to the PDF $f(x)$. Quantization is a function

$$\hat{x} = Q(x) \qquad (46)$$

that maps each source vector $x \in \mathbb{R}^K$ to a code vector $\hat{x}$ from a codebook, which consists of a countable set of vectors in $\mathbb{R}^K$. Each code vector is associated with a quantization cell (decision region), $\text{cell}(\hat{x})$, defined as

$$\text{cell}(\hat{x}) = \{x \in \mathbb{R}^K : Q(x) = \hat{x}\}. \qquad (47)$$


An index is assigned to each code vector in the codebook, and the index is coded with a codeword, i.e., a sequence of bits. Each quantization index corresponds to a particular codeword, and all codewords together form a uniquely decodable code, which implies that any encoded sequence of indices can be uniquely decoded. For $K = 1$, the quantizer applies to scalar values and is called a scalar quantizer. The more general quantizer with $K \geq 1$ is called a vector quantizer.
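A minimal nearest-neighbor quantizer in the sense of (46)-(47) can be sketched as follows; this is the MSE-optimal encoder for a given codebook, with illustrative names only.

```python
import numpy as np

def quantize(x, codebook):
    """Nearest-neighbor quantizer Q(x) of (46): map x in R^K to the closest
    code vector. The induced decision regions are the cells of (47)."""
    idx = int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))
    return idx, codebook[idx]
```

The returned index is what the entropy coder operates on; the code vector is the decoder's reconstruction.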

Rate

Rate can be defined as the average codeword length per source vector. Since the available bandwidth is limited, the quantizer is optimized under a rate constraint. The codewords may be constrained to have a particular fixed length, resulting in a fixed-rate coder, or constrained to have a particular average length, resulting in a variable-rate coder.

For a fixed-rate coder with $N$ code vectors in the codebook, the rate is given by

$$R_{\text{fix}} = \lceil \log_2 N \rceil, \qquad (48)$$

where $\lceil\cdot\rceil$ denotes the ceiling function (the smallest integer equal to or larger than the input value). To achieve efficient quantization, $N$ is typically chosen to be an integer power of 2, to avoid rounding.

For a variable-rate coder, code vectors are coded using codewords with different lengths. The rate is then the average codeword length,

$$R_{\text{var}} = \sum_{\hat{x}} p(\hat{x})\,\ell(\hat{x}), \qquad (49)$$

where $\ell(\hat{x})$ denotes the codeword length of $\hat{x}$ and $p(\hat{x}) = \int_{\text{cell}(\hat{x})} f(x)\,dx$ is the probability mass function of $\hat{x}$, obtained by integrating $f(x)$ over the quantization cell of $\hat{x}$. From the source coding theorem [116], the optimal variable-rate code has an average rate approaching the entropy of $\hat{x}$, denoted $H$,

$$H = -\sum_{\hat{x}} p(\hat{x}) \log_2 p(\hat{x}). \qquad (50)$$

Using an entropy code, such as the arithmetic code [75-77, 109, 142], this rate can be approached arbitrarily closely by increasing the sequence length.
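The two rate definitions (48) and (50) can be computed directly; the helpers below are illustrative sketches. For the PMF $(0.5, 0.25, 0.25)$, for instance, the entropy is 1.5 bits while a fixed-rate code for the same three symbols needs $\lceil\log_2 3\rceil = 2$ bits, illustrating the variable-rate advantage.

```python
import numpy as np

def fixed_rate(n_codevectors):
    """Fixed-rate codeword length (48): ceil(log2 N) bits per vector."""
    return int(np.ceil(np.log2(n_codevectors)))

def entropy_bits(pmf):
    """Entropy (50): the average rate an ideal variable-rate (entropy)
    code approaches, in bits per vector."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # terms with p = 0 contribute nothing
    return float(-np.sum(p * np.log2(p)))
```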

Distortion

The performance of a quantizer is measured by the difference, or error, between the source vector and the quantized code vector. This difference defines an error function, denoted $d(x, \hat{x})$, that quantifies the distortion due to the quantization of $x$. The performance of the system is given by the average distortion, defined as the expected error with respect to the input variable,

$$D = \int d(x, \hat{x})\, f(x)\, dx. \qquad (51)$$

The error function should be meaningful with respect to the source, and it typically fulfills some basic properties, e.g., [53]

• $d(x, \hat{x}) \geq 0$

• if $x = \hat{x}$, then $d(x, \hat{x}) = 0$.

The most commonly used error function is the square error function,

$$d(x, \hat{x}) = \frac{1}{K}\,\|x - \hat{x}\|^2 = \frac{1}{K}\,(x - \hat{x})^T(x - \hat{x}), \qquad (52)$$

where $\|\cdot\|$ denotes the $L_2$ vector norm, and $(\cdot)^T$ denotes the transpose. The mean distortion of (52) is referred to as the mean square error (MSE) distortion. The MSE distortion is commonly used due to its mathematical tractability.

Other distortion measures may be conceptually or computationally more complicated, e.g., by incorporating perceptual properties of the human auditory system. If a complex error function can be assumed to be continuous with continuous derivatives, it can be approximated under a small-error assumption. By applying a Taylor series expansion of $d(x, \hat{x})$ around $\hat{x} = x$, and noting that the first two expansion terms vanish at $\hat{x} = x$, the error function may be approximated as [44]

$$d(x, \hat{x}) \approx \frac{1}{2K}\,(x - \hat{x})^T\, W(x)\,(x - \hat{x}), \qquad (53)$$

where $W(x)$ is a $K \times K$ dimensional matrix, the so-called sensitivity matrix, with the $(i,j)$th element defined by

$$W(x)[i,j] = \left.\frac{\partial^2 d(x, \hat{x})}{\partial \hat{x}[i]\,\partial \hat{x}[j]}\right|_{\hat{x}=x}, \qquad (54)$$

the matrix of second-order partial derivatives of the error function, evaluated at the expansion point $\hat{x} = x$. In [44], the sensitivity matrix approach was applied to quantization of speech linear prediction coefficients (LPC) using, e.g., the log-spectral distortion (LSD) measure. In [101, 102], a sensitivity matrix was derived and analyzed for a perceptual distortion measure based on a sophisticated spectral-temporal auditory model [29, 30].
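The quadratic approximation (53)-(54) is easy to verify numerically. The sketch below uses a scalar ($K = 1$) log-domain error as a stand-in for measures like LSD; the closed-form sensitivity $2/x^2$ is our own derivation for this toy error function, not a result from the thesis.

```python
import numpy as np

def log_error(x, xq):
    """A scalar log-domain error, d(x, x^) = (ln x - ln x^)^2 -- a toy
    stand-in for perceptually motivated measures such as LSD."""
    return (np.log(x) - np.log(xq)) ** 2

def sensitivity(x):
    """Scalar sensitivity (54): the second derivative of d with respect to
    x^, evaluated at x^ = x. For the log error above it equals 2/x^2."""
    return 2.0 / x ** 2

# Small-error check of the quadratic approximation (53) with K = 1:
x, delta = 2.0, 0.01
exact = log_error(x, x + delta)
approx = 0.5 * sensitivity(x) * delta ** 2
```

For this small `delta` the quadratic form matches the exact error to well under one percent, confirming that the approximation is accurate in the small-error regime where quantization operates at high rate.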


Quantizer design

The optimal quantizer can be formulated using the constrained optimization framework, by minimizing the expected distortion, $D$, while constraining the average rate to be below or equal to the available rate budget. The constraint is defined using (48) for a fixed-rate coder or (50) for a variable-rate coder. The resulting quantizer designs are referred to as a resolution-constrained quantizer and an entropy-constrained quantizer, respectively. In general, an entropy-constrained quantizer has better rate-distortion performance than a resolution-constrained quantizer, due to the flexibility of assigning bit sequences of different lengths to different code vectors according to the probability of their appearance. In Paper C, we propose a flexible design of an entropy-constrained vector quantizer.

Traditional quantizer design is typically based on an iterative training algorithm [23, 81, 82, 91]. By iteratively optimizing the encoder (decision regions, or quantization cells) and the decoder (the reconstruction point of each cell), while ensuring that the performance is non-decreasing at each iteration, a locally optimal quantizer is constructed. Using a proper initialization scheme, and given sufficient training data, good rate-distortion performance can be achieved by a training-based quantizer.

An issue with the traditional quantizer design is the computational complexity and memory requirement associated with the codebook search and storage. As the number of code vectors increases exponentially with the rate, and in general a higher rate (per vector) is needed for quantizing higher-dimensional vectors, the approach is often limited to low-rate, scalar or low-dimensional vector quantizers.

Another disadvantage of the traditional training-based quantizer design is that the resulting quantizer has a fixed codebook. Its rate-distortion performance is determined off-line, and adaptation of the codebook to the currently available bandwidth is unfeasible without losing optimality. To allow a certain flexibility, one can optimize multiple codebooks, each for a different rate, and select the best codebook. A disadvantage of this approach is, however, the additional memory requirement, e.g., for the storage of multiple codebooks.

It is desirable to have a flexible quantizer design that allows for fine-grained rate adaptation, such that the rate can be selected from a larger range with a finer resolution, without increasing the memory requirement. To achieve this, the quantizer design should avoid the use of an explicit codebook (which requires storage), and the code vectors should be constructed on-line for a given rate. High-rate theory provides a theoretical basis for designing such a quantizer.


3.2 High-rate theory

High-rate theory is a mathematical framework for analyzing the behavior and predicting the performance of a quantizer of any dimensionality. High-rate theory originated from the early works on scalar quantization [15, 47, 96, 98], and was later extended to vector quantization [45, 144]. A comprehensive tutorial on quantization and high-rate theory is given in [54]. Textbooks treating the topic include [16, 46, 56, 71].

High-rate theory applies when the rate is high with respect to the smoothness of the source PDF $f(x)$. Under this high-rate assumption, $f(x)$ may be assumed to be constant within each quantization cell, such that $f(x) \approx f(\hat{x})$ in $\text{cell}(\hat{x})$. The probability mass function (PMF) of $\hat{x}$ can then be approximated using

$$p(\hat{x}) \approx f(\hat{x})\,\text{vol}(\hat{x}), \qquad (55)$$

where $\text{vol}(\hat{x})$ denotes the volume of $\text{cell}(\hat{x})$.

Instead of using a codebook, a high-rate quantizer can be formulated as a quantization point density function, specifying the distribution of the code vectors, e.g., the number of code vectors per unit volume. A quantization point density function may be defined as a continuous function that approximates the inverse volumes of the quantization cells [71],

$$g_c(x) \approx \frac{1}{\text{vol}(\hat{x})}, \quad \text{if } x \in \text{cell}(\hat{x}). \qquad (56)$$

Under the high-rate assumption, optimal solutions for $g_c(x)$ can be derived for different rate constraints. Practical quantizers designed according to the obtained $g_c(x)$ approach optimality with increasing rate. Nevertheless, quantizers designed based on high-rate theory have shown competitive performance also at lower rates. In Paper C, the actual performance of the proposed quantizer is very close to the performance predicted by high-rate theory for rates higher than 3-4 bits per dimension.

High-rate distortion

The average distortion of a quantizer can be reformulated using the high-rate assumption. For the MSE distortion measure, (51) can be written as

$$D_{\text{MSE}} = \frac{1}{K}\int \|x-\hat{x}\|^2 f(x)\,dx = \frac{1}{K}\sum_{\hat{x}}\int_{\text{cell}(\hat{x})}\|x-\hat{x}\|^2 f(x)\,dx \approx \frac{1}{K}\sum_{\hat{x}}\frac{p(\hat{x})}{\text{vol}(\hat{x})}\int_{\text{cell}(\hat{x})}\|x-\hat{x}\|^2\,dx. \qquad (57)$$


Let

$$C(\hat{x}) = \frac{1}{K}\,\text{vol}(\hat{x})^{-\frac{K+2}{K}}\int_{\text{cell}(\hat{x})}\|x-\hat{x}\|^2\,dx \qquad (58)$$

denote the coefficient of quantization of $\hat{x}$. $C(\hat{x})$ is the distortion normalized with respect to volume, and shows the effect of the geometry of the cell. The distortion can then be written as

$$D_{\text{MSE}} = \sum_{\hat{x}} p(\hat{x})\,\text{vol}(\hat{x})^{\frac{2}{K}}\, C(\hat{x}). \qquad (59)$$

Gersho conjectured that the optimal quantizer has the same cell geometry for all quantization cells, and hence a constant coefficient of quantization $C_{\text{opt}}$ [45]. Using the optimal cell geometry, $C(\hat{x}) \approx C_{\text{opt}}$, the distortion is

$$D_{\text{MSE}} \approx C_{\text{opt}} \sum_{\hat{x}} p(\hat{x})\,\text{vol}(\hat{x})^{\frac{2}{K}} \approx C_{\text{opt}} \int f(x)\, g_c(x)^{-\frac{2}{K}}\, dx. \qquad (60)$$
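For $K = 1$ the optimal cell is an interval, whose coefficient of quantization (58) is $1/12$, so (59) predicts $D \approx \Delta^2/12$ for a uniform quantizer with step $\Delta$. A quick Monte Carlo check (illustrative, not from the thesis):

```python
import numpy as np

# Uniform scalar quantization of a standard Gaussian: at high rate (small
# step delta), the empirical MSE should match the prediction delta^2/12.
rng = np.random.default_rng(0)
delta = 0.1
x = rng.normal(size=200_000)
xq = delta * np.round(x / delta)      # uniform (integer-lattice) quantizer
d_emp = np.mean((x - xq) ** 2)
d_highrate = delta ** 2 / 12.0
```

With `delta = 0.1` the source PDF is nearly constant over each cell, and the two distortion values agree to within Monte Carlo noise.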

Resolution-constrained vector quantizer

The optimal resolution-constrained vector quantizer (RCVQ) has the quantization point density function $g_c(\cdot)$ that minimizes the average distortion $D$, under the rate constraint that the number of code vectors equals $N$. The constraint may be formulated through integration of $g_c(x)$,

$$\int g_c(x)\, dx = N. \qquad (61)$$

The optimal $g_c(\cdot)$ minimizes the extended criterion

$$\eta_{\text{rcvq}} = C_{\text{opt}} \int f(x)\, g_c(x)^{-\frac{2}{K}}\, dx + \lambda\left(\int g_c(x)\, dx - N\right), \qquad (62)$$

where the first term is the high-rate distortion (60) and $\lambda$ is the Lagrange multiplier for the rate constraint. Solving the Euler-Lagrange equation and combining with (61), the optimal $g_c(x)$ for resolution-constrained vector quantization with respect to the MSE distortion measure is given by

$$g_c(x) = \frac{N\, f(x)^{\frac{K}{K+2}}}{\int f(x)^{\frac{K}{K+2}}\, dx}. \qquad (63)$$


Entropy-constrained vector quantizer

The optimal entropy-constrained vector quantizer (ECVQ) has a quantization point density function $g_c(\cdot)$ that minimizes the average distortion $D$, under the constraint that the entropy of the code vectors equals a desired target rate $R_t$. The entropy of the quantized variable at high rate can be formulated as

$$H \approx -\sum_{\hat{x}} f(\hat{x})\,\text{vol}(\hat{x})\,\log_2\!\big(f(\hat{x})\,\text{vol}(\hat{x})\big) \approx -\int f(x)\,\log_2\frac{f(x)}{g_c(x)}\,dx = h + \int f(x)\,\log_2 g_c(x)\,dx, \qquad (64)$$

where $h = -\int f(x)\log_2 f(x)\,dx$ is the differential entropy of the source. Therefore, the constraint can be written as $\int f(x)\log_2 g_c(x)\,dx = R'_t$, where $R'_t = R_t - h$. The optimal $g_c(\cdot)$ minimizes the extended criterion

$$\eta_{\text{ecvq}} = C_{\text{opt}} \int f(x)\, g_c(x)^{-\frac{2}{K}}\, dx + \lambda\left(\int f(x) \log_2 g_c(x)\, dx - R'_t\right), \qquad (65)$$

where $\lambda$ is the Lagrange multiplier for the entropy constraint. The optimal $g_c(x)$ can be solved for similarly to the RCVQ case, and the resulting $g_c(x)$ is a constant [45],

$$g_c(x) = g_c = 2^{R_t - h}. \qquad (66)$$

This important result implies that, at high rate, uniformly distributed quantization points form an optimal quantizer if the point indices are subsequently coded using an entropy code.
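The result (66) can be checked empirically: a uniform quantizer has constant point density $g_c = 1/\Delta$, and by (64) its index entropy should approach $h - \log_2 \Delta$. A Monte Carlo sketch for a unit-variance Gaussian, where $h = \tfrac{1}{2}\log_2(2\pi e)$ (illustrative, not code from the thesis):

```python
import numpy as np

# Uniform scalar quantizer with step delta on a unit-variance Gaussian:
# the empirical entropy of the quantization indices should be close to
# the high-rate prediction h - log2(delta).
rng = np.random.default_rng(1)
delta = 0.25
idx = np.round(rng.normal(size=500_000) / delta).astype(int)
_, counts = np.unique(idx, return_counts=True)
p = counts / counts.sum()
h_index = -np.sum(p * np.log2(p))             # empirical index entropy (50)
h_diff = 0.5 * np.log2(2.0 * np.pi * np.e)    # differential entropy of N(0,1)
predicted = h_diff - np.log2(delta)           # high-rate prediction from (64)
```

The uniform points carry no information about the source by themselves; it is the subsequent entropy code that exploits the non-uniform index probabilities.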

3.3 High-rate quantizer design

High-rate theory provides a theoretical foundation and guideline for designing practical quantizers. Quantizer designs motivated by high-rate theory include lattice quantization [9, 10, 25, 27, 28, 41, 66, 127, 134, 145] and GMM based quantization [52, 58, 106, 112-115, 127-129, 148, Paper C].

Lattice quantization

A lattice $\Lambda$ is an infinite set of regularly arranged vectors in the Euclidean space $\mathbb{R}^K$ [28]. It can be defined through a generating matrix $G$, assumed to be a square matrix,

$$\Lambda(G) = \{G^T u : u \in \mathbb{Z}^K\}, \qquad (67)$$


Figure 7: The 2-dimensional hexagonal lattice.

where $G$ is a matrix with linearly independent row vectors that form a set of basis vectors in $\mathbb{R}^K$. The lattice points generated through $G$ are integer linear combinations of the row vectors of $G$.

Applied to quantization, each lattice point $\lambda \in \Lambda$ is associated with a Voronoi region containing all points in $\mathbb{R}^K$ closest to $\lambda$ according to the Euclidean distance measure. The Voronoi region corresponds to the decision region of a quantizer with the MSE distortion measure. Hence, a closest point search algorithm can be used for quantization. Figure 7 shows a two-dimensional lattice, the hexagonal lattice, defined by $G = \begin{bmatrix} 1 & 0 \\ 0.5 & \sqrt{3}/2 \end{bmatrix}$, together with its Voronoi cells.

The high-rate theory for entropy-constrained vector quantization forms the theoretical basis for using uniformly distributed code vectors. Such vectors may be generated using a lattice, and the quantizer is then a lattice quantizer. The design of a coder based on an entropy-constrained lattice quantizer consists of the following steps: 1) selection of a suitable lattice for quantization, 2) design of a closest point search algorithm for the selected lattice, and 3) indexing and assignment of variable-length codewords to the code vectors.

The selection of a good lattice for quantization is considered in [9] for the MSE distortion. Since the MSE performance of a lattice quantizer is proportional to the coefficient of quantization associated with the Voronoi cell shape of the lattice, lattice optimization for quantization is based on an optimization with respect to the coefficient of quantization. A study and tutorial on the "best known" lattices for quantization is given in [9]. A numerical optimization method for obtaining a locally optimal lattice for quantization in an arbitrary dimension is also proposed in [9].

Closest point search algorithms for lattices are considered in [10, 27]. The algorithms proposed in [27] are based on the geometrical structures


of the lattices, and are therefore lattice-specific. A general search algorithm that applies to an arbitrary lattice has been proposed in [10]. Both methods are considerably faster than a full search over all code vectors in a traditional codebook, as they exploit the geometrically regular arrangement of the code vectors. Since the search complexity and memory requirement of a traditional codebook grow exponentially with the vector dimension and the rate, lattice quantization is an attractive approach for high-dimensional vector quantization.

The practical application of entropy-constrained lattice quantization has been limited by the lack of a practical algorithm for indexing and assigning variable-length codewords that works at high rates and dimensions. In [25, 26, 66], lattice truncation and indexing of the lattice vectors are considered. However, designing a practical entropy code remains limited to low dimensions and rates, due to the exponentially increasing number of code vectors in the truncated lattice. In Paper C, we propose a practical indexing and entropy coding algorithm for lattice quantization of a Gaussian variable as part of a GMM-based entropy-constrained vector quantizer design.
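Why variable-length codewords matter here can be illustrated in one dimension, where a uniform scalar quantizer is a 1-D lattice quantizer. The sketch below (an illustrative example, not the algorithm of Paper C) computes the ideal codeword length −log₂ pᵢ of each cell for a unit-variance Gaussian source; cells in the tails get long codewords, and truncating the lattice discards an exponentially growing set of rarely used cells.

```python
import math

# Ideal codeword lengths for a N(0,1) source quantized by a uniform (1-D
# lattice) quantizer with step d: cell i covers ((i-0.5)d, (i+0.5)d], and its
# ideal entropy-code length is -log2 of its probability.
def cell_prob(i, d):
    """Probability that a N(0,1) sample falls in cell i of width d."""
    Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
    return Phi((i + 0.5) * d) - Phi((i - 0.5) * d)

d = 0.5
lengths = {i: -math.log2(cell_prob(i, d)) for i in range(-3, 4)}
print(round(lengths[0], 2), round(lengths[3], 2))  # centre cell short, tail cell longer
```

An arithmetic code, as used in Paper C, can realize these ideal lengths to within a small overhead without storing an explicit codeword table.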

GMM-based quantization

High-rate theory assumes that the PDF of the source is known. However, for most practical sources, the PDF is unknown and needs to be estimated from the data. As discussed in Section 2.3, GMMs form a convenient tool for modeling real-world source vectors with an unknown PDF [107]. They have been successfully applied to quantization, e.g., [52, 58, 114, 127, 128, 131, Paper C].

GMMs have been applied to approximate the high-rate optimal quantization point density for resolution-constrained quantization [58, 114]. Once the optimal quantization point density is obtained, a codebook can be generated using a pseudo-random vector generator. Thus, this method does not require the storage of multiple codebooks for different rate requirements. However, the method shares the same search complexity as a quantizer based on a traditional codebook, so its application is limited to low rates and dimensions. Nevertheless, the method is useful for predicting the performance of an optimal quantizer [58, 114].
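The codebook-generation idea can be sketched as follows. Note the simplification: the sketch draws code vectors from the GMM itself, whereas the optimal resolution-constrained point density is proportional to a power of the PDF rather than the PDF itself; the parameter values are arbitrary.

```python
import random

# Simplified sketch of codebook generation from a model density: draw code
# vectors with a pseudo-random generator from a 1-D, two-component GMM.
# (Sampling the PDF f itself is a crude stand-in for the optimal
# resolution-constrained point density, which is proportional to f^(k/(k+2)).)
random.seed(0)
weights = [0.7, 0.3]
means, stds = [-2.0, 3.0], [1.0, 0.5]

def draw_code_vector():
    m = 0 if random.random() < weights[0] else 1   # pick a mixture component
    return random.gauss(means[m], stds[m])          # then sample from it

codebook = sorted(draw_code_vector() for _ in range(16))
print(len(codebook))  # 16 code vectors; no per-rate codebook storage needed
```

Changing the target rate only changes the number of vectors drawn, which is why no codebooks need to be stored, but encoding still requires a full search over the generated vectors.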

Another GMM quantizer design is based on classification and component quantizers [127–129]. Such a quantizer quantizes the source vector using only one of the mixture component quantizers. The index of the quantizing mixture component is transmitted as side information. The scheme is conceptually simple, since the problem is reduced to quantizer design for a single Gaussian component. Similar ideas based on classified quantization appear in image coding [131] and speech coding [68–70].
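A toy one-dimensional version of such classified quantization, with uniform scalar quantizers standing in for the component quantizers (all parameter values are arbitrary), might look like:

```python
import math

# Classified GMM quantization in 1-D: pick the mixture component with the
# largest posterior, quantize with that component's (here: simple uniform)
# quantizer, and transmit the component index as side information.
weights, means, stds = [0.5, 0.5], [-4.0, 4.0], [1.0, 1.0]

def pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def classify(x):
    post = [w * pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    return max(range(len(post)), key=post.__getitem__)

def quantize(x, step=0.25):
    k = classify(x)                                      # side information
    q = means[k] + step * round((x - means[k]) / step)   # component quantizer
    return k, q

print(quantize(3.6))  # -> component 1, reconstruction 3.5
```

Only the selected component's quantizer is searched, which is the source of the scheme's low complexity.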

The methods of [127–129] are resolution-constrained vector quantizers optimized for a fixed target rate. The overall bit rate budget is distributed


among the mixture components and for transmitting the component index. The component quantizers are based on the so-called companding technique, which attempts to approximate the optimal high-rate codebook of a Gaussian variable. It consists of an invertible mapping function that maps the input variable into a domain where a lattice quantizer is applied. The forward transform is referred to as the compressor and the inverse transform as the expander. Applying the expander function to the lattice-structured codebook results in a codebook that approximates the optimal codebook in the original domain. Unfortunately, the companding technique is suboptimal for vector dimensions higher than two [20, 21, 45, 95, 119]. The sub-optimality is due to the fact that the expander function scales the quantization cells differently in different dimensions and at different locations in space; hence, the quantization cells lose their optimal shape in the original domain. The great advantage of the schemes proposed in [127–129] is their low computational complexity. Also, by utilizing a lattice quantizer in the transformed domain, no codebook storage is required.
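The companding idea is easiest to see in one dimension, where it does not suffer from the cell-shape problem described above. The sketch below uses the Gaussian CDF as compressor and a bisection-computed inverse as expander (the bisection helper is an assumed stand-in, not a library routine):

```python
import math

# 1-D companding sketch for a N(0,1) variable: the compressor maps the input
# through the Gaussian CDF into [0, 1], where a uniform (lattice) quantizer is
# applied; the expander is the inverse CDF.
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

def inv_Phi(p, lo=-10.0, hi=10.0):
    """Inverse Gaussian CDF by bisection (assumed helper)."""
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

def compand_quantize(x, levels=8):
    u = Phi(x)                                  # compressor
    cell = min(int(u * levels), levels - 1)     # uniform quantizer in [0, 1]
    u_hat = (cell + 0.5) / levels               # cell centre
    return inv_Phi(u_hat)                       # expander

print(compand_quantize(0.1))
```

In the transformed domain all cells are equal, so only the step size needs to be stored; the expander warps the reconstruction points back toward the dense regions of the Gaussian.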

The application of classified quantization to entropy-constrained quantization has been discussed in [52, 71]. Motivated by high-rate theory, the component quantizers of an entropy-constrained quantizer are based on lattice-structured codebooks. The focus of [52, 71] is on the high-rate analysis of such a quantizer. In Paper C, we propose a practical GMM-based entropy-constrained quantizer using the classified quantization framework. In particular, a practical solution to indexing and entropy coding for the lattice quantizers of the Gaussian components is proposed. We show that, under certain conditions, the scheme approaches the theoretically optimal performance with increasing rate.

4 Summary of contributions

The focus of this thesis is on statistical model based techniques for speech enhancement and coding. The main contributions of the thesis can be summarized as

• improved speech and noise modeling and estimation techniques applied to speech enhancement, and

• improvements towards flexible source coding.

The thesis consists of three research articles. Short summaries of the articles are presented below.


Paper A: HMM-based Gain-Modeling for Enhancement of Speech in Noise

In Paper A, we propose a hidden Markov model (HMM) based speech enhancement method using stochastic-gain HMMs (SG-HMMs) for both speech and noise. We demonstrate that accurate modeling and estimation of speech and noise gains facilitate good performance in speech enhancement. Through the introduction of stochastic gain variables, energy variation in both speech and noise is explicitly modeled in a unified framework. The speech gain models the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain helps to improve the tracking of the time-varying energy of non-stationary noise. The expectation-maximization (EM) algorithm is used to perform off-line estimation of the time-invariant model parameters, and the time-varying model parameters are estimated on-line using the recursive EM algorithm. The proposed gain modeling techniques are applied to a novel Bayesian speech estimator, and the performance of the proposed enhancement method is evaluated through objective and subjective tests. The experimental results confirm the advantage of explicit gain modeling, particularly for non-stationary noise sources.

Paper B: On-line Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement

In Paper B, we extend the work in Paper A and introduce an on-line noise estimation algorithm. A stochastic-gain HMM is used to model the statistics of non-stationary noise with time-varying energy. The noise model is adaptive, and its parameters are estimated on-line from noisy observations. The parameter estimation is derived for the maximum-likelihood criterion, and the algorithm is based on the recursive expectation-maximization (EM) framework. The proposed method facilitates continuous adaptation to changes in both the noise spectral shape and the noise energy level, e.g., due to movement of the noise source. Using the estimated noise model, we also develop an estimator of the noise power spectral density (PSD) based on recursive averaging of estimated noise sample spectra. We demonstrate that the proposed scheme achieves more accurate estimates of the noise model and noise PSD. When integrated into a speech enhancement system, the proposed scheme yields a lower level of residual noise in the enhanced speech.
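The recursive-averaging principle behind such a PSD estimator can be sketched generically. The actual estimator in Paper B averages model-based estimates of the noise sample spectra; the sketch below is a plain exponential smoother over per-bin power frames, with an arbitrary forgetting factor.

```python
# Generic recursive averaging of per-bin spectral powers with forgetting
# factor alpha: each new frame nudges the running estimate, so the estimate
# tracks slow changes in the noise PSD while smoothing frame-to-frame
# fluctuations.
def recursive_psd(frames, alpha=0.9):
    """Smooth per-bin power frames with an exponential forgetting factor."""
    est = None
    for frame in frames:                 # frame: list of per-bin powers
        if est is None:
            est = list(frame)            # initialize from the first frame
        else:
            est = [alpha * e + (1 - alpha) * f for e, f in zip(est, frame)]
    return est

frames = [[1.0, 4.0], [1.2, 3.8], [0.8, 4.2]]
print(recursive_psd(frames, alpha=0.5))
```

A smaller alpha adapts faster but leaves more variance in the estimate; the model-based weighting in Paper B replaces this fixed trade-off.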

Paper C: Entropy-Constrained Vector Quantization Using Gaussian Mixture Models

In Paper C, a flexible entropy-constrained vector quantization scheme based on Gaussian mixture models (GMMs), lattice quantization, and arithmetic


coding is presented. A source vector is assumed to have a probability density function described by a GMM. The source vector is first classified to one of the mixture components, and the Karhunen-Loève transform of the selected mixture component is applied to the vector, followed by quantization using a lattice-structured codebook. Finally, the scalar elements of the quantized vector are entropy coded sequentially using a specially designed arithmetic code. The proposed scheme has a computational complexity that is independent of rate and quadratic in the vector dimension. Hence, the scheme facilitates quantization of high-dimensional source vectors. The flexible design allows the average rate to be changed on-the-fly. The theoretical performance of the scheme is analyzed under a high-rate assumption. We show that, at high rate, the scheme approaches the theoretically optimal performance if the mixture components are located far apart. The performance loss when the mixture components are located close to each other can be quantified for a given GMM. The practical performance of the scheme was evaluated through simulations on both synthetic and speech-derived source vectors, and competitive performance has been demonstrated.

References

[1] Enhanced Variable Rate Codec, speech service option 3 for wideband spread spectrum digital systems. TIA/EIA/IS-127, Jul. 1996.

[2] Perceptual Evaluation of Speech Quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862, Feb. 2001.

[3] Test results of Mitsubishi AMR-NS solution based on TS 26.077. 3GPP Tdoc S4-020251, May 2002.

[4] Test results of NEC AMR-NS solution based on TS 26.077. 3GPP Tdoc S4-020415, Jul. 2002.

[5] Test results of NEC low-complexity AMR-NS solution based on TS 26.077. 3GPP Tdoc S4-020417, Jul. 2002.

[6] Minimum performance requirements for noise suppressor; application to the Adaptive Multi-Rate (AMR) speech encoder (release 5). 3GPP TS 26.077, Jun. 2003.

[7] Selectable Mode Vocoder (SMV) service option for wideband spread spectrum communication systems. 3GPP2 Document C.S0030-0, Jan. 2004. Version 3.0.


[8] Test results of Fujitsu AMR-NS solution based on TS 26.077 specification. 3GPP Tdoc S4-050016, Feb. 2005.

[9] E. Agrell and T. Eriksson. Optimization of lattices for quantization. IEEE Trans. Inform. Theory, 44(5):1814–1828, Sep. 1998.

[10] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger. Closest point search in lattices. IEEE Trans. Inform. Theory, 48(8):2201–2214, Aug. 2002.

[11] J. Baker. The DRAGON system – an overview. IEEE Trans. Acoust., Speech, Signal Processing, 23(1):24–29, Feb. 1975.

[12] N. K. Bakirov. Central limit theorem for weakly dependent variables. Mathematical Notes, 41(1):63–67, Jan. 1987.

[13] L. E. Baum and J. A. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist., 37:1554–1563, 1966.

[14] J. Beerends and J. Stemerdink. A perceptual speech-quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc., 42(3):115–123, 1994.

[15] W. R. Bennett. Spectra of quantized signals. Bell System Technical Journal, 27:446–472, 1948.

[16] T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, Englewood Cliffs, N.J., 1971.

[17] J. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of Berkeley, 1997.

[18] S. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust., Speech, Signal Processing, 2(2):113–120, Apr. 1979.

[19] C. Breithaupt and R. Martin. MMSE estimation of magnitude-squared DFT coefficients with supergaussian priors. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 896–899, Apr. 2003.

[20] J. Bucklew. Companding and random quantization in several dimensions. IEEE Trans. Inform. Theory, 27(2):207–211, Mar. 1981.

[21] J. Bucklew. A note on optimal multidimensional companders (corresp.). IEEE Trans. Inform. Theory, 29(2):279–279, Mar. 1983.


[22] O. Cappe. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech and Audio Processing, 2(2):345–349, Apr. 1994.

[23] P. A. Chou, T. Lookabaugh, and R. M. Gray. Entropy-constrained vector quantization. IEEE Trans. Acoust., Speech, Signal Processing, 37(1):31–42, Jan. 1989.

[24] I. Cohen. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Trans. Speech and Audio Processing, 11(5):466–475, Sep. 2003.

[25] J. Conway and N. J. A. Sloane. A fast encoding method for lattice codes and quantizers. IEEE Trans. Inform. Theory, 29(6):820–824, Nov. 1983.

[26] J. H. Conway, E. M. Rains, and N. J. A. Sloane. On the existence of similar sublattices. Canadian Jnl. Math., 51:1300–1306, 1999.

[27] J. H. Conway and N. J. A. Sloane. Fast quantizing and decoding algorithms for lattice quantizers and codes. IEEE Trans. Inform. Theory, 28(2):227–232, Mar. 1982.

[28] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices and Groups. Springer-Verlag, 2nd edition, 1993.

[29] T. Dau, D. Puschel, and A. Kohlrausch. A quantitative model of the effective signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am., 99(6):3615–3622, June 1996.

[30] T. Dau, D. Puschel, and A. Kohlrausch. A quantitative model of the effective signal processing in the auditory system. II. Simulations and measurements. J. Acoust. Soc. Am., 99(6):3623–3631, June 1996.

[31] A. P. Dempster, N. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39(1):1–38, 1977.

[32] L. Deng, J. Droppo, and A. Acero. Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition. IEEE Trans. Speech and Audio Processing, 11(6):568–580, Nov. 2003.

[33] V. Digalakis, P. Monaco, and H. Murveit. Genones: generalized mixture tying in continuous hidden Markov model-based speech recognizers. IEEE Trans. Speech and Audio Processing, 4(4):281–289, July 1996.


[34] V. Digalakis, D. Rtischev, and L. Neumeyer. Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech and Audio Processing, 3(5):357–366, Sept. 1995.

[35] Y. Ephraim. A Bayesian estimation approach for speech enhancement using hidden Markov models. IEEE Trans. Signal Processing, 40(4):725–735, Apr. 1992.

[36] Y. Ephraim. Gain-adapted hidden Markov models for recognition of clean and noisy speech. IEEE Trans. Signal Processing, 40(6):1303–1316, Jun. 1992.

[37] Y. Ephraim. Statistical-model-based speech enhancement systems. Proceedings of the IEEE, 80(10):1526–1555, Oct. 1992.

[38] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Processing, 32(6):1109–1121, Dec. 1984.

[39] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Processing, 33(2):443–445, Apr. 1985.

[40] Y. Ephraim, D. Malah, and B.-H. Juang. On the application of hidden Markov models for enhancing noisy speech. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):1846–1856, Dec. 1989.

[41] T. Eriksson. Vector Quantization in Speech Coding – Variable Rate, Memory and Lattice Quantization. PhD thesis, Chalmers University of Technology, 1996.

[42] T. H. Falk and W.-Y. Chan. Nonintrusive speech quality estimation using Gaussian mixture models. IEEE Signal Processing Letters, 13(2):108–111, Feb. 2006.

[43] G. Fant. Acoustic Theory of Speech Production. The Hague, Netherlands: Mouton, 1970.

[44] W. Gardner and B. Rao. Theoretical analysis of the high-rate vector quantization of LPC parameters. IEEE Trans. Speech and Audio Processing, 3(5):367–381, Sept. 1995.

[45] A. Gersho. Asymptotically optimal block quantization. IEEE Trans. Inform. Theory, 25(4):373–380, Oct. 1979.

[46] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Boston, MA: Kluwer Academic Publishers, 1992.


[47] H. Gish and J. N. Pierce. Asymptotically efficient quantizing. IEEE Trans. Inform. Theory, 14(5):676–683, 1968.

[48] Z. Goh, K.-C. Tan, and T. Tan. Postprocessing method for suppressing musical noise generated by spectral subtraction. IEEE Trans. Speech and Audio Processing, 6(3):287–292, May 1998.

[49] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn. Low complexity, non-intrusive speech quality assessment. IEEE Trans. Audio, Speech and Language Processing, 14:1948–1956, Nov. 2006.

[50] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn. Non-intrusive speech quality assessment with low computational complexity. In Interspeech – ICSLP, September 2006.

[51] R. Gray. Reply to comments on 'Source coding of the discrete Fourier transform'. IEEE Trans. Inform. Theory, 26(5):626–626, Sep. 1980.

[52] R. Gray. Gauss mixture vector quantization. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 3, pages 1769–1772, 2001.

[53] R. Gray, A. Buzo, A. Gray, Jr., and Y. Matsuyama. Distortion measures for speech processing. IEEE Trans. Acoust., Speech, Signal Processing, 28(4):367–376, Aug. 1980.

[54] R. Gray and D. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):2325–2383, Oct. 1998.

[55] R. Gray, J. Young, and A. Aiyer. Minimum discrimination information clustering: modeling and quantization with Gauss mixtures. In Proc. IEEE Int. Conf. Image Processing, volume 3, pages 14–17, Oct. 2001.

[56] R. M. Gray. Source Coding Theory. Kluwer Academic Publishers, 1990.

[57] S. Gustafsson, P. Jax, and P. Vary. A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 397–400, 1998.

[58] P. Hedelin and J. Skoglund. Vector quantization based on Gaussian mixture models. IEEE Trans. Speech and Audio Processing, 8(4):385–401, Jul. 2000.

[59] H. G. Hirsch and C. Ehrlicher. Noise estimation techniques for robust speech recognition. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 153–156, 1995.


[60] Y. Hu and P. Loizou. Subjective comparison of speech enhancement algorithms. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 153–156, 2006.

[61] Y. Hu and P. Loizou. A comparative intelligibility study of speech enhancement algorithms. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 561–564, 2007.

[62] Y. Hu and P. C. Loizou. A perceptually motivated approach for speech enhancement. IEEE Trans. Speech and Audio Processing, 11(5):457–465, Sep. 2003.

[63] D. A. Huffman. A method for the construction of minimum redundancy codes. Proc. IRE, 40:1098–1101, 1952.

[64] N. S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice Hall, 1984.

[65] F. Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(4):532–556, April 1976.

[66] D. G. Jeong and J. D. Gibson. Uniform and piecewise uniform lattice vector quantization for memoryless Gaussian and Laplacian sources. IEEE Trans. Inform. Theory, 39(3):786–804, May 1993.

[67] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1993.

[68] M. Y. Kim and W. B. Kleijn. KLT-based classified VQ for the speech signal. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 645–648, 2002.

[69] M. Y. Kim and W. B. Kleijn. KLT-based classified VQ with companding. In Proc. IEEE Works. Speech Coding, pages 8–10, Oct. 2002.

[70] M. Y. Kim and W. B. Kleijn. KLT-based adaptive classified VQ of the speech signal. IEEE Trans. Speech and Audio Processing, 12(3):277–289, May 2004.

[71] W. B. Kleijn. A basis for source coding. Lecture notes, KTH, Stockholm, Jan. 2003.

[72] V. Krishnamurthy and J. Moore. On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure. IEEE Trans. Signal Processing, 41(8):2557–2573, Aug. 1993.


[73] M. Kuropatwinski and W. B. Kleijn. Estimation of the excitation variances of speech and noise AR-models for enhanced speech coding. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 669–672, May 2001.

[74] M. Kuropatwinski and W. B. Kleijn. Minimum mean square error estimation of speech short-term predictor parameters under noisy conditions. 1:96–99, 2003.

[75] G. Langdon and J. Rissanen. A simple general binary source code (corresp.). IEEE Trans. Inform. Theory, 28(5):800–803, Sep. 1982.

[76] G. Langdon and J. Rissanen. Correction to 'A simple general binary source code' (Sep. 82, 800–803). IEEE Trans. Inform. Theory, 29(5):778–779, Sep. 1983.

[77] G. G. Langdon. An introduction to arithmetic coding. IBM Journal of Research and Development, 28:135–149, 1984.

[78] J. S. Lim, editor. Speech Enhancement (Prentice-Hall Signal Processing Series). Prentice Hall, 1983.

[79] L. Lin, W. H. Holmes, and E. Ambikairajah. Speech denoising using perceptual modification of Wiener filtering. Electronics Letters, 38(23):1486–1487, Nov. 2002.

[80] J. Lindblom and J. Samuelsson. Bounded support Gaussian mixture modeling of speech spectra. IEEE Trans. Speech and Audio Processing, 11(1):88–99, Jan. 2003.

[81] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEE Trans. Communications, 28(1):84–95, Jan. 1980.

[82] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Inform. Theory, 28(2):129–137, Mar. 1982.

[83] P. Lockwood and J. Boudy. Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars. Speech Communication, 11(2-3):215–228, 1992.

[84] B. Logan and T. Robinson. Adaptive model-based speech enhancement. Speech Communication, 34(4):351–368, Jul. 2001.

[85] T. Lotter and P. Vary. Noise reduction by maximum a posteriori spectral amplitude estimation with supergaussian speech modeling. In Inter. Workshop on Acoustic Echo and Noise Control (IWAENC2003), pages 86–89, Sept. 2003.


[86] T. Lotter and P. Vary. Noise reduction by joint maximum a posteriori spectral amplitude and phase estimation with super-Gaussian speech modelling. In Proc. 12th European Signal Processing Conference, EUSIPCO, pages 1457–1460, 2004.

[87] D. Malah, R. Cox, and A. Accardi. Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 2, pages 789–792, March 1999.

[88] R. Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech and Audio Processing, 9(5):504–512, Jul. 2001.

[89] R. Martin. Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 253–256, 2002.

[90] R. Martin. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Trans. Speech and Audio Processing, 13(5):845–856, Sept. 2005.

[91] J. Max. Quantizing for minimum distortion. IEEE Trans. Inform. Theory, 6(1):7–12, Mar. 1960.

[92] R. McAulay and M. Malpass. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust., Speech, Signal Processing, 28(2):137–145, Apr. 1980.

[93] B. L. McKinley and G. H. Whipple. Noise model adaptation in model based speech enhancement. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 2, pages 633–636, 1996.

[94] A. Meiri. Comments on 'Source coding of the discrete Fourier transform'. IEEE Trans. Inform. Theory, 26(5):626–626, Sep. 1980.

[95] P. Moo and D. Neuhoff. Optimal compressor functions for multidimensional companding. In Proc. IEEE Int. Symp. Info. Theory, page 515, 29 June–4 July 1997.

[96] B. Oliver, J. Pierce, and C. Shannon. The philosophy of PCM. Proceedings of the IRE, 36(11):1324–1331, Nov. 1948.

[97] K. K. Paliwal and W. B. Kleijn. Quantization of LPC parameters. In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis, chapter 12, pages 433–466. Elsevier, 1995.


[98] P. F. Panter and W. Dite. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44–48, 1951.

[99] W. Pearlman. Reply to comments on 'Source coding of the discrete Fourier transform'. IEEE Trans. Inform. Theory, 26(5):626–626, Sep. 1980.

[100] W. Pearlman and R. Gray. Source coding of the discrete Fourier transform. IEEE Trans. Inform. Theory, 24(6):683–692, Nov. 1978.

[101] J. Plasberg, D. Y. Zhao, and W. B. Kleijn. The sensitivity matrix for a spectro-temporal auditory model. In Proc. 12th European Signal Processing Conf. (EUSIPCO), pages 1673–1676, 2004.

[102] J. H. Plasberg and W. B. Kleijn. The sensitivity matrix: Using advanced auditory models in speech and audio processing. IEEE Trans. Audio, Speech and Language Processing, 15(1):310–319, Jan. 2007.

[103] J. Porter and S. Boll. Optimal estimators for spectral restoration of noisy speech. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 9, pages 53–56, 1984.

[104] S. Quackenbush, T. Barnwell, and M. Clements. Objective Measures of Speech Quality. Prentice Hall, 1988.

[105] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb. 1989.

[106] B. D. Rao. Speech coding using Gaussian mixture models. Technical report, MICRO Project 02-062, 2002–2003.

[107] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.

[108] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech and Audio Processing, 3(1):72–83, Jan. 1995.

[109] J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20:198, 1976.

[110] A. Rix, J. Beerends, M. Hollier, and A. Hekstra. Perceptual Evaluation of Speech Quality (PESQ) – A new method for speech quality assessment of telephone networks and codecs. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, volume 2, pages 749–752, Salt Lake City, USA, 2001.


[111] H. Sameti, H. Sheikhzadeh, L. Deng, and R. L. Brennan. HMM-based strategies for enhancement of speech signals embedded in nonstationary noise. IEEE Trans. Speech and Audio Processing, 6(5):445–455, Sep. 1998.

[112] J. Samuelsson. Waveform quantization of speech using Gaussian mixture models. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 165–168, May 2004.

[113] J. Samuelsson. Toward optimal mixture model based vector quantization. In Proc. Fifth Int. Conf. Information, Communications and Signal Processing (ICICS), Dec. 2005.

[114] J. Samuelsson and P. Hedelin. Recursive coding of spectrum parameters. IEEE Trans. Speech and Audio Processing, 9(5):492–503, Jul. 2001.

[115] J. Samuelsson and J. Plasberg. Multiple description coding based on Gaussian mixture models. IEEE Signal Processing Letters, 12(6):449–452, June 2005.

[116] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.

[117] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, Part 4, pages 142–163, 1959.

[118] J. W. Shin, J.-H. Chang, and N. S. Kim. Statistical modeling of speech signals based on generalized gamma distribution. IEEE Signal Processing Letters, 12(3):258–261, March 2005.

[119] S. Simon. On suboptimal multidimensional companding. In Proc. Data Compression Conference (DCC), pages 438–447, 30 March–1 April 1998.

[120] J. Sohn and W. Sung. A voice activity detector employing soft decision based noise spectrum adaptation. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 365–368, May 1998.

[121] S. Srinivasan, J. Samuelsson, and W. B. Kleijn. Speech enhancement using a-priori information. In Proc. Eurospeech, volume 2, pages 1405–1408, Sep. 2003.

[122] S. Srinivasan, J. Samuelsson, and W. B. Kleijn. Codebook-based Bayesian speech enhancement. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1, pages 1077–1080, 2005.


[123] S. Srinivasan, J. Samuelsson, and W. B. Kleijn. Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Trans. Audio, Speech and Language Processing, pages 1–14, Jan. 2006.

[124] S. Srinivasan, J. Samuelsson, and W. B. Kleijn. Codebook-based Bayesian speech enhancement for nonstationary environments. IEEE Trans. Audio, Speech and Language Processing, 15(2):441–452, Feb. 2007.

[125] V. Stahl, A. Fischer, and R. Bippus. Quantile based noise estimation for spectral subtraction and Wiener filtering. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 3, pages 1875–1878, Jun. 2000.

[126] J. Su and R. Mersereau. Coding using Gaussian mixture and generalized Gaussian models. In Proc. IEEE Int. Conf. Image Processing, volume 1, pages 217–220, Sep. 1996.

[127] A. D. Subramaniam, W. R. Gardner, and B. D. Rao. Low-complexity source coding using Gaussian mixture models, lattice vector quantization, and recursive coding with application to speech spectrum quantization. IEEE Trans. Audio, Speech and Language Processing, 14(2):524–532, Mar. 2006.

[128] A. D. Subramaniam and B. D. Rao. PDF optimized parametric vector quantization of speech line spectral frequencies. IEEE Trans. Speech and Audio Processing, 11(2):130–142, Mar. 2003.

[129] A. D. Subramaniam and B. D. Rao. Speech LSF quantization with rate independent complexity, bit scalability and learning. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 2, pages 705–708, 2001.

[130] H. Takahata. On the rates in the central limit theorem for weakly dependent random fields. Probability Theory and Related Fields, 64(4):445–456, Dec. 1983.

[131] M. Tasto and P. Wintz. Image coding by adaptive block quantization. IEEE Trans. Communication Technology, COM-19(6):957–972, 1971.

[132] D. M. Titterington. Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc. B, 46(2):257–267, 1984.

[133] D. E. Tsoukalas, J. N. Mourjopoulos, and G. Kokkinakis. Speech enhancement based on audible noise suppression. IEEE Trans. Speech and Audio Processing, 5(6):497–514, Nov. 1997.


[134] V. A. Vaishampayan, N. J. A. Sloane, and S. Servetto. Multiple description vector quantization with lattice codebooks: design and analysis. IEEE Trans. Inform. Theory, 47(5):1718–1734, Jul. 2001.

[135] P. Vary and R. Martin. Digital Speech Transmission - Enhancement, Coding and Concealment. Wiley, 2006.

[136] N. Virag. Single channel speech enhancement based on masking properties of the human auditory system. IEEE Trans. Speech and Audio Processing, 7(2):126–137, Mar. 1999.

[137] S. Voran. Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. Speech and Audio Processing, 7(4):371–382, 1999.

[138] S. Voran. Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. Speech and Audio Processing, 7(4):383–390, 1999.

[139] D. Wang and J. Lim. The unimportance of phase in speech enhancement. IEEE Trans. Acoust., Speech, Signal Processing, 30(4):679–681, Aug. 1982.

[140] S. Wang, A. Sekey, and A. Gersho. An objective measure for predicting subjective quality of speech coders. IEEE J. Selected Areas in Commun., 10(5):819–829, 1992.

[141] N. Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series. New York: Wiley, 1949.

[142] I. Witten, R. Neal, and J. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, Jun. 1987.

[143] P. J. Wolfe and S. J. Godsill. Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 2, pages 821–824, 2000.

[144] P. L. Zador. Topics in the asymptotic quantization of continuous random variables. Bell Labs Tech. Memo, 1966.

[145] D. Y. Zhao and W. B. Kleijn. Multiple-description vector quantization using translated lattices with local optimization. In Proc. IEEE Global Telecommunications Conference, volume 1, pages 41–45, 2004.

[146] D. Y. Zhao and W. B. Kleijn. On noise gain estimation for HMM-based speech enhancement. In Proc. Interspeech, pages 2113–2116, Sep. 2005.


[147] D. Y. Zhao and W. B. Kleijn. HMM-based gain-modeling for enhancement of speech in noise. IEEE Trans. Audio, Speech and Language Processing, 15:882–892, Mar. 2007.

[148] D. Y. Zhao, J. Samuelsson, and M. Nilsson. GMM-based entropy-constrained vector quantization. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 1097–1100, Apr. 2007.

[149] Y. Zhao. Frequency-domain maximum likelihood estimation for automatic speech recognition in additive and convolutive noises. IEEE Trans. Speech and Audio Processing, 8(3):255–266, May 2000.

[150] X. Zhuang, Y. Huang, K. Palaniappan, and Y. Zhao. Gaussian mixture density modeling, decomposition, and applications. IEEE Trans. Image Processing, 5(9):1293–1302, Sep. 1996.
