Bayesian Predictive Classification With Incremental Learning for Noisy Speech Recognition
朱國華
89/12/06
References
J.-T. Chien and K.-H. Liao, “Application of Bayesian Predictive Rules with Incremental Learning to Car Speech Recognition” (in Chinese), ROCLING XIII, pp. 179-197, 2000.
H. Jiang, K. Hirose and Q. Huo, “Robust Speech Recognition Based on a Bayesian Prediction Approach”, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 4, pp. 426-440, July 1999.
J.-T. Chien, “Online Hierarchical Transformation of Hidden Markov Models for Speech Recognition”, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 6, pp. 656-667, November 1999.
Contents
Introduction
Problem Formulation
Some Decision Rules for ASR
Transform-Based Bayesian Predictive Classification (TBPC)
Derivation of the Bayesian Predictive Likelihood Measurement (BPLM)
Online Prior Evolution (OPE)
Experiments and Discussions
Introduction
Transform-based Bayesian predictive classification provides robust decision rules for noisy speech recognition.
Online prior evolution copes with the nonstationary testing environment (both environmental and speaker variation).
Problem Formulation
Approximate MAP (Quasi-Bayesian, QB) estimation for ASR (the formula is reconstructed after the definitions below):
n : index of the input test utterance
W : word content (syllable string) of the input utterance
η : acoustic transformation parameter (function)
X(n) = {X1, X2, …, Xn} : the i.i.d. and successively observed block samples
φ(n-1) : the environmental statistics estimated from the previous input utterances X1, X2, …, Xn-1
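The QB estimation formula itself was an equation image on the original slide; a hedged reconstruction from the definitions above (notation assumed) is:

\[
(\hat{W}_n, \hat{\eta}_n) \;=\; \arg\max_{W,\,\eta}\; p(X_n \mid W, \eta)\, p(W, \eta \mid \varphi^{(n-1)})
\]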
Problem Formulation(cont.)
Assuming W and η are independent, the previous QB estimation can be rewritten as follows, where p(W) is the language model:
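A minimal sketch of the factored form, assuming the stated independence of W and η (the slide's own equation is not recoverable):

\[
(\hat{W}_n, \hat{\eta}_n) \;=\; \arg\max_{W,\,\eta}\; p(X_n \mid W, \eta)\, p(W)\, p(\eta \mid \varphi^{(n-1)})
\]

Here p(W) is the language model and p(η|φ(n-1)) is the prior pdf of the transformation parameter.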
Some Decision Rules for ASR
Plug-in MAP rule:
The performance of the plug-in MAP decision rule depends on the choice of estimation approach (ML, MAP, discriminative training, etc.), the nature and size of the training data, and the degree of mismatch between training and testing conditions.
Relies on point estimation of the parameters (see the sketch below).
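For reference, a sketch of the plug-in MAP rule in the same notation, where the point estimate of the transformation parameter is assumed to be denoted η̂ (this reconstruction is not the slide's own equation):

\[
\hat{W}_n \;=\; \arg\max_{W}\; p(X_n \mid W, \hat{\eta})\, p(W)
\]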
Some Decision Rules for ASR(cont.)
Minimax rule: nonparametric compensation.
Minimizes the upper bound of the worst-case probability of classification error.
Assumes the unknown true parameter is a random variable uniformly distributed in a neighborhood region (a common form of the rule is sketched below).
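A commonly used form of the minimax rule maximizes the likelihood over a neighborhood region Ω(η0) of the pretrained parameters (a hedged sketch; the exact formulation on the slide is not recoverable):

\[
\hat{W}_n \;=\; \arg\max_{W}\; p(W)\, \max_{\eta \in \Omega(\eta_0)} p(X_n \mid W, \eta)
\]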
TBPC
Transform-based Bayesian Predictive Classification (TBPC) rule:
where the predictive likelihood is obtained by integrating over the transformation parameter (reconstructed below):
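A reconstruction of the TBPC rule and its predictive likelihood, following the description on the next slide (notation assumed):

\[
\hat{W}_n \;=\; \arg\max_{W}\; p(W)\, \tilde{p}(X_n \mid W, \varphi^{(n-1)}),
\qquad
\tilde{p}(X_n \mid W, \varphi^{(n-1)}) \;=\; \int p(X_n \mid W, \eta)\, p(\eta \mid \varphi^{(n-1)})\, d\eta
\]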
TBPC (cont.)
TBPC treats the transformation parameter as a random variable (not as a point estimate).
The average is taken both with respect to the sampling variation in the expected testing data and with respect to the uncertainty of the transformation parameter described by its prior pdf p(η|φ(n-1)).
TBPC can be applied in both supervised and unsupervised learning environments.
TBPC (cont.)
Transformation-based adaptation: for a given HMM with L states and K mixtures, Λ = {λi} = {ωik, μik, rik}, i = 1..L, k = 1..K (mixture weights, mean vectors, and precision matrices), the estimated transformation function Gη(n)(·) for the given testing utterance n is defined as:
where c is the index of the transformation cluster (hierarchical transformation).
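The definition of Gη(n)(·) was an equation image; a minimal sketch, assuming the cluster-dependent bias (mean-shift) form commonly used in transformation-based adaptation, is:

\[
G_{\eta}^{(n)}(\mu_{ik}) \;=\; \mu_{ik} + \eta_{c}^{(n)}, \qquad c = c(i,k),
\]

where c(i,k) denotes the transformation cluster to which mixture (i,k) is assigned.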
Implementation (Approach I): considering the missing data problem, we use the Viterbi TBPC for the likelihood (a sketch follows):
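A sketch of the Viterbi-approximated predictive likelihood, replacing the sum over all state and mixture sequences (s, l) with the single best path (notation assumed):

\[
\tilde{p}(X_n \mid W, \varphi^{(n-1)}) \;\approx\; \max_{s,\,l}\; \int p(X_n, s, l \mid W, \eta)\, p(\eta \mid \varphi^{(n-1)})\, d\eta
\]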
A frame-synchronous Viterbi Bayesian search algorithm can be utilized to reduce the memory space and computational load (Jiang, IEEE SAP 1999).
TBPC (cont.)
Implementation (Approach I, cont.):
In Jiang (IEEE SAP 1999), only the uncertainty of the mean vectors of CDHMMs with diagonal covariance matrices is considered, and the means are assumed to be uniformly distributed in a neighborhood of the pretrained means (no online adaptation).
TBPC (cont.)
Implementation (Approach II): the Bayesian Predictive density based Model Compensation (BP-MC) form of the K-mixture state observation pdf is:
where f(xt(n)|λik) is the Bayesian predictive density, defined below (both equations are reconstructed after this slide):
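The two equations were images on the slide; a hedged reconstruction consistent with the BPLM derivation below is:

\[
\tilde{p}(x_t^{(n)} \mid \lambda_i) \;=\; \sum_{k=1}^{K} \omega_{ik}\, f(x_t^{(n)} \mid \lambda_{ik}),
\qquad
f(x_t^{(n)} \mid \lambda_{ik}) \;=\; \int p(x_t^{(n)} \mid \lambda_{ik}, \eta_c)\, p(\eta_c \mid \varphi_c^{(n-1)})\, d\eta_c
\]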
TBPC (cont.)
Implementation (Approach II, cont.): the choice of the prior pdf.
In Chien (ROCLING 2000), a multivariate Gaussian pdf is adopted as the prior, since it is the conjugate prior for the Gaussian likelihood.
TBPC (cont.)
Derivation of the BPLM
Since p(xt(n)|λik, ηc) and p(ηc|φc(n-1)) are both Gaussian, we can derive f(xt(n)|λik) in closed form (assuming both the prior precision κc and rik are diagonal precision matrices):
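Under these Gaussian assumptions and a bias-type transformation, the integral has a closed form: the predictive density is again Gaussian, with the prior mean added to the mixture mean and the covariances added (a sketch; the exact hyperparameterization on the slide is not recoverable):

\[
f(x_t^{(n)} \mid \lambda_{ik}) \;=\; \mathcal{N}\!\Big(x_t^{(n)};\; \mu_{ik} + \mu_{\eta_c}^{(n-1)},\; r_{ik}^{-1} + \big(\kappa_c^{(n-1)}\big)^{-1}\Big),
\]

where μηc(n-1) and κc(n-1) are the mean and diagonal precision of the prior p(ηc|φc(n-1)).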
Viterbi approach:
(sn*, ln*) denote the most likely state and mixture sequences corresponding to Xn, respectively.
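A sketch of the resulting BPLM, accumulated frame by frame along this best path (log-domain accumulation assumed):

\[
\mathrm{BPLM}(X_n \mid W) \;\approx\; \sum_{t} \log f\big(x_t^{(n)} \mid \lambda_{s_t^{*} l_t^{*}}\big)
\]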
Online Prior Evolution
The parameter statistics of the c-th cluster are:
Online Prior Evolution (cont.)
Where
From the above derivation, we can adapt (learn) φc(n) from φc(n-1) online (see the sketch below).
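Since the Gaussian prior is conjugate to the Gaussian likelihood, OPE amounts to a recursive Bayes update of the hyperparameters after each utterance (a sketch; the exact update formulas on the slide are not recoverable):

\[
p(\eta_c \mid \varphi_c^{(n)}) \;\propto\; p(X_n \mid \eta_c)\, p(\eta_c \mid \varphi_c^{(n-1)}),
\]

so the statistics of Xn are folded into φc(n-1) to give φc(n), which then serves as the prior for the next utterance.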
Online Prior Evolution (cont.)
We can estimate the initial hyperparameters φc(0) from the previously given training data.
Online Prior Evolution (cont.)
Experiments
Training and testing data set I (Mic1, clean):
70 males and 70 females, each recording 10 continuous Mandarin digit sentences. The utterances of 50 males and 50 females are used for training; those of the remaining 20 males and 20 females are used for testing.
Training and testing data set II (Mic2, noisy):
2 males + 2 females in a Toyota Corolla 1.8; 3 males + 3 females in a Nissan Sentra 1.6. Each speaker individually records 10 sentences at idle speed, 20 sentences at 50 km/h, and 30 sentences at 90 km/h. Five sentences per speaker are arbitrarily chosen for training, the others for testing.
Experiments (cont.)
Signal-to-noise ratio, SNR (dB):

              Sentra    Corolla    Average
0 km/h          5.63     10.3        7.96
50 km/h        -6.53      0.34      -3.1
90 km/h       -10.14     -3.77      -6.96
clean                               25.1
Experiments (cont.)
Recognizer structure
Features: 12th-order LPC-derived cepstrum and delta-cepstrum, plus log energy and delta log energy.
HMM models: 7 states and 4 mixtures for each digit model, plus 3 different single-state background noise models.
Experiments (cont.)
Baseline results:
Test sentences     Digit error rate, DER (%)
Clean              10.6
0 km/h             25.61
50 km/h            54.97
90 km/h            62.33
Experiments (cont.)
Supervised DER as a function of the amount of training data (plot on the original slide).
Experiments (cont.)
Unsupervised TBPC-OPE DER (%); numbers in parentheses are the % improvement over the baseline.
In the 2-cluster case, the 10 digit models form one cluster and the 3 background noise models form the other.

            1 cluster        2 clusters
0 km/h     14.32 (44.0)     12.53 (51.0)
50 km/h    39.94 (27.3)     36.24 (34.0)
90 km/h    51.65 (17.1)     46.32 (25.6)
Experiments (cont.)
Unsupervised performance comparison of different BPC approaches, DER (%) with % improvement in parentheses:

            Baseline   Jiang ('99 IEEE SAP)   Surendran (1998 ICASSP)   TBPC-OPE (ROCLING 2000)
Clean       10.6       8.51 (19.2)            8.4 (20.7)                7.47 (29.5)
0 km/h      25.6       18.6 (27.3)            15.43 (39.7)              12.53 (51.0)
50 km/h     55         49.83 (9.4)            38.38 (30.2)              36.24 (34.1)
90 km/h     62.3       60.25 (3.3)            50.91 (18.3)              46.32 (25.6)
Discussions
Jiang's results are the worst among all because of the fixed prior distribution.
Surendran's results are worse than TBPC-OPE's because the adaptation of the prior pdf relies only on the current input utterance rather than on the accumulated ones.
In the BP-MC approach, we can also adjust the mixture weights (Dirichlet distribution) and variances (Wishart distribution) of the HMM model together with the means.