Dual-Microphone Speech Enhancement with Robustness to … · 2010. 7. 13. · Shinya Matsui, Yoji Ishikawa Information Technology Laboratory Asahi Kasei Corporation, Atsugi AXT Main

Dual-Microphone Speech Enhancement with Robustness to Variations in Microphone Gain Characteristics

Shinya Matsui, Yoji Ishikawa

Information Technology Laboratory Asahi Kasei Corporation,

Atsugi AXT Main Tower 22F, Okata 3050, Atsugi, Kanagawa, 243-0021 JAPAN Email: {matsui.sk, ishikawa.yf}@om.asahi-kasei.co.jp

Abstract—This paper presents a new dual microphone noise suppression method with robustness against variations in microphone gain characteristics. The method has a BeamFormer (BF) and a 3-stage gain calculation block consisting of “Separation Gain”, “Post Gain” and “Relaxation Gain” in the frequency domain. Each gain value has its own particular role, and an appropriate gain is calculated based on these gain values. The proposed method was applied to a dual microphone headset. The performance was shown by subjective test and by objective test using PESQ and ΔSNR score.

Index Terms—Array signal processing, Robustness for microphone gain, Post-Filter

I. INTRODUCTION

Noise reduction in Bluetooth headsets has become crucial for transmitting clear speech in noisy environments. In recent years, multi-channel algorithms have become indispensable for suppressing non-stationary noises such as music, babble and burst noise, which cannot be suppressed with a single microphone. Previously proposed methods include Superdirective BeamFormers (BF) [1], blind source separation (BSS) based on Independent Component Analysis (ICA) [2][3] and multi-channel post-filters [4][5]. However, Superdirective BFs perform poorly when only a few microphones are used, and BSS is susceptible to reverberation and has a high computational cost for embedded systems. In addition, many methods assume plane waves and therefore do not consider the distance decay of energy. Because the energy difference between the microphones is especially large in near-field scenarios, such methods cannot achieve the desired performance.

On the other hand, some techniques that exploit the distance decay effect have been proposed. However, it is difficult to achieve sufficient performance in a small headset, because the headset restricts both the distance between the microphones and the mouth and the distance between the microphones. As a result, the energy difference between the microphones is only about 3 to 4 dB for the target voice. Furthermore, for almost all microphone array techniques, differences in microphone characteristics are a serious problem. In general, the lower the cost, the larger the variation from microphone to microphone; frequency characteristics can vary by about 3 dB in each frequency bin.

To solve these problems, we proposed the Butterfly Subtraction Array (BSA) [6], a sound source separation technique. The BSA is robust to variations in microphone gain and can therefore be applied to near-field sound. However, it suffers from wide directivity and from musical noise in diffuse noise environments.

The purpose of this study is to solve the problems of the BSA described above and to propose a useful algorithm for headsets. First, a “phase rotator” is incorporated in the BSA to obtain narrower directional characteristics. Second, a two-stage gain calculation block, consisting of “Post Gain” and “Relaxation Gain”, is placed after the BSA. “Post Gain” suppresses the residual noise of the BSA by Spectrum Subtraction (SS). “Relaxation Gain” minimizes sound distortion by using a human auditory model.

Finally, a subjective test and objective tests (PESQ and ΔSNR scores) are performed to demonstrate the usefulness of the proposed method. The results show very high speech quality and noise reduction performance.

II. BUTTERFLY SUBTRACTION ARRAY

Fig. 1 is the block diagram of the method proposed in this paper, and the dotted line region is the block diagram of the BSA. With the input signals in the frequency domain defined as $\mathbf{X}(\omega) = [x_1(\omega), x_2(\omega)]^T$, the output signal of the Beam Former $X_{BF}(\omega)$ is calculated as:

$$X_{BF}(\omega) = \mathbf{W}_{BF}^{H}(\omega)\,\mathbf{X}(\omega) \qquad (1)$$

where $\mathbf{A}^H$ denotes the conjugate transpose of matrix $\mathbf{A}$. $\mathbf{W}_{BF}(\omega)$ is the Beam Former weight matrix in each frequency bin $\omega$, and is derived from the plane wave assumption. If one Beam Former weight is defined as $\mathbf{W}_{BF1} = [w_1, w_2]^T$ and the other Beam Former weight is defined as $\mathbf{W}_{BF2}$, the BSA uses the following relationship:

$$\mathbf{W}_{BF2} = \mathbf{W}_{BF1}^{*} \qquad (2)$$

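In this case, the Power Spectrum Density (PSD) of the BSA output, which is intended to capture sound sources in the range $-90^\circ$ to $90^\circ$, can be calculated as follows:

$$|X_{BSA}(\omega)|^2 = |\mathbf{W}_{BF1}^{H}(\omega)\mathbf{X}(\omega)|^2 - |\mathbf{W}_{BF2}^{H}(\omega)\mathbf{X}(\omega)|^2 = 2\,\mathrm{Re}[w_1^{*} w_2\, x_1 x_2^{*}] - 2\,\mathrm{Re}[w_1 w_2^{*}\, x_1 x_2^{*}] \qquad (3)$$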

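If the gain of Mic.1 becomes $\alpha$ times as large, so that its input is $\alpha\,x_1(\omega)$, the PSD of the BSA is calculated as follows:

$$|X_{BSA}(\omega)|^2 = 2\alpha\,\{\mathrm{Re}[w_1^{*} w_2\, x_1 x_2^{*}] - \mathrm{Re}[w_1 w_2^{*}\, x_1 x_2^{*}]\} \qquad (4)$$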

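In (4), $\alpha$ is factored out, so variations in microphone gain characteristics do not influence the separation performance. The output signal of the BSA in the frequency domain is calculated as follows:

$$X_{BSA}(\omega) = G_{BSA}(\omega)\,X_{BF1}(\omega) \qquad (5)$$

$$G_{BSA}(\omega) = \frac{\max\left(|\mathbf{W}_{BF1}^{H}(\omega)\mathbf{X}(\omega)|^2 - |\mathbf{W}_{BF2}^{H}(\omega)\mathbf{X}(\omega)|^2,\ 0\right)}{|\mathbf{W}_{BF1}^{H}(\omega)\mathbf{X}(\omega)|^2} \qquad (6)$$

where $G_{BSA}(\omega)$ is the “Separation Gain” derived from the BSA.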
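The gain robustness argued above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that the second weight vector is the complex conjugate of the first, uses a hypothetical delay-and-subtract weight pair ($w_1 = 1$, $w_2 = -e^{j\omega d/c}$) with an assumed sound velocity of 340 m/s, and computes the Separation Gain as a ratio of powers. The function names and the random test spectra are illustrative only.

```python
import numpy as np

def bf_outputs(x1, x2, w1, w2):
    """Return (W_BF1^H X, W_BF2^H X), assuming W_BF2 = conj(W_BF1)."""
    bf1 = np.conj(w1) * x1 + np.conj(w2) * x2
    bf2 = w1 * x1 + w2 * x2
    return bf1, bf2

def psd_difference(x1, x2, w1, w2):
    """|W_BF1^H X|^2 - |W_BF2^H X|^2 for every frequency bin."""
    bf1, bf2 = bf_outputs(x1, x2, w1, w2)
    return np.abs(bf1) ** 2 - np.abs(bf2) ** 2

def separation_gain(x1, x2, w1, w2, eps=1e-12):
    """Half-wave rectified PSD difference divided by the BF1 output power."""
    bf1, _ = bf_outputs(x1, x2, w1, w2)
    diff = psd_difference(x1, x2, w1, w2)
    return np.maximum(diff, 0.0) / (np.abs(bf1) ** 2 + eps)

# Hypothetical setup: 3 cm spacing, assumed c = 340 m/s, 14 bins below 4 kHz.
d, c = 0.03, 340.0
omega = 2.0 * np.pi * np.linspace(250.0, 3500.0, 14)
w1 = np.ones_like(omega, dtype=complex)
w2 = -np.exp(1j * omega * d / c)

# Arbitrary complex test spectra standing in for the two microphone inputs.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(omega.size) + 1j * rng.standard_normal(omega.size)
x2 = rng.standard_normal(omega.size) + 1j * rng.standard_normal(omega.size)

alpha = 10.0 ** (3.0 / 20.0)  # Mic.1 gain error of +3 dB
diff_ref = psd_difference(x1, x2, w1, w2)
diff_scaled = psd_difference(alpha * x1, x2, w1, w2)

# The PSD difference is scaled by alpha, so its sign pattern is unchanged.
scales_linearly = np.allclose(diff_scaled, alpha * diff_ref)
pass_ref = separation_gain(x1, x2, w1, w2) > 0
pass_scaled = separation_gain(alpha * x1, x2, w1, w2) > 0
same_pass_pattern = np.array_equal(pass_ref, pass_scaled)
print(scales_linearly, same_pass_pattern)
```

Under these assumptions the gain error only rescales the PSD difference, so the set of bins that the half-wave rectification passes or blocks does not change with the microphone gain.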

III. IMPROVED BUTTERFLY SUBTRACTION ARRAY

In the case of a headset, the relative positions of the microphones and the mouth (the target source) do not change greatly, so the target source can be assumed to arrive from around 0 degrees. Hence the SNR can be improved by sharpening the directional characteristic of the BSA with an added phase rotator block and equalizer block. We now modify (1) to include the phase rotator $\mathbf{D}(\omega)$, which controls the phase lag between the microphones and adjusts the directional characteristic, as follows:

$$X_{BF}(\omega) = \mathbf{W}_{BF}^{H}(\omega)\,\mathbf{D}(\omega)\,\mathbf{X}(\omega) \qquad (7)$$

where

elsed

cifcdcjdiagcdjdiag 1sin

1,exp1,sinexp

)(1

0D

$\theta_0$ ($0^\circ \le \theta_0 \le 90^\circ$) is the phase rotation angle and can be used to change the directional characteristic of the BSA, $d$ is the distance between the microphones, and $c$ is the sound velocity. If some frequency bins do not fulfill the spatial sampling theorem, the BSA will not perform properly for them, so $\mathbf{D}(\omega)$ is calculated differently depending on whether or not the spatial sampling theorem is fulfilled.

An equalizer is applied to the BSA output to flatten the frequency characteristic of the target sound. Since the relative position between Mic.1 and the mouth is almost fixed, the compensation is done for the presumed position of the target source. The sound propagation model $\mathbf{X}_m(\omega) = [X_{m1}(\omega), X_{m2}(\omega)]^T$ reaching each microphone from the target sound position can be described as follows:

cudrjuX

X

mm

m

1exp)(

1)(1

2

1

where, 21cos21 mmm rru

$r_m$ is the distance from Mic.1 to the target source, and $\theta_m$ is the direction of the target source. The separation gain under this propagation model, $G_{BSA}(\mathbf{X}_m(\omega))$, is obtained by substituting each component of (9) into (7); the equalizer that compensates for this model can then be calculated as follows:

$$E_m(\omega) = \frac{1}{G_{BSA}(\mathbf{X}_m(\omega))} \qquad (10)$$

Therefore, the output of the Improved BSA is:

$$X_{BSA}(\omega) = E_m(\omega)\,G_{BSA}(\omega)\,X_{BF1}(\omega) \qquad (11)$$

To evaluate the above improvement, we conducted a simulation analysis. The distance between the near-field source and Mic.1 was 0.06 [m], the distance between the far-field source and Mic.1 was 1.5 [m], and d was 0.03 [m]. Fig. 2 shows the simulation result as the log-scale PSD of the BSA output. The BSA and the Improved BSA achieve similar separation performance for near-field sounds. Moreover, the Improved BSA has narrower directional characteristics than the conventional BSA.
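As a quick check of the spatial sampling condition mentioned in this section, the sketch below lists the FFT bins for which the microphone spacing is at most half a wavelength, that is $f \le c/(2d)$. Only $d = 0.03$ m comes from the simulation setup; the sound velocity of 340 m/s, the FFT size of 256 and the helper name are assumptions for illustration, and the 8 kHz sampling rate is borrowed from the experimental section.

```python
import numpy as np

def bins_fulfilling_spatial_sampling(d, c, fs, n_fft):
    """Return bin frequencies, a mask of bins with f <= c / (2 d), and that limit."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    f_max = c / (2.0 * d)  # highest frequency with spacing <= half a wavelength
    return freqs, freqs <= f_max, f_max

freqs, ok, f_max = bins_fulfilling_spatial_sampling(d=0.03, c=340.0, fs=8000.0, n_fft=256)
print(f"limit: {f_max:.0f} Hz, {ok.sum()} of {ok.size} bins fulfill the condition")

# The same spacing at a hypothetical 16 kHz sampling rate, for comparison.
_, ok_16k, _ = bins_fulfilling_spatial_sampling(d=0.03, c=340.0, fs=16000.0, n_fft=256)
```

With these assumed values the limit lies above the 4 kHz Nyquist frequency, so every bin fulfills the condition; at the hypothetical 16 kHz rate the upper bins would not, and those are the bins for which the alternative form of $\mathbf{D}(\omega)$ would matter.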

IV. POST GAIN CALCULATION

Post Gain aims to further suppress the residual noise in the BSA output. In diffuse noise environments, musical noise can be produced at the BSA output because the separation gain $G_{BSA}(\omega)$ in each frequency bin tends to vary greatly from frame to frame. A known technique for suppressing musical noise is to add back part of the original sound. In this study, the BF1 output is added to the BSA output to decrease the musical noise, and spectrum subtraction is then applied to reduce the residual noise of the BSA. An adaptive filter and a noise equalizer block are also added to calculate the noise estimate used by the spectrum subtraction.

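A. Musical Noise Suppression

The output of BF1 is mixed with the output of the BSA:

$$X_s(\omega) = \alpha_s X_{BSA}(\omega) + (1-\alpha_s)\,X_{BF1}(\omega) = X_{BF1}(\omega)\,\{\alpha_s E_m(\omega)\,G_{BSA}(\omega) + (1-\alpha_s)\} = X_{BF1}(\omega)\,G_s(\omega) \qquad (12)$$

where $0 \le \alpha_s \le 1$ is the weighting factor.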

B. Noise Spectrum Estimation

Let $\mathbf{s}(t)$ and $\mathbf{n}_j(t)$ denote the target source and the uncorrelated additive noises in the time domain, respectively. The transfer function from the target source to Mic.1 is denoted as $\mathbf{h}_{s1}$, that from the target source to Mic.2 as $\mathbf{h}_{s2}$, that from the $j$-th noise to Mic.1 as $\mathbf{h}_{n_j 1}$, and that from the $j$-th noise to Mic.2 as $\mathbf{h}_{n_j 2}$. In this case, the signals observed at Mic.1 and Mic.2 are as follows:

$$x_1(t) = \mathbf{h}_{s1}\,\mathbf{s}(t) + \sum_{j=1}^{K} \mathbf{h}_{n_j 1}\,\mathbf{n}_j(t) \qquad (13)$$

$$x_2(t) = \mathbf{h}_{s2}\,\mathbf{s}(t) + \sum_{j=1}^{K} \mathbf{h}_{n_j 2}\,\mathbf{n}_j(t) \qquad (14)$$

The output of this block is calculated as follows:

$$x_{ABM}(t) = x_2(t) - \mathbf{H}^{T}(t)\,\mathbf{x}_1(t) \qquad (15)$$

where $\mathbf{H}(t)$ is the adaptive filter transfer function. For instance, the adaptive filter coefficients can be calculated by NLMS and updated in frames where the VAD detects target speech. Even if the VAD is imperfect, the performance does not degrade much. Assuming the target source and the noises are uncorrelated, the output of the adaptive filter $x_{ABM}(t)$ is calculated as follows:

$$x_{ABM}(t) = \sum_{j=1}^{K} \mathbf{h}_{n_j 2}\,\mathbf{n}_j(t) - \mathbf{H}^{T}(t) \sum_{j=1}^{K} \mathbf{h}_{n_j 1}\,\mathbf{n}_j(t) + \left(\mathbf{h}_{s2} - \mathbf{H}^{T}(t)\,\mathbf{h}_{s1}\right)\mathbf{s}(t) \qquad (16)$$

If the adaptive filter estimates the transfer function that cancels the target source, $\mathbf{H}(t) = \mathbf{h}_{s2}\,\mathbf{h}_{s1}^{-1}$, then the residual signal becomes:

$$x_{ABM}(t) = \sum_{j=1}^{K} \mathbf{h}_{n_j 2}\,\mathbf{n}_j(t) - \left(\mathbf{h}_{s2}\,\mathbf{h}_{s1}^{-1}\right)^{T} \sum_{j=1}^{K} \mathbf{h}_{n_j 1}\,\mathbf{n}_j(t) \qquad (17)$$

Figure 1. Structure of the proposed method

Therefore, this block can partially detect the noise components while excluding the target source. This blocking matrix is robust against differences in microphone gain because a fixed filter is not used. However, the target source cannot be cancelled completely because of causality.
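The adaptive blocking step described in this subsection can be sketched in the time domain as follows. This is an illustrative NLMS filter under stated assumptions, not the authors' implementation: the filter length, step size, white-noise stand-in for the target speech, the 2-sample delay with 0.7 attenuation between the microphones, and the always-active VAD flag are all hypothetical.

```python
import numpy as np

def nlms_blocking(x1, x2, speech_active, taps=8, mu=0.5, eps=1e-8):
    """Return x_ABM(t) = x2(t) - h^T x1_vec(t) and the final coefficients h.

    The coefficients are adapted by NLMS only where speech_active is True.
    """
    h = np.zeros(taps)
    x_abm = np.zeros(len(x2))
    padded = np.concatenate([np.zeros(taps - 1), x1])
    for t in range(len(x2)):
        x_vec = padded[t:t + taps][::-1]  # x1(t), x1(t-1), ..., newest first
        x_abm[t] = x2[t] - h @ x_vec
        if speech_active[t]:
            h += mu * x_abm[t] * x_vec / (x_vec @ x_vec + eps)
    return x_abm, h

# Hypothetical target-only scene: Mic.2 receives the Mic.1 signal
# delayed by 2 samples and attenuated to 0.7.
rng = np.random.default_rng(1)
s = rng.standard_normal(4000)
x1 = s
x2 = np.concatenate([np.zeros(2), 0.7 * s[:-2]])
vad = np.ones(len(s), dtype=bool)

res, h = nlms_blocking(x1, x2, vad)
# Same scene with a +3 dB (about 1.41x) gain error on Mic.1.
res_gain, h_gain = nlms_blocking(1.41 * x1, x2, vad)

leak = np.mean(res[-500:] ** 2) / np.mean(x2[-500:] ** 2)
leak_gain = np.mean(res_gain[-500:] ** 2) / np.mean(x2[-500:] ** 2)
print(f"target leakage: {leak:.2e} (nominal), {leak_gain:.2e} (Mic.1 +3 dB)")
```

In this toy scene the filter absorbs the gain error into its coefficients, which illustrates why an adaptive blocking filter is less sensitive to microphone gain mismatch than a fixed one. Real speech, reverberation and a non-causal relation between the microphones would leave more residual target signal than this idealized case.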

C. Noise Spectrum Adjustment

The noise spectrum estimated by the adaptive filter and the noise spectrum in the BSA output are similar in shape but differ in energy. Therefore, this block compensates the noise level of $X_{BSA}(\omega)$ to fit the noise level of $X_{ABM}(\omega)$. The periodograms $|X_{BSA}(\omega)|^2$ and $|X_{ABM}(\omega)|^2$ fluctuate rapidly from frame to frame, so it is preferable to use a first-order recursively smoothed periodogram, updated in frames where the VAD detects noise. The smoothed data are defined as $\overline{|X_{BSA}(\omega)|^2}$ and $\overline{|X_{ABM}(\omega)|^2}$. The noise equalizer value in each frequency bin, $H_{EQ}(\omega)$, is derived as follows:

2

2

)(

)()(

ABM

SEQ

X

XH

The estimated noise included in BSA is calculated as follows:

　)()()( ABMEQd XH
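The smoothing and level matching described in this subsection can be sketched as follows. This is a hedged illustration only: it assumes the equalizer is the ratio of two recursively smoothed periodograms, and the smoothing constant `beta`, the function names, the toy spectra and the VAD pattern are hypothetical.

```python
import numpy as np

def update_smoothed_psd(prev_psd, spectrum, noise_only, beta=0.9):
    """First-order recursive periodogram, updated only in noise-only frames."""
    if not noise_only:
        return prev_psd
    return beta * prev_psd + (1.0 - beta) * np.abs(spectrum) ** 2

def noise_equalizer(psd_reference, psd_abm, eps=1e-12):
    """Per-bin ratio that maps the ABM noise level onto the reference level."""
    return psd_reference / (psd_abm + eps)

n_bins = 4
psd_ref = np.zeros(n_bins)
psd_abm = np.zeros(n_bins)
for frame in range(60):
    noise_only = frame % 3 != 0          # hypothetical VAD decision
    scale = 1.0 if noise_only else 10.0  # speech frames are much louder
    x_ref = scale * 2.0 * np.ones(n_bins, dtype=complex)  # reference-path spectrum
    x_abm = scale * 1.0 * np.ones(n_bins, dtype=complex)  # ABM-path spectrum
    psd_ref = update_smoothed_psd(psd_ref, x_ref, noise_only)
    psd_abm = update_smoothed_psd(psd_abm, x_abm, noise_only)

h_eq = noise_equalizer(psd_ref, psd_abm)
print(h_eq)
```

Because the louder speech frames are skipped by the update, the ratio reflects only the noise levels of the two paths; in this toy case the reference noise has twice the amplitude of the ABM noise, so the power ratio settles near 4 in every bin.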

D. Speech Spectral Amplitude Estimator

In the post filter, the log Minimum Mean Square Error (logMMSE) estimator [7] is typically used. However, non-stationary noise such as music, babble and burst noise does not meet the assumption of the logMMSE estimator, which assumes that the noise follows a normal distribution. Hence our method uses a Speech Spectral Amplitude Estimator that suppresses non-stationary noise better than logMMSE. The spectral gain is calculated from the estimated a priori and a posteriori SNRs. The a priori SNR is given by

0,1)(

)(max)( 2

2

d

SX

The a posteriori SNR is derived from the Decision-Directed Approach as follows:

)(1)(

)1,(),( 2

2

d

s mXm

Where 0 <

VI. EXPERIMENTAL RESULT

To show the usefulness of the proposed method, we conducted verification experiments in real environments. The headset was mounted on a HATS (head and torso simulator); the distance from the mouth to Mic.1 was 6 [cm], d = 3 [cm], and the sampling rate was 8 [kHz]. Speech quality and noise suppression level were measured under noisy environments (5.1ch surround music and babble noise). The PESQ score [9] and the SNR improvement (ΔSNR) were measured for evaluation. Table I shows the PESQ value and ΔSNR of each algorithm. In the table, 1chNS is a stationary noise suppressor using logMMSE; BF+1chNS combines a 2ch BF with 1chNS; BSA is the conventional BSA; and BSA+logMMSE+Relax uses logMMSE instead of our proposed Post Gain estimator. The table confirms that our method improves both PESQ and ΔSNR, and that its improvements are the largest among all evaluated algorithms.

The robustness against variations in microphone gain characteristics was evaluated as well. In the table, Proposed (Mic.2: ±3dB) denotes the results of the proposed method when the Mic.2 signal is made 3 dB larger or smaller than its original input level. With these offsets, PESQ changes by at most 0.05 and ΔSNR by at most 0.88 relative to the unmodified case, which shows that the proposed method is robust against variations in microphone gain characteristics.

Figure 3 shows the spectrograms for a speech signal corrupted with babble noise at a 3 dB SNR. Even though the input signal involves non-stationary interference, our method is able to remove most of the interference and the target spectrum is better preserved compared with the other methods. The results of our experiments clearly demonstrate the effectiveness of the proposed method.

VII. CONCLUDING REMARKS

In this paper, we proposed a new microphone array method with robustness against variations in microphone gain characteristics, and applied it to a headset as one application example. The performance was evaluated in adverse environments. The results show that the proposed method provides comfortable sound quality, good noise suppression performance and robustness against variations in microphone gain characteristics.

REFERENCES

[1] M. Brandstein and D. Ward, Eds., Microphone Arrays, Berlin: Springer-Verlag, 2001.

[2] H. Sawada, S. Araki, and S. Makino, “Frequency-domain blind source separation,” in Blind Speech Separation, S. Makino, Te-Won Lee, and H. Sawada, Eds., Springer, pp. 47-78, Sept. 2007.

[3] Y. Takahashi, K. Osako, H. Saruwatari, K. Shikano, “Blind Source Extraction For Hands-Free Speech Recognition based on Wiener Filtering and ICA-based Noise Estimation,” Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), pp. 164-167, May 2008.

[4] R. Zelinski, “A Microphone Array with Adaptive Post-filtering for Noise Reduction in Reverberant Rooms,” in Proc. of ICASSP’88, New York, NY, USA, Apr. 1988, pp. 2578-2581.

[5] I. Cohen, “Analysis of two-channel generalized sidelobe canceller (GSC) with post-filtering,” Tech. Rep., EE Pub. 1332, Technion-IIT, Haifa, Israel, July 2002.

[6] S. Matsui, K. Nagahama, M. Shozakai, “Sound Source Separation with Robustness to Variation of Microphone Gain Characteristics,” Biennial on DSP for in-Vehicle and Mobile Systems, June 2007.

[7] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443-445, 1985.

[8] S. Gustafsson, P. Jax, and P. Vary, “A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics,” IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98, vol. 1, pp. 397-400, 12-15 May 1998.

[9] “ITU-T Recommendation P.862. Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Geneva, Switzerland, 2001.

TABLE I. PESQ VALUE AND ΔSNR

                             Music noise       Babble noise
Algorithm                    PESQ    ΔSNR      PESQ    ΔSNR
1chNS                        2.49    4.92      2.67    5.04
BF+1chNS                     2.5     5.58      2.77    6.75
BSA                          2.52    7.35      2.73    6.91
BSA+logMMSE+Relax            2.73    9.79      3.07    10.72
Proposed                     2.76    10.79     3.08    11.03
Proposed ( Mic.2: -3dB )     2.73    10.91     3.03    10.70
Proposed ( Mic.2: +3dB )     2.79    9.91      3.09    10.79

Figure 3. Example of time-frequency analysis with input SNR 3 dB. Panels: clean speech signal; microphone signal; 1ch Noise Suppressor; Beamformer + 1ch Noise Suppressor; Butterfly Subtraction Array; BSA + logMMSE + Relax; Proposed method. Horizontal axis: Time [seconds].
