Speech processing research paper 14

International Conference on Control, Automation and Systems 2007, Oct. 17-20, 2007 in COEX, Seoul, Korea

Wavelet-based Underdetermined Blind Source Separation of Speech Mixtures

Takehiro Hamada¹, Kazushi Nakano¹ and Akihiro Ichijo¹

¹The University of Electro-Communications, 1-5-1 Chofu-ga-oka, Chofu, Tokyo 182-8585, Japan (Tel: +81-42-443-5190; Email: [email protected])

Abstract: Blind source separation (BSS) is an approach for estimating original source signals solely from the signals observed through multiple sensors, without any prior knowledge of the mixing process. BSS is expected to be applied to a broad range of fields, such as preprocessing for speech recognition and analysis of bio-signals. In this paper, we focus on underdetermined BSS (i.e., BSS with more sources than sensors). Many estimation algorithms for underdetermined BSS are based on a sparse representation of the source signals, and the accuracy with which the source signals are decomposed depends strongly on the sparsity hypothesis. It is therefore effective to preprocess the observed signals with a linear transformation (a change of basis) that makes their representation sparser. We propose applying time-frequency analysis based on the RI-Spline wavelet transform as preprocessing of sound source signals. The effectiveness of the proposed approach is demonstrated through a computer simulation using real data.

Keywords: Blind source separation, RI-Spline wavelet, Sparse representation, Underdetermined system

1. INTRODUCTION

The underdetermined BSS problem (n > m) (see Fig. 1) cannot be solved by a standard ICA approach developed for n = m, since the mixture matrix has no inverse in the underdetermined case. Thus, a great deal of attention has been paid to sparse representation approaches in recent years. The sparsity property means that a signal sequence possesses at least one zero component. Bofill, et al. [2] have proposed a method for estimating a mixture matrix by using a potential function based on data concentration along the mixing directions in the scatter diagram of observed signals. Other estimation methods use the ratio of observed signals under an appropriate sparsity hypothesis [3-4]. Lewicki, et al. [5] as well as Bofill, et al. [2] have proposed a method for estimating source signals based on an $l_1$-norm solution under the assumption that the source signals follow a Laplacian distribution, which is sparse.

Fig. 1 (block diagram: unknown source signals, known observed signals, estimation, estimated source signals)

Since the accuracy of decomposition of source signals depends strongly on the sparsity hypothesis, it is effective to transform the observed signals into a sparser representation as preprocessing. As is well known, the power of sound signals is concentrated in specific frequency bands. We can therefore obtain a sparser representation by expanding the signals in the time-frequency (TF) domain. Yilmaz, et al. [3] have proposed a method for TF-BSS under the sparsity hypothesis, in which the Short-Time Fourier Transform (STFT) is applied. The STFT has a constant temporal resolution regardless of frequency.

In contrast, the Wavelet Transform (WT) allows the temporal resolution to change with frequency. We propose an approach to TF-BSS that uses the RI-Spline wavelet [6] for preprocessing. The RI-Spline wavelet has been developed as a new type of multiresolution analysis in the complex domain. Lastly, we demonstrate the effectiveness of the proposed approach through a computer simulation.

2. PROBLEM STATEMENT

2.1 Mixture Model

Consider the following mixture model in the time domain:

$$x_i(t) = \sum_{j=1}^{n} a_{ij} s_j(t), \quad i = 1, \dots, m, \quad t = 1, \dots, K \tag{1}$$

where $s_j(t)$ are source signals, $a_{ij}$ are mixing coefficients, and $x_i(t)$ are observed signals. Defining $x(t) = [x_1(t), \dots, x_m(t)]^T$ and $s(t) = [s_1(t), \dots, s_n(t)]^T$, (1) is written in the following simple form:

$$x(t) = A s(t), \quad t = 1, \dots, K \tag{2}$$

978-89-950038-6-2-98560107/$15 ©ICROS

Fig. 2 Left: signal sequence with high sparsity. Right: signal sequence with low sparsity. (Both panels plot v(t) against t.)

where $A = (a_{ij})$ is the mixture matrix. Our goal is to recover the source signals $\{s(t) \mid t = 1, \dots, K\}$ using only the observed signals $\{x(t) \mid t = 1, \dots, K\}$, up to the well-known permutation and scale indeterminacies of the BSS problem.
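As a minimal sketch of the mixture model (2), the following Python snippet mixes n = 3 toy source sequences into m = 2 observations. The helper name `mix` and the source samples are illustrative assumptions; the matrix values are those used later in the simulation section.

```python
def mix(A, S):
    """Return X = A S for an m-by-n matrix A and an n-by-K source array S."""
    m, n, K = len(A), len(S), len(S[0])
    return [[sum(A[i][j] * S[j][t] for j in range(n)) for t in range(K)]
            for i in range(m)]

# Mixture matrix with m = 2 rows (sensors) and n = 3 columns (sources).
A = [[0.3420, 0.6428, 0.7660],
     [0.9397, 0.7660, 0.6428]]

# Made-up toy sources: each row is one source s_j(t), t = 1, ..., K with K = 4.
S = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 3.0, 0.0, 0.0],
     [0.0, 0.0, -1.0, 0.0]]

X = mix(A, S)  # two observed sequences x_1(t), x_2(t)
```

Because only one source is active at each index in this toy example, every observed sample pair $(x_1(t), x_2(t))$ is a scaled copy of one column of $A$.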

2.2 Two-stage Approach

In the usual BSS problem (i.e., m = n), estimation of the mixture matrix and estimation of the source signals are in one-to-one correspondence. This is because the source signals are uniquely determined once a mixture matrix that has an inverse is known.

On the other hand, the underdetermined BSS problem has no such one-to-one correspondence. Once an estimate $\tilde{A}$ of the mixture matrix is obtained, the candidate solutions of (2) are all the elements of the following set:

$$\{s_0\} + \mathrm{Ker}(\tilde{A}) \tag{3}$$

where $s_0$ is an arbitrary solution of (2), and $\mathrm{Ker}(\tilde{A})$ is the kernel (null space) of $\tilde{A}$. If (2) had a unique solution, we would have $\mathrm{Ker}(\tilde{A}) = \{0\}$. This is not the case, however, since $\dim \mathrm{Ker}(\tilde{A}) = n - \mathrm{rank}(\tilde{A}) = n - m > 0$ when $\tilde{A}$ has full row rank. Thus, it is necessary to select a solution that is optimal in some sense from the set of solutions satisfying (2).
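The non-uniqueness expressed by (3) can be checked numerically. The sketch below assumes m = 2 and n = 3, in which case the kernel of a full-rank matrix is spanned by the cross product of its two rows; the helper names and the sample vector `s0` are illustrative assumptions.

```python
def cross(u, v):
    """Cross product of two 3-vectors; it is orthogonal to both u and v."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def matvec(A, s):
    """Matrix-vector product A s."""
    return [sum(a * b for a, b in zip(row, s)) for row in A]

A = [[0.3420, 0.6428, 0.7660],
     [0.9397, 0.7660, 0.6428]]

kernel = cross(A[0], A[1])      # spans Ker(A), since A has rank 2
s0 = [1.0, 0.0, 2.0]            # one particular source vector
s1 = [s + 5.0 * k for s, k in zip(s0, kernel)]  # another element of {s0} + Ker(A)

x0 = matvec(A, s0)
x1 = matvec(A, s1)
gap = max(abs(a - b) for a, b in zip(x0, x1))   # zero up to rounding error
```

Both `s0` and `s1` explain the same observation, which is why the second stage needs an additional criterion to pick one solution.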

The underdetermined BSS problem can be solved by the following two-stage approach [2]:

First stage (Estimation of Mixture Matrix) From the measurements $\{x(t) \mid t = 1, \dots, K\}$, obtain an estimate $\tilde{A}$ of the mixture matrix.

Second stage (Estimation of Source Signals) From the measurements $\{x(t) \mid t = 1, \dots, K\}$ and the estimate $\tilde{A}$ of the mixture matrix, obtain the estimates of the source signals.

3. TIME-FREQUENCY ANALYSIS

In this section, we consider an expansion of the observed signals in the TF domain as preprocessing for the two-stage approach. We define the sparsity, and show that a sparser representation can be obtained by use of the RI-Spline wavelet.

3.1 Sparsity of signals

As seen in the following subsections, the sparsity of source signals plays a key role in the two-stage approach. The sparsity is defined in terms of the number of zero components included in a signal sequence $v(t)$. That is,

$$\text{sparsity} = \#(\{t \mid v(t) = 0\}) \tag{4}$$

where $\#(\cdot)$ is the number of elements of the set. The sparsity gets higher as this number gets larger (see Fig. 2 left), and lower as it gets smaller (see Fig. 2 right). Intuitively, when signals with high sparsity are mixed, it is easy to decompose the mixture into the source signals, since there is little overlap between the source signals at each index.
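The zero count of (4) is straightforward to compute. In the sketch below, the function name `sparsity` and the tolerance parameter `tol` are assumptions added for illustration; `tol = 0` reproduces the exact-zero count of (4), while a small positive value may be more practical for measured signals.

```python
def sparsity(v, tol=0.0):
    """Number of indices t with |v(t)| <= tol (exact zeros when tol = 0)."""
    return sum(1 for value in v if abs(value) <= tol)

# Toy sequences in the spirit of Fig. 2 (the values are made up).
high = [0, 0, 5, 0, 0, 0, -3, 0]   # mostly zero: high sparsity
low = [1, -2, 5, 4, -1, 2, -3, 1]  # never zero: low sparsity

n_high = sparsity(high)
n_low = sparsity(low)
```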

3.2 Basis for sparse representation

For sparsity-based decomposition algorithms, it is important, as preprocessing, to change the basis so as to make the source signals sparser. In this subsection, we consider what kind of transformation is suitable for improving the sparsity.

Defining $s_i = [s_i(1), \dots, s_i(K)]^T$ as the time series of a source signal, we express $s_i$ in terms of a basis $\{b_l\}$:

$$s_i = \sum_{l=1}^{K} \hat{s}_i(l)\, b_l \tag{5}$$

Improving the sparsity through a change of basis means that the new coordinate vector $\hat{s}_i$ has more zero components than the original $s_i$. Since the power of $s_i$ and $\hat{s}_i$ is finite, having many zero components means, taking the complement of (4), that most of the power is concentrated on a small number of coordinates. In essence, concentration of power on a small number of basis vectors is sparsity. The spectrum of a sound is well known to be concentrated in frequency bands called formant frequencies. Thus, we can obtain a sparser representation by selecting a basis that extracts as much of the TF information as possible.
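The effect of a change of basis on sparsity can be illustrated with a toy example. The sketch below uses an orthonormal discrete Fourier basis as a stand-in (it is not the RI-Spline wavelet), and the sinusoidal test signal, the helper names and the tolerance are assumptions chosen for illustration.

```python
import cmath
import math

def dft(v):
    """Coordinates of v in the orthonormal discrete Fourier basis."""
    K = len(v)
    return [sum(v[t] * cmath.exp(-2j * math.pi * l * t / K) for t in range(K))
            / math.sqrt(K) for l in range(K)]

def count_zeros(v, tol=1e-9):
    """Number of components with magnitude at most tol."""
    return sum(1 for value in v if abs(value) <= tol)

K = 64
signal = [math.sin(2 * math.pi * 5 * t / K) for t in range(K)]  # 5 cycles
coeffs = dft(signal)

zeros_time = count_zeros(signal)  # the sinusoid vanishes only at t = 0 and t = 32
zeros_freq = count_zeros(coeffs)  # all power sits in bins 5 and K - 5
```

In the time domain almost every sample is non-zero, whereas in the new basis all but two coordinates vanish, so the same signal has a much higher sparsity in the sense of (4).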

3.3 RI-Spline wavelet

Yilmaz, et al. [3] use the STFT for TF analysis as preprocessing for the two-stage approach. The selection of the window size in the STFT is closely related to the sparsity of the signals. This is because, when the signal within the window is non-stationary, the frequency band containing the power changes within the window, so that the power spreads over a wider frequency band. Sound sources are globally non-stationary signals. In general, we can consider that the range over which the signal changes gets longer in low frequency bands and shorter in high frequency bands. For an efficient transformation, it is therefore desirable to change the window size according to the frequency band. However, the window size is constant in the STFT.

On the other hand, the window size is changeable in the WT, so that the temporal resolution is improved in high frequency bands and the frequency resolution is improved in low frequency bands. Thus, the WT can be expected to yield sparser signals. In this paper, we propose an estimation algorithm using TF analysis based on the WT. We apply a discrete complex wavelet satisfying the following conditions:

The inverse transform should be guaranteed by the discrete wavelet transform (DWT).

The wavelet should be complex so that phase information can be used in the estimation of the mixture matrix.

Fig. 3 Procedure of underdetermined BSS (wavelet transform, estimating mixing matrix, estimating source signals, inverse wavelet transform)

Here, we adopt the RI-Spline wavelet proposed by Z. Zhang, et al. [6], which has shift invariance. The mother wavelet $\psi$ of the RI-Spline wavelet consists of the 4th-order Spline wavelet $\psi_R$ in the real part and the 3rd-order Spline wavelet $\psi_I$ in the imaginary part. Similarly, the scaling function $\phi$ consists of the 4th-order Spline scaling function $\phi_R$ in the real part and the 3rd-order Spline scaling function $\phi_I$ in the imaginary part.

$$\psi = \psi_R + i\psi_I \tag{6}$$

$$\phi = \phi_R + i\phi_I \tag{7}$$

In the above equations, $i$ is the imaginary unit. Based on the wavelet and scaling functions, we generate $\psi_{j,k}$ and $\phi_{j,k}$ so that

$$\psi_{j,k}(t) = 2^{j/2}\,\psi(2^j t - k) \tag{8}$$

$$\phi_{j,k}(t) = 2^{j/2}\,\phi(2^j t - k) \tag{9}$$

Using these bases, we obtain the wavelet transform $\hat{s}_i$ of $s_i$:

$$s_i = \sum_{j=J}^{-1} \sum_{k} \hat{s}_i(j,k)\,\psi_{j,k} + \sum_{k} \hat{s}_i(J,k)\,\phi_{J,k} \tag{10}$$

Note that this transform is a linear mapping from $\mathbb{R}^K$ to $\mathbb{C}^K$, which retains the phase information because $\hat{s}_i \in \mathbb{C}^K$.

Next, we consider applying TF analysis to the estimation of the mixture matrix. Defining $X = [x(1), \dots, x(K)]$ and $S = [s(1), \dots, s(K)]$, we rewrite (2) as $X = AS$. Denoting by $G$ the matrix representation of the linear mapping based on the RI-Spline wavelet, and multiplying both sides by $G^T$ from the right, we have

$$XG^T = ASG^T \iff \hat{X} = A\hat{S} \tag{11}$$

From (11), we can see that the mixture matrix is invariant under the WT. Thus, we obtain the estimates of the source signals in the time domain by estimating them in the TF domain and taking their inverse wavelet transform (see Fig. 3).
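The invariance stated in (11) holds for any linear transform applied along the time axis, which the following sketch checks numerically. The 4-point Fourier matrix used for `G` is a hypothetical stand-in for the RI-Spline transform matrix, and the source values and helper names are made up for illustration.

```python
def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

A = [[0.3420, 0.6428, 0.7660],
     [0.9397, 0.7660, 0.6428]]
S = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 1.0],
     [4.0, 0.0, 0.0, 0.0]]          # n = 3 sources, K = 4 samples

# Hypothetical complex K-by-K transform standing in for the wavelet matrix G.
G = [[1, 1, 1, 1],
     [1, 1j, -1, -1j],
     [1, -1, 1, -1],
     [1, -1j, -1, 1j]]

X = matmul(A, S)
X_hat = matmul(X, transpose(G))     # transformed observations
S_hat = matmul(S, transpose(G))     # transformed sources
AS_hat = matmul(A, S_hat)

residual = max(abs(X_hat[i][k] - AS_hat[i][k])
               for i in range(2) for k in range(4))
```

The residual vanishes up to rounding error, so the same matrix $A$ relates the transformed observations to the transformed sources.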

4. TWO-STAGE APPROACH

In this section, we describe methods for estimating the mixture matrix and the source signals.

4.1 Algorithm for estimating mixing matrix

Yilmaz, et al. have proposed an estimation algorithm that uses the ratio of measurement signals based on a sparsity hypothesis [3]. They assume the strong sparsity hypothesis that, at every index, only one source signal has power. Napoletani, et al. have proposed a method that weakens the sparsity hypothesis of Yilmaz, et al. by taking into account the phase characteristics of the ratio of measurement signals [4]. Hereafter, we explain an algorithm for estimating the mixture matrix based on Napoletani, et al.

First, we assume the following sparsity of the source signals for estimating the mixture matrix.

Assumption 1 Denoting by $E_l$ the set of indices at which only the source signal $\hat{s}_l(j,k)$ has power, i.e., $E_l = \{(j,k) \mid \hat{s}_l(j,k) \neq 0,\ \hat{s}_i(j,k) = 0\ (i \neq l)\}$, we have

$$E_l \neq \emptyset, \quad \forall l \tag{12}$$

Calculating the ratio $Q_i(j,k)$ of measurement signals, we have

$$Q_i(j,k) = \frac{\hat{x}_i(j,k)}{\hat{x}_1(j,k)} \tag{13}$$

$$= \frac{a_{il}\,\hat{s}_l}{a_{1l}\,\hat{s}_l} = \frac{a_{il}}{a_{1l}} \in \mathbb{R} \tag{14}$$

for an index $(j,k) \in E_l$. Then, the phase difference between the denominator and the numerator is zero, i.e., $Q_i(j,k)$ becomes a real number. Therefore, $Q_i(j,k) \in \mathbb{R}$ can take only the $n$ values $a_{i1}/a_{11}, \dots, a_{in}/a_{1n}$. Each value is just a coefficient normalized by the first row of $A$. From this idea, we have the following algorithm:

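for i = 2 : m
  1.1 Calculate the ratio $Q_i(j,k)$ of measurement signals.
  1.2 Obtain $\mathcal{Q}_i = \{Q_i(j,k) \mid \mathrm{angle}[Q_i(j,k)] = 0\}$.
  1.3 Obtain $n$ peaks $q_{i1}, \dots, q_{in}$ based on the histogram of $\mathcal{Q}_i$.
end
2 Obtain the following estimate of the mixture matrix $A$:

$$\tilde{A} = \begin{bmatrix} 1 & \cdots & 1 \\ q_{21} & \cdots & q_{2n} \\ \vdots & \ddots & \vdots \\ q_{m1} & \cdots & q_{mn} \end{bmatrix} \tag{15}$$

4.2 Algorithm for estimating source signals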
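The source estimation in this subsection takes the estimate of the mixture matrix from subsection 4.1 as given. As a reminder of how that estimate arises, steps 1.1 to 2 can be sketched for m = 2 as follows. This is a minimal sketch on synthetic complex coefficients, not the implementation used in the simulation: the function name, the bin width, the phase tolerance and the random test data are all assumptions.

```python
import cmath
import random
from collections import defaultdict

def estimate_second_row(x1, x2, n, width=0.05, phase_tol=1e-6):
    """Steps 1.1-1.3 for i = 2: ratios, zero-phase selection, n histogram peaks."""
    bins = defaultdict(list)
    for u, v in zip(x1, x2):
        if abs(u) == 0:
            continue                            # ratio undefined where x1 vanishes
        q = v / u                               # step 1.1
        if abs(cmath.phase(q)) < phase_tol:     # step 1.2 (numerically zero phase)
            bins[round(q.real / width)].append(q.real)
    peaks = sorted(bins.values(), key=len, reverse=True)[:n]   # step 1.3
    return sorted(sum(p) / len(p) for p in peaks)

random.seed(0)
A = [[0.3420, 0.6428, 0.7660],
     [0.9397, 0.7660, 0.6428]]
x1, x2 = [], []
for _ in range(300):
    # Toy complex TF coefficients: one active source in most bins, two in the rest.
    active = random.sample(range(3), 1 if random.random() < 0.7 else 2)
    s = [0j, 0j, 0j]
    for l in active:
        s[l] = complex(random.gauss(0, 1), random.gauss(0, 1))
    x1.append(sum(A[0][l] * s[l] for l in range(3)))
    x2.append(sum(A[1][l] * s[l] for l in range(3)))

q2 = estimate_second_row(x1, x2, n=3)
A_tilde = [[1.0, 1.0, 1.0], q2]   # form of (15), columns in ascending order of q
```

The recovered values approximate the ratios $a_{2l}/a_{1l}$, in an arbitrary column order, which is consistent with the permutation and scale indeterminacies of BSS. Bins with two active sources generally give a ratio with non-zero phase and are discarded in step 1.2.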

As stated in subsection 2.2, we have to select a solution that is optimal in some sense, since the source signals cannot be uniquely determined even with the estimate $\tilde{A}$ of the mixture matrix.

Lewicki, et al. have proposed a method for determining the optimal solution so as to maximize the posterior probability $p(\tilde{\hat{s}}(j,k) \mid \hat{x}(j,k))$ of $\tilde{\hat{s}}(j,k)$ given $\hat{x}(j,k)$ and $\tilde{A}$ [5]. An $l_1$-norm solution is then derived under the assumption that the probability distribution is Laplacian.

We consider maximizing the posterior probability $p(\tilde{\hat{s}}(j,k) \mid \hat{x}(j,k), \tilde{A})$ of $\tilde{\hat{s}}(j,k)$ under the constraint $\hat{x}(j,k) = \tilde{A}\tilde{\hat{s}}(j,k)$. From Bayes' theorem, we have

$$p(\tilde{\hat{s}}(j,k) \mid \hat{x}(j,k), \tilde{A}) = \frac{p(\tilde{\hat{s}}(j,k))}{p(\hat{x}(j,k))} \tag{16}$$

In the above, we use $p(\hat{x}(j,k) \mid \tilde{\hat{s}}(j,k), \tilde{A}) = 1$, which follows from $\hat{x}(j,k) = \tilde{A}\tilde{\hat{s}}(j,k)$. Therefore, $\tilde{\hat{s}}(j,k)$ becomes

$$\tilde{\hat{s}}(j,k) = \arg\max_{\tilde{\hat{s}}(j,k)} p(\tilde{\hat{s}}(j,k)) \quad \text{s.t.} \quad \hat{x}(j,k) = \tilde{A}\tilde{\hat{s}}(j,k) \tag{17}$$

Now, setting $p(\hat{s}_i(j,k)) \propto \exp(-|\hat{s}_i(j,k)|)$ under the assumption that the source signals follow independent Laplacian distributions, we have

$$\tilde{\hat{s}}(j,k) = \arg\min_{\tilde{\hat{s}}(j,k)} \sum_{i=1}^{n} |\tilde{\hat{s}}_i(j,k)| \quad \text{s.t.} \quad \hat{x}(j,k) = \tilde{A}\tilde{\hat{s}}(j,k) \tag{18}$$

The above optimization can be solved as a linear programming problem. This procedure gives the estimates $\tilde{\hat{s}}(j,k)$ of the source signals, which contain at least $n - m$ zero components. Eq. (18) is called an $l_1$-norm solution.
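For small problems, the $l_1$-norm solution can be illustrated without a linear programming solver. The sketch below assumes real-valued coefficients and m = 2 (the TF coefficients in this paper are complex, so this is a simplification), and exploits the fact that a linear program attains its optimum at a basic solution, i.e., a solution with at most m non-zero components. The function name and the test vector are illustrative assumptions.

```python
from itertools import combinations

def l1_min_two_sensors(A, x):
    """Minimum l1-norm s with A s = x for a real 2-by-n matrix A, found by
    checking every basic solution (at most two non-zero components)."""
    n = len(A[0])
    best = None
    for p, q in combinations(range(n), 2):
        det = A[0][p] * A[1][q] - A[0][q] * A[1][p]
        if abs(det) < 1e-12:
            continue                                  # columns p and q are parallel
        sp = (x[0] * A[1][q] - A[0][q] * x[1]) / det  # Cramer's rule
        sq = (A[0][p] * x[1] - x[0] * A[1][p]) / det
        s = [0.0] * n
        s[p], s[q] = sp, sq
        if best is None or sum(map(abs, s)) < sum(map(abs, best)):
            best = s
    return best

A = [[0.3420, 0.6428, 0.7660],
     [0.9397, 0.7660, 0.6428]]
s_true = [0.0, 2.0, 1.0]          # at most m = 2 active sources at this index
x = [sum(a * s for a, s in zip(row, s_true)) for row in A]
s_l1 = l1_min_two_sensors(A, x)
```

In this example the returned vector coincides with `s_true` up to rounding and contains n - m = 1 zero component. The $l_1$-norm solution does not recover every sparse source vector, however: it always selects the two columns of the matrix whose directions enclose the observation.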

A necessary condition for estimating the source signals based on the $l_1$-norm solution is the following sparsity assumption:

Assumption 2 The signal $\hat{s}(j,k)$ has at most $m$ non-zero components for an arbitrary $(j,k)$; that is, for an arbitrary index set $\{i_1, i_2, \dots, i_{m+1}\} \subset \{1, 2, \dots, n\}$, the following equation holds:

$$\hat{s}_{i_1}(j,k)\,\hat{s}_{i_2}(j,k) \cdots \hat{s}_{i_{m+1}}(j,k) = 0, \quad \forall j, k \tag{19}$$

5. SIMULATION RESULTS

In this section, we demonstrate the performance of the proposed decomposition algorithm and the improvement of the sparsity by applying the RI-Spline wavelet to real speech signals.

For comparison with the RI-Spline wavelet, we use a usual algorithm in the time domain and an STFT-based algorithm in the TF domain. Assuming that the number of source signals is n = 3 and the number of observed signals is m = 2, we apply the proposed algorithm to three sound sources, consisting of one male and two female speakers, recorded at a sampling frequency of 8000 Hz in a non-reverberant room.

Fig. 4 Total number of indices with i source signals being zero (TIME, STFT, RIW)

5.1 Sparsity

We show the improvement of the sparsity obtained by using the RI-Spline wavelet.

The aim of obtaining a sparser representation as preprocessing is to make the signals satisfy assumptions 1 and 2, and thereby to improve the performance of the decomposition algorithm. We examine how the number of indices satisfying assumptions 1 and 2 changes when the RI-Spline wavelet is used. Setting $\{l_1, \dots, l_i\} \subset \{1, 2, 3\}$, we define $\Omega_i$ as

$$\Omega_i = \{t \mid \exists\, l_1, \dots, l_i;\ s_{l_1}(t) = \cdots = s_{l_i}(t) = 0\} \quad (i = 0, 1, 2, 3) \tag{20}$$

where $\Omega_i$ is the set of indices at which $i$ source signals are zero, and $\Omega_0$ is the set of indices at which no source signal is zero. For the index $(j,k)$, we define $\Omega_i$ in the same manner. Fig. 4 shows the values of $\#(\Omega_i)$ $(i = 1, 2, 3)$ computed by the following three algorithms: the time-domain algorithm (TIME), the STFT-based algorithm (STFT) with a window size of 64 ms, and the present TF algorithm using the RI-Spline wavelet (RIW).

Assumption 1 requires the existence of $\Omega_2$. As shown in Fig. 4, the value of $\#(\Omega_2)$ obtained with TIME is larger than that obtained with STFT, and the value of $\#(\Omega_2)$ obtained with RIW is larger than that obtained with STFT.

The set of indices satisfying assumption 2 is the complement $\Omega_0^c$ $(= \Omega_1 \cup \Omega_2 \cup \Omega_3)$ of $\Omega_0$. Fig. 4 shows that the value of $\#(\Omega_0^c)$ obtained with the TF-domain algorithms is smaller than that obtained with the time-domain algorithm. The value of $\#(\Omega_0^c)$ obtained with the RI-Spline wavelet is smaller than that obtained with the STFT. Therefore, the RI-Spline wavelet approach is best as preprocessing for the estimation of source signals.
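The index counts compared in this subsection can be computed as sketched below, reading the sets of (20) as "exactly i source signals are zero", in line with the description in the text. The function name, the tolerance parameter and the toy source array are assumptions made for illustration.

```python
def count_by_zero_sources(S, tol=0.0):
    """counts[i] = number of indices at which exactly i source signals are zero."""
    n = len(S)
    counts = [0] * (n + 1)
    for column in zip(*S):                      # one column per index t
        zeros = sum(1 for value in column if abs(value) <= tol)
        counts[zeros] += 1
    return counts

# Toy example with n = 3 sources and K = 6 indices (values are made up).
S = [[1, 0, 0, 2, 0, 1],
     [0, 3, 0, 1, 0, 2],
     [0, 0, 0, 0, 4, 3]]

counts = count_by_zero_sources(S)
satisfying_assumption_2 = sum(counts[1:])   # at least n - m = 1 zero when m = 2
```

Here three indices have exactly two zero sources (a single active source, as needed for assumption 1), and five of the six indices have at least one zero source.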

5.2 Underdetermined BSS

We carry out a simulation of underdetermined BSS using, as preprocessing, the STFT approach and the RI-Spline wavelet approach presented in section 4. For the STFT, the window sizes are set to 16, 32, 64 and 128 ms. The mixture matrix $A$ is set as

$$A = \begin{bmatrix} 0.3420 & 0.6428 & 0.7660 \\ 0.9397 & 0.7660 & 0.6428 \end{bmatrix} \tag{21}$$

Table 1 Estimation performance

              SNR1   SNR2   SNR3
RIW           11.4   12.0   18.4
STFT(16ms)    3.75   3.73   6.21
STFT(32ms)    5.27   5.21   6.29
STFT(64ms)    5.65   5.28   7.09
STFT(128ms)   6.13   3.89   5.46

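We evaluate the estimation performance by using the index defined as

$$\mathrm{SNR}_i = 20 \log \frac{\|s_i(t)\|}{\|\hat{s}_i(t) - s_i(t)\|} \tag{22}$$

where $\hat{s}_i(t)$ here denotes the time-domain estimate of $s_i(t)$. The estimation performance gets better as the SNR gets larger.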
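The index (22) can be computed as in the following sketch. A base-10 logarithm and the Euclidean norm are assumed here, since (22) does not state them explicitly; the function name and the sample values are illustrative.

```python
import math

def snr_db(s, s_est):
    """SNR of (22), assuming a base-10 logarithm and the Euclidean norm."""
    signal = math.sqrt(sum(v * v for v in s))
    error = math.sqrt(sum((e - v) ** 2 for v, e in zip(s, s_est)))
    return 20 * math.log10(signal / error)

s = [3.0, 4.0]         # true source, norm 5
s_est = [3.0, 4.5]     # estimate, error norm 0.5
value = snr_db(s, s_est)
```

With a signal norm of 5 and an error norm of 0.5 the ratio is 10, so the index equals 20 under these assumptions. A perfect estimate would make the error norm zero, which this sketch does not guard against.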

The RI-Spline wavelet approach and the STFT approach produce the estimates $\hat{A}_W$ and $\hat{A}_F$ of the mixture matrix, respectively:

$$\hat{A}_W = \begin{bmatrix} 0.3426 & 0.6415 & 0.7604 \\ 0.9395 & 0.7671 & 0.6494 \end{bmatrix} \tag{23}$$

$$\hat{A}_F = \begin{bmatrix} 0.3421 & 0.6417 & 0.7653 \\ 0.9397 & 0.7670 & 0.6437 \end{bmatrix} \tag{24}$$

where each estimated column vector is normalized to have norm 1. Comparing (21) and (23) shows that the mixture matrix is estimated successfully. The two results (23) and (24) differ little, however, because assumption 1 requires only the existence of $\Omega_2$ (indices at which two source signals are zero), i.e., only a certain degree of sparsity.

Table 1 shows the simulated SNRs for RIW and STFT. We can see that $\mathrm{SNR}_i$ is best with the RI-Spline wavelet. Although the two approaches estimate the mixture matrix with similar accuracy, as shown in (23) and (24), their SNRs differ. This comes from assumption 2: the estimation performance of the source signals depends on $\#(\Omega_0)$, the number of indices at which no source signal is zero. With the RI-Spline wavelet, we obtain a larger SNR together with a smaller $\#(\Omega_0)$.
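The normalization of the matrices can be checked numerically. The short sketch below, whose variable names are illustrative, computes the column norms, the row norms and the column directions of the matrix in (21).

```python
import math

A = [[0.3420, 0.6428, 0.7660],
     [0.9397, 0.7660, 0.6428]]

# Norm of each column (one column per source).
column_norms = [math.hypot(A[0][l], A[1][l]) for l in range(3)]
# Norm of each row (one row per sensor), for comparison.
row_norms = [math.sqrt(sum(v * v for v in row)) for row in A]
# Direction of each column in the plane of the two observations, in degrees.
angles_deg = [math.degrees(math.atan2(A[1][l], A[0][l])) for l in range(3)]
```

To four decimal places, each column has unit norm and points at approximately 70, 50 and 40 degrees, whereas the rows do not have unit norm. The same holds, within rounding, for the columns of (23) and (24).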

6. CONCLUSION

This paper proposed an approach to underdetermined BSS based on time-frequency analysis using the RI-Spline wavelet. Through a simulation using real speech signals, it was clarified that source signals in the time domain can be transformed into a sparser representation by RI-Spline wavelet-based time-frequency analysis. That is, compared with a conventional STFT approach, we can increase the number of indices satisfying the sparsity assumptions and obtain better estimation performance.

Two problems remain for future work: the first is to improve the estimation performance by obtaining a sparser representation with the wavelet packet; the second is to develop a new algorithm that relaxes the sparsity conditions required by the estimation algorithm.

REFERENCES

[1] S. Amari, A. Cichocki and H. H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems, vol. 8, pp. 757-763, 1996.

[2] P. Bofill and M. Zibulevsky, "Underdetermined blind source separation using sparse representations," Signal Processing, vol. 81, no. 11, pp. 2353-2362, 2001.

[3] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830-1847, 2004.

[4] D. Napoletani, C. A. Berenstein and P. S. Krishnaprasad, "Quotient signal decomposition and order estimation," Technical research report, University of Maryland, TR 2007-42.

[5] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Computation, vol. 12, no. 2, pp. 337-365, 2000.

[6] Z. Zhang, H. Fujiwara, H. Toda and H. Kawabata, "A New Complex Wavelet Transform by using RI-Spline Wavelet," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 937-940, 2004.
