[IEEE 2013 IEEE Conference on Wireless Sensor (ICWISE) - Kuching, Sarawak, Malaysia (2013.12.2-2013.12.4)] 2013 IEEE Conference on Wireless Sensor (ICWISE) - β-divergence two-dimensional

β-Divergence Two-Dimensional Sparse Nonnegative Matrix Factorization for Audio Source Separation

A.M. Darsono, N. Z. Haron, A. S. Jaafar Faculty of Electronics & Computer Engineering,

Universiti Teknikal Malaysia Melaka Melaka, Malaysia

{abdmajid, zaidi, shukur}@utem.edu.my

M. I. Ahmad School of Computer & Communication Engineering

Universiti Malaysia Perlis Perlis, Melaka

[email protected]

Abstract—In this paper, a novel sparse two-dimensional nonnegative matrix factorization (SNMF2D) with the β-divergence is proposed. In SNMF2D, the time-frequency (TF) profile of each source is modeled as a two-dimensional convolution of the temporal code and the spectral basis. A sparsity constraint is imposed to reduce the ambiguity and provide uniqueness to the solution. The proposed model maximises the joint probability of the mixing spectral basis and temporal codes conditioned on the mixed signal using multiplicative update rules. Experimental tests have been conducted in an audio application to blindly separate the sources in a musical mixture. The results show the efficacy of the algorithm in separating the audio sources from a single channel mixture.

Keywords-Blind Source separation; nonnegative matrix factorization; multiplicative update rule; sparse feature.

I. INTRODUCTION

Nonnegative matrix factorization (NMF) [1] has become one of the most promising techniques in signal processing. It has been applied successfully in areas such as automatic music transcription [2], cryptography [3] and pattern recognition [4]. One of the most useful properties of NMF is that the nonnegativity constraint by itself tends to enforce a sparse representation of the data. Such a representation is easier to estimate because the data are encoded using only a few active components. Popular cost functions for NMF are the least-squares (LS) distance, the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence. All three can be enclosed in the more general framework of the β-divergence [5, 6], which is defined by

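$$
d_\beta(x \mid y) =
\begin{cases}
\dfrac{x^{\beta} + (\beta-1)\,y^{\beta} - \beta\,x\,y^{\beta-1}}{\beta(\beta-1)}, & \beta \in \Re \setminus \{0,1\} \\[2ex]
x(\log x - \log y) + (y - x), & \beta = 1 \\[1ex]
\dfrac{x}{y} - \log\dfrac{x}{y} - 1, & \beta = 0
\end{cases}
\tag{1}
$$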

where $d_\beta(\cdot \mid \cdot)$ is the scalar cost function. The cases β = 0 and β = 1 (taken as limits) and β = 2 correspond to the IS divergence, the KL divergence and the LS distance, respectively. They underlie multiplicative Gamma observation noise, Poisson noise and additive Gaussian observation noise, respectively.
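As a minimal sketch of definition (1), the scalar divergence can be evaluated element-wise with NumPy. The function name `beta_divergence` is an illustrative choice and not part of the paper; inputs are assumed to be strictly positive.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Element-wise beta-divergence d_beta(x | y) for strictly positive x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 0:
        # Itakura-Saito limit
        ratio = x / y
        return ratio - np.log(ratio) - 1.0
    if beta == 1:
        # Kullback-Leibler limit
        return x * (np.log(x) - np.log(y)) + (y - x)
    # General case, beta in R \ {0, 1}
    numerator = x**beta + (beta - 1.0) * y**beta - beta * x * y**(beta - 1.0)
    return numerator / (beta * (beta - 1.0))

# Toy values (illustrative only)
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 2.0, 1.0])
ls_demo = beta_divergence(x_demo, y_demo, 2.0)  # equals 0.5 * (x - y)**2
```

For β = 2 the general branch reduces to half the squared difference, which is one way to check the formula against the LS case named in the text.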

The recently developed two-dimensional NMF (NMF2D) model [7] extends the NMF model to provide a decomposition that efficiently captures the temporal dependency of the frequency patterns within a source. In NMF2D, the time-frequency (TF) profile of each source is modeled as a two-dimensional convolution of the temporal code and the spectral basis. This significantly reduces the number of components per source needed in the decomposition. So far, no research work has applied the general framework of the β-divergence to NMF2D. This paper accommodates the β-divergence cost function in the NMF2D model and investigates the effect of β on the performance of the algorithm. To further improve the algorithm, this paper proposes a sparseness constraint in the cost function to reduce the ambiguity associated with the estimation of the spectral basis and temporal codes.

The remainder of the paper is organized as follows: the single channel mixture model in the TF domain is introduced in Section II. The derivation of the proposed separation technique, sparse β-divergence two-dimensional NMF, is detailed in Section III. Section IV presents the results of the experimental tests as well as their analysis. Finally, Section V concludes the paper.

II. SINGLE CHANNEL MIXTURE

The problem of single channel source separation (SCSS) is treated as one observation composed of several unknown sources. In the time domain, SCSS can be modeled as:

$$
y(t) = \sum_{j=1}^{J} x_j(t) + e(t) \tag{2}
$$

where j = 1, 2, ..., J indexes the sources, J is the number of sources and $e(t)$ is additive noise. From (2), the mixture is then projected to the time-frequency domain using the short-time Fourier transform (STFT) such that

$$
y_{f,n} = \sum_{j=1}^{J} x_{j,f,n} + e_{f,n} \tag{3}
$$

2013 IEEE Conference on Wireless Sensors (ICWiSe2013), December 2 - 4, 2013, Kuching, Sarawak

978-1-4799-1576-7/13/$31.00 ©2013 IEEE 119

where $y_{f,n}$ and $x_{j,f,n}$ are the time-frequency components of the corresponding time signals. Here, frequency bins are indexed by f = 1, 2, ..., F and time frames by n = 1, 2, ..., N, so F and N denote the number of frequency bins and time frames, respectively.

Then the power spectrogram is defined as the squared magnitude of (3), which can be expressed as

$$
|y_{f,n}|^2 = \sum_{j=1}^{J} \sum_{k=1}^{K} x_{j,f,n}\, x^{*}_{k,f,n} + 2\,\mathrm{Re}\!\left[ \sum_{j=1}^{J} x_{j,f,n}\, e^{*}_{f,n} \right] + |e_{f,n}|^2 \tag{4}
$$

where k also runs over the sources (K = J). Assuming windowed disjoint orthogonality (WDO) of the sources and the noise, i.e. $x_{j,f,n}\, x^{*}_{k,f,n} = 0$ and $x_{j,f,n}\, e^{*}_{f,n} = 0$ for all f and n with $j \neq k$, (4) can be written as

$$
|y_{f,n}|^2 = \sum_{j=1}^{J} |x_{j,f,n}|^2 + |e_{f,n}|^2 \tag{5}
$$

In matrix form, model (5) can be rewritten as

$$
|\mathbf{Y}|^{.2} = \sum_{j=1}^{J} |\mathbf{X}_j|^{.2} + |\mathbf{E}|^{.2} \tag{6}
$$

where $|\mathbf{Y}|^{.2} = \{|y_{f,n}|^2\}_{f,n}$, $|\mathbf{X}_j|^{.2} = \{|x_{j,f,n}|^2\}_{f,n}$ and $|\mathbf{E}|^{.2} = \{|e_{f,n}|^2\}_{f,n}$. The superscript "." denotes an element-wise operation. In single channel blind source separation, given the power spectrogram of the observed mixture, $|\mathbf{Y}|^{.2}$, we are interested in estimating the sources, $\mathbf{X}_j$.
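The step from (4) to (6) can be illustrated numerically. The sketch below builds two toy complex time-frequency matrices with disjoint support (an idealised stand-in for the WDO assumption, with no noise term) and checks that the power spectrogram of the mixture equals the sum of the source power spectrograms. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 4, 5  # toy numbers of frequency bins and time frames

# Each TF cell is assigned to exactly one source, so the supports are disjoint
support = rng.random((F, N)) < 0.5
X1 = (rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))) * support
X2 = (rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))) * ~support

Y = X1 + X2                                      # noise-free instance of (3)
power_Y = np.abs(Y) ** 2                         # element-wise squared magnitude
power_sum = np.abs(X1) ** 2 + np.abs(X2) ** 2    # right-hand side of (6) without noise

cross_term = X1 * np.conj(X2)                    # zero in every cell under disjoint support
```

Real audio sources only satisfy this approximately, so in practice (6) holds as an approximation rather than an identity.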

III. PROPOSED SEPARATION METHOD

In this section, a new two-dimensional sparse nonnegative matrix factorization algorithm based on the β-divergence (sparse β-divergence NMF2D) is developed. The algorithm optimizes the parameters of the signal model using multiplicative update rules derived from gradient descent. To facilitate the derivation, let us first consider the signal model in terms of the power time-frequency (TF) representation.

A. Source Model

Since we are dealing with single channel audio mixing, a natural time-frequency representation, the log-frequency spectrogram, is generated using the constant-Q transform [8]. Compared with the classic spectrogram, where signals are decomposed into components at linearly spaced frequencies, the log-frequency spectrogram is more suitable for audio applications, especially modern music, since its resolution is geometrically related to the frequency.

From (6) and considering a noise-free environment, the (log-frequency) power spectrogram of the jth source, $|\mathbf{X}_j|^{.2}$, is modeled as a product of two nonnegative matrices as follows:

$$
|\mathbf{X}_j|^{.2} \approx \sum_{\tau=0}^{\tau_{\max}} \sum_{\phi=0}^{\phi_{\max}} \overset{\downarrow\phi}{\mathbf{W}_j^{\tau}}\; \overset{\rightarrow\tau}{\mathbf{H}_j^{\phi}} \tag{7}
$$

The matrix $\mathbf{W}^{\tau}$ represents the τth slice of the spectral basis and $\mathbf{H}^{\phi}$ represents the φth slice of the temporal code for each spectral basis element. The arrow in $\overset{\downarrow\phi}{\mathbf{W}^{\tau}}$ denotes the downward shift operator, which moves each element of the matrix down by φ rows. By the same token, the arrow in $\overset{\rightarrow\tau}{\mathbf{H}^{\phi}}$ denotes the right shift operator, which moves each element of the matrix to the right by τ columns. For example:

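$$
\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix}, \quad
\overset{\downarrow 1}{\mathbf{B}} = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix}, \quad
\overset{\rightarrow 2}{\mathbf{B}} = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}
$$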
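The two shift operators can be sketched in a few lines of NumPy. The helper names `shift_down` and `shift_right` are illustrative, and vacated entries are filled with zeros, matching the example with the matrix B.

```python
import numpy as np

def shift_down(A, phi):
    """Move every element of A down by phi rows, zero-filling the top rows."""
    out = np.zeros_like(A)
    if phi < A.shape[0]:
        out[phi:, :] = A[:A.shape[0] - phi, :]
    return out

def shift_right(A, tau):
    """Move every element of A right by tau columns, zero-filling the left columns."""
    out = np.zeros_like(A)
    if tau < A.shape[1]:
        out[:, tau:] = A[:, :A.shape[1] - tau]
    return out

B = np.array([[1, 2, 3],
              [1, 2, 3],
              [1, 2, 3]])
B_down1 = shift_down(B, 1)    # rows move down by one
B_right2 = shift_right(B, 2)  # columns move right by two
```

Elements pushed past the matrix edge are discarded rather than wrapped around, so these are shifts, not circular rotations.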

For the two-dimensional nonnegative matrix factorization (NMF2D) model in (7), the factorization is based on a model that represents both temporal structure and pitch change. In audio source separation, the model represents each instrument by a single time-frequency profile convolved in both time and frequency by a time-pitch weight matrix. This model radically reduces the number of components needed to model various instruments and effectively solves the SCSS problem. In the following, a novel sparse NMF2D algorithm with the β-divergence is proposed to estimate the parameters $\overset{\downarrow\phi}{\mathbf{W}_j^{\tau}}$ and $\overset{\rightarrow\tau}{\mathbf{H}_j^{\phi}}$ from the mixture.
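To make the double convolution in (7) concrete, the following sketch evaluates the model for given factors. The array layout is an assumption made for illustration: `W` has shape (number of τ slices, F, J) and `H` has shape (number of φ slices, J, N). The helper names are hypothetical.

```python
import numpy as np

def shift_down(A, phi):
    """Shift rows of A down by phi, zero-filling the top."""
    out = np.zeros_like(A)
    if phi < A.shape[0]:
        out[phi:, :] = A[:A.shape[0] - phi, :]
    return out

def shift_right(A, tau):
    """Shift columns of A right by tau, zero-filling the left."""
    out = np.zeros_like(A)
    if tau < A.shape[1]:
        out[:, tau:] = A[:, :A.shape[1] - tau]
    return out

def nmf2d_model(W, H):
    """Sum over tau and phi of (W[tau] shifted down by phi) @ (H[phi] shifted right by tau)."""
    n_tau, F, _ = W.shape
    n_phi, _, N = H.shape
    P = np.zeros((F, N))
    for tau in range(n_tau):
        for phi in range(n_phi):
            P += shift_down(W[tau], phi) @ shift_right(H[phi], tau)
    return P

# Toy example: one component whose spectrum is a single peak in bin 0
W_toy = np.zeros((1, 4, 1))
W_toy[0, 0, 0] = 1.0
H_toy = np.zeros((2, 1, 5))
H_toy[0, 0, 0] = 1.0   # played at frame 0 with no pitch shift
H_toy[1, 0, 2] = 1.0   # played at frame 2, shifted up by one bin
P_toy = nmf2d_model(W_toy, H_toy)
```

In the toy example a single spectral profile produces two notes at different pitches, which is the sense in which the model needs only one component per instrument.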

B. Cost Function with Sparseness Constraint

A sparse representation is a representation of data in which most coefficients are zero. It is proving to be a particularly powerful tool for the analysis and processing of audio signals. If each signal to be separated has a sparse representation, then there is a good chance that there will be little overlap between the small sets of coefficients used to represent the different source signals. Therefore, by selecting the coefficients used by each source signal, we can restore each of the original signals with most of the interference from the unwanted signals removed.

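We now combine the β-divergence defined in (1) with a sparsity constraint, so that the factors are obtained by minimizing the following cost function:

$$
C_\beta = \sum_{f,n} \left( \frac{\left(|\mathbf{Y}|^{.2}_{f,n}\right)^{\beta}}{\beta(\beta-1)} + \frac{\mathbf{P}_{f,n}^{\beta}}{\beta} - \frac{|\mathbf{Y}|^{.2}_{f,n}\, \mathbf{P}_{f,n}^{\beta-1}}{\beta-1} \right) + \lambda f(\mathbf{H}) \tag{8}
$$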

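for $f = 1, \ldots, F$ and $n = 1, \ldots, N$, where $\mathbf{P} = \sum_{j} \sum_{\tau,\phi} \overset{\downarrow\phi}{\mathbf{W}_j^{\tau}}\; \overset{\rightarrow\tau}{\mathbf{H}_j^{\phi}}$ with the spectral bases normalized as $\mathbf{W}^{\tau}_{f,j} = \mathbf{W}^{\tau}_{f,j} \Big/ \sqrt{\sum_{\tau,f} \left(\mathbf{W}^{\tau}_{f,j}\right)^2}$. The parameter λ controls the strength of the sparsity constraint and $f(\mathbf{H})$ can be any function with positive derivative, such as the $L_{\varsigma}$-norm ($\varsigma > 0$) given by $f(\mathbf{H}) = \|\mathbf{H}\|_{\varsigma} = \left( \sum_{\phi,j,n} |\mathbf{H}^{\phi}_{j,n}|^{\varsigma} \right)^{1/\varsigma}$. This resolves the ambiguity between the factors by imposing sparseness on $\mathbf{H}^{\phi}$ and forcing the structure onto $\mathbf{W}^{\tau}$.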
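As a sketch of how the cost (8) might be evaluated, the function below takes the mixture power spectrogram, a model matrix P of the same shape and the temporal codes H. It assumes β is neither 0 nor 1 and takes f(H) to be the L1 norm, one admissible choice of the penalty. The name `sparse_beta_cost` and the argument layout are illustrative.

```python
import numpy as np

def sparse_beta_cost(Y2, P, H, beta, lam):
    """Cost (8) for beta not in {0, 1}, with f(H) chosen as the L1 norm of H."""
    data_term = (Y2**beta / (beta * (beta - 1.0))
                 + P**beta / beta
                 - Y2 * P**(beta - 1.0) / (beta - 1.0))
    sparsity_term = lam * np.abs(H).sum()
    return data_term.sum() + sparsity_term

# Toy values (illustrative only)
rng = np.random.default_rng(0)
Y2_toy = rng.random((4, 6)) + 0.1
H_toy = rng.random((2, 1, 6))

cost_perfect = sparse_beta_cost(Y2_toy, Y2_toy, H_toy, beta=0.9, lam=0.0)
cost_penalised = sparse_beta_cost(Y2_toy, Y2_toy, H_toy, beta=0.9, lam=0.5)
cost_mismatch = sparse_beta_cost(Y2_toy, 2.0 * Y2_toy, H_toy, beta=0.9, lam=0.0)
```

When the model matches the data exactly the three data terms cancel, so any remaining cost comes from the sparsity penalty alone.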

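In this paper, we employ the multiplicative gradient descent approach, which updates each parameter by multiplying its value at the previous iteration by a certain coefficient. The derivatives of (8) with respect to $\mathbf{W}^{\tau}$ and $\mathbf{H}^{\phi}$ of β-SNMF2D are given by:

$$
\begin{aligned}
\frac{\partial C_\beta}{\partial \mathbf{W}^{\tau'}_{f',j'}}
&= \frac{\partial}{\partial \mathbf{W}^{\tau'}_{f',j'}} \left( \sum_{f,n} \left( \frac{\left(|\mathbf{Y}|^{.2}_{f,n}\right)^{\beta}}{\beta(\beta-1)} + \frac{\mathbf{P}_{f,n}^{\beta}}{\beta} - \frac{|\mathbf{Y}|^{.2}_{f,n}\, \mathbf{P}_{f,n}^{\beta-1}}{\beta-1} \right) + \lambda f(\mathbf{H}) \right) \\
&= \sum_{\phi,n} \left( \mathbf{P}^{\beta-1}_{f'+\phi,n} - |\mathbf{Y}|^{.2}_{f'+\phi,n}\, \mathbf{P}^{\beta-2}_{f'+\phi,n} \right) \mathbf{H}^{\phi}_{j',n-\tau'}
\end{aligned}
\tag{9}
$$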

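$$
\begin{aligned}
\frac{\partial C_\beta}{\partial \mathbf{H}^{\phi'}_{j',n'}}
&= \frac{\partial}{\partial \mathbf{H}^{\phi'}_{j',n'}} \left( \sum_{f,n} \left( \frac{\left(|\mathbf{Y}|^{.2}_{f,n}\right)^{\beta}}{\beta(\beta-1)} + \frac{\mathbf{P}_{f,n}^{\beta}}{\beta} - \frac{|\mathbf{Y}|^{.2}_{f,n}\, \mathbf{P}_{f,n}^{\beta-1}}{\beta-1} \right) + \lambda f(\mathbf{H}) \right) \\
&= \sum_{\tau,f} \mathbf{W}^{\tau}_{f-\phi',j'} \left( \mathbf{P}^{\beta-1}_{f,n'+\tau} - |\mathbf{Y}|^{.2}_{f,n'+\tau}\, \mathbf{P}^{\beta-2}_{f,n'+\tau} \right) + \lambda \frac{\partial f(\mathbf{H})}{\partial \mathbf{H}^{\phi'}_{j',n'}}
\end{aligned}
\tag{10}
$$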

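Thus, by applying the standard gradient descent approach:

$$
\begin{aligned}
\mathbf{W}^{\tau'}_{f',j'} &\leftarrow \mathbf{W}^{\tau'}_{f',j'} - \eta_W \frac{\partial C_\beta}{\partial \mathbf{W}^{\tau'}_{f',j'}} \\
\mathbf{H}^{\phi'}_{j',n'} &\leftarrow \mathbf{H}^{\phi'}_{j',n'} - \eta_H \frac{\partial C_\beta}{\partial \mathbf{H}^{\phi'}_{j',n'}}
\end{aligned}
\tag{11}
$$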

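where $\eta_W$ and $\eta_H$ are positive learning rates, which can be obtained by following [1], namely:

$$
\eta_W = \frac{\mathbf{W}^{\tau'}_{f',j'}}{\sum_{\phi,n} \mathbf{P}^{\beta-1}_{f'+\phi,n}\, \mathbf{H}^{\phi}_{j',n-\tau'}}, \qquad
\eta_H = \frac{\mathbf{H}^{\phi'}_{j',n'}}{\sum_{\tau,f} \mathbf{W}^{\tau}_{f-\phi',j'}\, \mathbf{P}^{\beta-1}_{f,n'+\tau} + \lambda \dfrac{\partial f(\mathbf{H})}{\partial \mathbf{H}^{\phi'}_{j',n'}}}
\tag{12}
$$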
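The way the step size in (12) turns the additive rule (11) into a multiplicative one can be spelled out for the W update. In this sketch, $G^{+}$ and $G^{-}$ are shorthand introduced here for the positive and negative parts of the gradient in (9); they are not symbols used in the paper.

```latex
\begin{aligned}
G^{+} &= \sum_{\phi,n} \mathbf{P}^{\beta-1}_{f'+\phi,n}\, \mathbf{H}^{\phi}_{j',n-\tau'}, \qquad
G^{-} = \sum_{\phi,n} |\mathbf{Y}|^{.2}_{f'+\phi,n}\, \mathbf{P}^{\beta-2}_{f'+\phi,n}\, \mathbf{H}^{\phi}_{j',n-\tau'}, \\
\mathbf{W}^{\tau'}_{f',j'} &\leftarrow \mathbf{W}^{\tau'}_{f',j'} - \eta_W \left( G^{+} - G^{-} \right)
= \mathbf{W}^{\tau'}_{f',j'} - \frac{\mathbf{W}^{\tau'}_{f',j'}}{G^{+}} \left( G^{+} - G^{-} \right)
= \mathbf{W}^{\tau'}_{f',j'}\, \frac{G^{-}}{G^{+}}
\end{aligned}
```

Because $G^{+}$ and $G^{-}$ are sums of nonnegative terms, the factor $G^{-}/G^{+}$ is nonnegative, so a nonnegative initial value stays nonnegative after every update. The same substitution with $\eta_H$ gives the H rule, with the sparsity derivative appearing in the denominator.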

Thus, the multiplicative learning rules become:

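$$
\mathbf{H}^{\phi} \leftarrow \mathbf{H}^{\phi} \bullet
\frac{\sum_{\tau} \left( \overset{\downarrow\phi}{\mathbf{W}^{\tau}} \right)^{\mathrm{T}} \overset{\leftarrow\tau}{\left( \mathbf{P}^{.(\beta-2)} \bullet |\mathbf{Y}|^{.2} \right)}}
{\sum_{\tau} \left( \overset{\downarrow\phi}{\mathbf{W}^{\tau}} \right)^{\mathrm{T}} \overset{\leftarrow\tau}{\mathbf{P}^{.(\beta-1)}} + \lambda \dfrac{\partial f(\mathbf{H})}{\partial \mathbf{H}^{\phi}}}
\tag{13}
$$

$$
\mathbf{W}^{\tau} \leftarrow \mathbf{W}^{\tau} \bullet
\frac{\sum_{\phi} \overset{\uparrow\phi}{\left( \mathbf{P}^{.(\beta-2)} \bullet |\mathbf{Y}|^{.2} \right)} \left( \overset{\rightarrow\tau}{\mathbf{H}^{\phi}} \right)^{\mathrm{T}}}
{\sum_{\phi} \overset{\uparrow\phi}{\mathbf{P}^{.(\beta-1)}} \left( \overset{\rightarrow\tau}{\mathbf{H}^{\phi}} \right)^{\mathrm{T}}}
\tag{14}
$$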

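In (13) and (14), $\mathbf{A} \bullet \mathbf{B}$ denotes element-wise multiplication and $\frac{\mathbf{A}}{\mathbf{B}}$ denotes element-wise division. The arrows ← and ↑ denote left and upward shifts, the counterparts of → and ↓. Table I presents the main steps of the proposed algorithm for blind separation using β-SNMF2D.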

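C. Reconstruction of the Separated Sources

From the mixture $|\mathbf{Y}|^{.2}$, we seek the two estimated sources, which are $|\hat{\mathbf{X}}_1|^{.2} = \sum_{\tau=0}^{\tau_{\max}} \sum_{\phi=0}^{\phi_{\max}} \overset{\downarrow\phi}{\mathbf{W}_1^{\tau}}\; \overset{\rightarrow\tau}{\mathbf{H}_1^{\phi}}$ and $|\hat{\mathbf{X}}_2|^{.2} = \sum_{\tau=0}^{\tau_{\max}} \sum_{\phi=0}^{\phi_{\max}} \overset{\downarrow\phi}{\mathbf{W}_2^{\tau}}\; \overset{\rightarrow\tau}{\mathbf{H}_2^{\phi}}$. Then, using the binary masking technique [9], the masking matrix is defined as $\mathbf{M}_j = \{ m_{j,f,n} \}$ for $f = 1, \ldots, F$, $n = 1, \ldots, N$ and $j = 1, \ldots, J$, where

$$
m_{j,f,n} =
\begin{cases}
1, & \text{if } |\hat{x}_{j,f,n}|^2 > |\hat{x}_{k,f,n}|^2, \; k \neq j \\
0, & \text{otherwise}
\end{cases}
\tag{15}
$$

The time domain estimated signal $\hat{\mathbf{x}}_j$ is obtained by resynthesizing $\mathbf{M}_j$ with the mixture $|\mathbf{Y}|^{.2}$, i.e. $\hat{\mathbf{x}}_j = \mathrm{resynthesize}\left( \mathbf{M}_j \bullet |\mathbf{Y}|^{.2} \right)$. Here, 'resynthesize' signifies the inverse mapping of the log-frequency axis to the original frequency axis, followed by the inverse STFT back to the time domain.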
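The masking rule (15) can be sketched as follows, assuming the estimated power spectrograms are stacked in an array of shape (J, F, N). The function name `binary_masks` is illustrative, and the resynthesis step (inverse log-frequency mapping and inverse STFT) is not shown.

```python
import numpy as np

def binary_masks(estimates):
    """Return masks of shape (J, F, N): 1 where source j strictly dominates every other source."""
    estimates = np.asarray(estimates, dtype=float)
    masks = np.zeros_like(estimates)
    for j in range(estimates.shape[0]):
        others = np.delete(estimates, j, axis=0)            # all sources except j
        masks[j] = np.all(estimates[j] > others, axis=0)    # strict inequality as in (15)
    return masks

# Toy estimated power spectrograms for two sources (illustrative values)
X1_hat = np.array([[4.0, 1.0],
                   [0.0, 2.0]])
X2_hat = np.array([[1.0, 3.0],
                   [5.0, 2.0]])
masks = binary_masks(np.stack([X1_hat, X2_hat]))

Y2_mix = X1_hat + X2_hat          # toy mixture power spectrogram
masked_sources = masks * Y2_mix   # each TF cell of the mixture goes to at most one source
```

With the strict inequality, a cell where two estimates tie is assigned to neither source; a practical implementation may prefer to break ties differently.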

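TABLE I. ALGORITHM OF BETA DIVERGENCE SNMF2D

1. Initialize $\mathbf{W}^{\tau}$ and $\mathbf{H}^{\phi}$ with nonnegative random values.

2. Normalize $\mathbf{W}^{\tau}_{f,j} = \mathbf{W}^{\tau}_{f,j} \Big/ \sqrt{\sum_{\tau,f} \left(\mathbf{W}^{\tau}_{f,j}\right)^2}$

3. Compute $\mathbf{P} = \sum_{j} \sum_{\tau,\phi} \overset{\downarrow\phi}{\mathbf{W}_j^{\tau}}\; \overset{\rightarrow\tau}{\mathbf{H}_j^{\phi}}$

4. Update $\mathbf{H}^{\phi} \leftarrow \mathbf{H}^{\phi} \bullet \dfrac{\sum_{\tau} \left( \overset{\downarrow\phi}{\mathbf{W}^{\tau}} \right)^{\mathrm{T}} \overset{\leftarrow\tau}{\left( \mathbf{P}^{.(\beta-2)} \bullet |\mathbf{Y}|^{.2} \right)}}{\sum_{\tau} \left( \overset{\downarrow\phi}{\mathbf{W}^{\tau}} \right)^{\mathrm{T}} \overset{\leftarrow\tau}{\mathbf{P}^{.(\beta-1)}} + \lambda \dfrac{\partial f(\mathbf{H})}{\partial \mathbf{H}^{\phi}}}$

5. Compute $\mathbf{P} = \sum_{j} \sum_{\tau,\phi} \overset{\downarrow\phi}{\mathbf{W}_j^{\tau}}\; \overset{\rightarrow\tau}{\mathbf{H}_j^{\phi}}$

6. Update $\mathbf{W}^{\tau} \leftarrow \mathbf{W}^{\tau} \bullet \dfrac{\sum_{\phi} \overset{\uparrow\phi}{\left( \mathbf{P}^{.(\beta-2)} \bullet |\mathbf{Y}|^{.2} \right)} \left( \overset{\rightarrow\tau}{\mathbf{H}^{\phi}} \right)^{\mathrm{T}}}{\sum_{\phi} \overset{\uparrow\phi}{\mathbf{P}^{.(\beta-1)}} \left( \overset{\rightarrow\tau}{\mathbf{H}^{\phi}} \right)^{\mathrm{T}}}$

7. Repeat steps 2 to 6 until convergence.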
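The steps of Table I can be sketched in NumPy as below. This is a minimal illustration rather than the authors' implementation: it assumes f(H) is the L1 norm (so its derivative is 1), replaces the convergence test with a fixed iteration count, adds a small `eps` to avoid division by zero, and uses hypothetical names and array layouts (`W` of shape (n_tau, F, J), `H` of shape (n_phi, J, N)).

```python
import numpy as np

def shift(A, rows=0, cols=0):
    """Zero-filled shift: rows > 0 moves down, cols > 0 moves right (negative: up / left)."""
    out = np.zeros_like(A)
    F, N = A.shape
    if abs(rows) >= F or abs(cols) >= N:
        return out  # everything is shifted out of the matrix
    out[max(0, rows):F - max(0, -rows), max(0, cols):N - max(0, -cols)] = \
        A[max(0, -rows):F - max(0, rows), max(0, -cols):N - max(0, cols)]
    return out

def model(W, H):
    """P as in steps 3 and 5: sum of down-shifted W slices times right-shifted H slices."""
    P = np.zeros((W.shape[1], H.shape[2]))
    for t in range(W.shape[0]):
        for p in range(H.shape[0]):
            P += shift(W[t], rows=p) @ shift(H[p], cols=t)
    return P

def sparse_beta_nmf2d(Y2, n_comp, n_tau, n_phi, beta=0.9, lam=0.5,
                      n_iter=50, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    F, N = Y2.shape
    W = rng.random((n_tau, F, n_comp)) + eps    # step 1
    H = rng.random((n_phi, n_comp, N)) + eps
    for _ in range(n_iter):                     # step 7, with a fixed count (assumption)
        # Step 2: normalise each component over all tau slices and frequencies
        W = W / (np.sqrt((W ** 2).sum(axis=(0, 1), keepdims=True)) + eps)
        P = model(W, H) + eps                   # step 3
        A, B = Y2 * P ** (beta - 2.0), P ** (beta - 1.0)
        num = np.stack([sum(shift(W[t], rows=p).T @ shift(A, cols=-t)
                            for t in range(n_tau)) for p in range(n_phi)])
        den = np.stack([sum(shift(W[t], rows=p).T @ shift(B, cols=-t)
                            for t in range(n_tau)) for p in range(n_phi)])
        H = H * num / (den + lam + eps)         # step 4, with df/dH = 1 for the L1 norm
        P = model(W, H) + eps                   # step 5
        A, B = Y2 * P ** (beta - 2.0), P ** (beta - 1.0)
        num = np.stack([sum(shift(A, rows=-p) @ shift(H[p], cols=t).T
                            for p in range(n_phi)) for t in range(n_tau)])
        den = np.stack([sum(shift(B, rows=-p) @ shift(H[p], cols=t).T
                            for p in range(n_phi)) for t in range(n_tau)])
        W = W * num / (den + eps)               # step 6
    return W, H

# Toy usage on a random nonnegative matrix standing in for a power spectrogram
Y2_toy = np.random.default_rng(1).random((6, 8)) + 0.1
W_est, H_est = sparse_beta_nmf2d(Y2_toy, n_comp=2, n_tau=2, n_phi=3, n_iter=20)
```

The defaults β = 0.9 and λ = 0.5 mirror the values reported in Section IV; the toy sizes are far smaller than the τmax = 8 and φmax = 32 used in the experiments.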



IV. EXPERIMENTS & ANALYSIS

The proposed algorithm is tested on audio signals containing a synthetic piano sound and a trumpet sound. The mixture is approximately 5 s long and sampled at 16 kHz. In this experiment, an STFT with a 2048-point Hamming window and 50% overlap was used, which gives 175 frequency bins in the log-frequency spectrogram within the range of 50 Hz to 8 kHz with 24 bins per octave. This corresponds to twice the resolution of the equal-tempered musical scale (12 semitones per octave). Figure 1 shows the original TF representations of the piano and trumpet sources as well as their mixture. For audio separation, after conducting Monte-Carlo experiments over 50 independent realizations of the mixture, the maximum τ and φ shifts of the convolutive factors were set to $\tau_{\max} = 8$ and $\phi_{\max} = 32$. This was the best attainable parameter setting to represent the temporal code and spectral basis in the factorization for most music signals. To evaluate the proposed algorithm, performance is measured using the signal-to-distortion ratio (SDR), source-to-artifacts ratio (SAR) and source-to-interference ratio (SIR), which together measure the overall sound quality of the source separation. The MATLAB implementation of these measures can be found in [10].
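The figure of 175 log-frequency bins can be checked with a short calculation. The sketch assumes geometrically spaced bin centres starting at 50 Hz and counts whole bins below 8 kHz; that counting convention is an assumption, since the text does not state how the bin count was obtained.

```python
import math

f_min, f_max = 50.0, 8000.0   # Hz, frequency range stated in the text
bins_per_octave = 24

n_octaves = math.log2(f_max / f_min)               # about 7.32 octaves
n_bins = math.floor(bins_per_octave * n_octaves)   # whole bins that fit in the range

def centre_frequency(k):
    """Centre frequency of bin k under geometric spacing (k = 0 is f_min)."""
    return f_min * 2.0 ** (k / bins_per_octave)
```

Under this spacing, 24 bins span exactly one octave, so `centre_frequency(24)` is twice `centre_frequency(0)`.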

A. Effect of β

In this sub-section, we investigate the effect of β on the performance of the proposed algorithm. Figure 2 shows the average SDR values obtained for various values of β using the multiplicative update NMF2D algorithm. The value of β was varied from 0 to 2 in steps of 0.1. This range covers the Itakura-Saito (IS) divergence (β = 0), the Kullback-Leibler (KL) divergence (β = 1) and the least-squares (LS) distance (β = 2) of NMF2D. The average separation performance was obtained from the estimated SDR value of each source in a piano-trumpet mixture, thereby providing a measure of overall separation for each signal. As Figure 2 shows, the performance increases with β and reaches its peak at β = 0.9, where an average SDR of 13.5 dB per source is obtained. Performance then tails off as β increases from 0.9 to 2. This experiment suggests that β around 0.9 is an optimal value for audio separation, and this value is used in the experiment in the next sub-section.

B. Audio Source Separation Results

In this sub-section, we compare the audio source separation performance of the proposed sparse β-divergence NMF2D algorithm with that of the same algorithm without the sparsity constraint. We set β = 0.9 for both algorithms. The best value of the sparsity parameter λ was identified as 0.5 after conducting Monte-Carlo experiments over 50 independent realizations. L1-norm regularization is used to resolve the ambiguity by forcing all structure in H onto W. Figure 3 shows the separation results in the log-frequency spectrogram for both algorithms. Compared with the original sources in Figure 1, it is visually clear that separation by β-divergence NMF2D without sparseness, in Figure 3(A) and Figure 3(B), led to a poor result since the factorization still contains the mixed signal (indicated by the box-marked area). This is because the absence of the sparsity constraint leads to component ambiguity, i.e. a lack of uniqueness in the decomposition. In contrast, better performance is obtained when the decomposition into spectral bases and temporal codes is performed with the sparsity constraint.

Figure 1 Log-frequency spectrogram of (A) piano, (B) trumpet and (C) convolutive mixed signal.

Figure 2 Separation results for various values of β using β-divergence NMF2D.

From Table II, both β-divergence NMF2D algorithms provide good results in terms of SDR, SIR and SAR, with over 10 dB of SDR recorded for both methods. However, β-divergence NMF2D with the sparsity constraint is superior, with an average SDR improvement of 2.7 dB per source compared with the version without sparseness. In percentage terms, this translates to an average improvement of 20%. From this result, it can be inferred that the sparsity constraint has a significant effect on the separation performance.
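The 2.7 dB and 20% figures can be reproduced from the SDR values in Table II. The sketch assumes the percentage is taken relative to the mean SDR of the non-sparse algorithm, which is one reading consistent with the reported number.

```python
# SDR values in dB, taken from Table II
sdr_plain = {"piano": 14.2, "trumpet": 12.7}    # beta-NMF2D
sdr_sparse = {"piano": 16.8, "trumpet": 15.5}   # sparse beta-NMF2D

gains = {name: sdr_sparse[name] - sdr_plain[name] for name in sdr_plain}
mean_gain = sum(gains.values()) / len(gains)               # average improvement in dB
mean_baseline = sum(sdr_plain.values()) / len(sdr_plain)   # average non-sparse SDR in dB
relative_gain = 100.0 * mean_gain / mean_baseline          # improvement in percent
```

The per-source gains are about 2.6 dB for the piano and 2.8 dB for the trumpet, which average to the 2.7 dB quoted in the text.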


Figure 3 Separated sound in log-frequency spectrogram for β-divergence NMF2D: (A)-(B) piano and trumpet sound without sparsity, (C)-(D) piano and trumpet sound with sparsity.

V. CONCLUSION

The use of the β-divergence for audio source separation with the NMF2D model has been investigated. The β-divergence with β = 0.9 was found to produce the best result in our experiments. Furthermore, the proposed sparse β-divergence NMF2D is developed under a probabilistic framework, which enables sparseness to be incorporated in the solution. This significantly reduces the ambiguity problem in the factorization. We confirmed through an experiment that the proposed algorithm performs well in separating an audio mixture, achieving SDR values above 15 dB for both sources.

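TABLE II. SEPARATION RESULTS FOR NMF2D WITH BETA DIVERGENCE (ALL VALUES IN dB)

| Algorithms | Piano SDR | Piano SIR | Piano SAR | Trumpet SDR | Trumpet SIR | Trumpet SAR |
|---|---|---|---|---|---|---|
| β-NMF2D | 14.2 | 21.2 | 14.7 | 12.7 | 16.2 | 13.2 |
| Sparse β-NMF2D | 16.8 | 23.1 | 17.9 | 15.5 | 23.1 | 16.6 |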

ACKNOWLEDGMENT

The authors would like to thank Universiti Teknikal Malaysia Melaka for the research grant funding PJP/2012/FKEKK (15D)/S01127.

REFERENCES

[1] D. Lee and H. Seung, "Learning the parts of objects by nonnegative matrix factorisation," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[2] D. FitzGerald, "Automatic drum transcription and source separation," Ph.D. thesis, Dublin Institute of Technology, Dublin, Ireland, 2004.

[3] S. Xie, Z. Yang, and Y. Fu, "Nonnegative matrix factorization applied to nonlinear speech and image cryptosystems," IEEE Trans. on Circuits and Systems I, vol. 55, no. 8, pp. 2356-2367, Sep. 2008.

[4] I. Buciu, N. Nikolaidis, and I. Pitas, "Nonnegative matrix factorization in polynomial feature space," IEEE Trans. on Neural Networks, vol. 19, pp. 1090-1100, 2007.

[5] R. Kompass, "A generalized divergence measure for non-negative matrix factorization," in Neuroinformatics Workshop, Torun, Poland, Sept. 2005.

[6] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Computation, vol. 21, pp. 793–830, Mar. 2009.

[7] M. Morup and M. N. Schmidt, "Sparse nonnegative matrix factor 2-D deconvolution," Technical Report, Technical University of Denmark, Copenhagen, Denmark, 2006.

[8] J. C. Brown, "Calculation of a constant Q spectral transform," J. Acoust. Soc. Am., vol. 89, no. 1, pp. 425-434, 1991.

[9] D. L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, P. Divenyi, Ed. Norwell, MA: Kluwer, pp. 181-197, 2005.

[10] C. Févotte, R. Gribonval and E. Vincent, "BSS_EVAL Toolbox User Guide," IRISA Technical Report 1706, Rennes, France, April 2005. http://www.irisa.fr/metiss/bss_eval/.

