Apsipa2016for ss

H. Nakajima (UTokyo), D. Kitamura (SOKENDAI),

N. Takamune (UTokyo), S. Koyama (UTokyo), H. Saruwatari (UTokyo),

Y. Takahashi (Yamaha R&D), K. Kondo (Yamaha R&D)

Audio Signal Separation Using Supervised NMF with Time-Variant All-Pole-Model-Based Basis Deformation

APSIPA2016 Organized Session on Advances in Acoustic Signal Processing

Nonnegative Matrix Factorization (NMF) [Lee, et al., 2001]

• Feature extraction based on low-rank representation

𝒀 ≈ 𝑭𝑮

𝒀 : Observation (spectrogram)
𝑭 : Basis matrix (frequently appearing spectra)
𝑮 : Activation matrix (gain variation)
𝑓 : frequency bin, 𝑡 : time frame, 𝑘 : # of bases

[Figure: 𝒀, 𝑭, and 𝑮 shown with frequency, time, and amplitude axes]

The extracted bases can be used for informed source separation, e.g., music demixing and speech enhancement.

• Source separation using target-signal basis (supervision)

Supervised NMF (SNMF) [Smaragdis, et al., 2007]

Training: the basis is trained using target-signal samples.

Separation: given the supervised basis, the separated spectrogram is estimated from the mixture 𝒀mix.
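The two stages above can be sketched in NumPy as follows. This is only a minimal illustration, not the cited implementation: it assumes a KL-divergence cost with standard multiplicative updates and the mixture model 𝒀mix ≈ 𝑭𝑮 + 𝑯𝑼, where 𝑯 and 𝑼 are an interference basis and its activation. The function names `kl_nmf` and `snmf_separate` are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def kl_nmf(Y, k, n_iter=200, seed=0):
    """Training stage: learn basis F (freq x k) and activation G (k x time)."""
    rng = np.random.default_rng(seed)
    F = rng.random((Y.shape[0], k)) + EPS
    G = rng.random((k, Y.shape[1])) + EPS
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        F *= ((Y / (F @ G + EPS)) @ G.T) / (ones @ G.T + EPS)
        G *= (F.T @ (Y / (F @ G + EPS))) / (F.T @ ones + EPS)
    return F, G


def snmf_separate(Y_mix, F, k_interf, n_iter=200, seed=1):
    """Separation stage: F stays fixed; G, H, U are estimated from the mixture."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y_mix.shape
    G = rng.random((F.shape[1], n_time)) + EPS
    H = rng.random((n_freq, k_interf)) + EPS
    U = rng.random((k_interf, n_time)) + EPS
    ones = np.ones_like(Y_mix)
    for _ in range(n_iter):
        R = Y_mix / (F @ G + H @ U + EPS)
        G *= (F.T @ R) / (F.T @ ones + EPS)
        R = Y_mix / (F @ G + H @ U + EPS)
        H *= (R @ U.T) / (ones @ U.T + EPS)
        R = Y_mix / (F @ G + H @ U + EPS)
        U *= (H.T @ R) / (H.T @ ones + EPS)
    # Target estimate and interference estimate (both nonnegative spectrograms)
    return F @ G, H @ U
```

In this sketch the supervised basis is never updated during separation, which is exactly why a mismatch between training and open data hurts accuracy.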

Objective of This Study

• Drawback of SNMF

→ Separation accuracy decreases when the trained basis differs from the target sound in the open data.

We propose a new algorithm that deforms the trained basis so that it fits the open data.


SNMF with Additive Basis Deformation (SNMF-ABD) [Kitamura, et al., 2013]

• Open-data adaptation by modifying supervised basis 𝑭 with additive term 𝑫

Signal model: 𝒀mix ≈ (𝑭 + 𝑫)𝑮 + 𝑯𝑼

• Many orthogonality penalty parameters are needed, and they are hard to control.

• Strong sensitivity to the initial values


SNMF with Time-Invariant Basis Deformation (TID) [Nakajima, et al., EUSIPCO2016]

・Source separation and basis deformation are processed independently.
・Basis deformation is performed via a target given by the generalized MMSE-STSA estimator [Breithaupt, et al., 2008].
・Iterative basis deformation

[Block diagram]
Training: supervision 𝑭org
Separation: interference 𝒀mix − 𝑭𝑮
Generation of target by generalized MMSE-STSA estimator: estimated target 𝒀, binary mask 𝑰 (to extract convincing 𝒀)
Basis deformation: 𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭org𝑮), 𝑭 ← 𝑨𝑭org
𝑨 : Diagonal matrix with all-pole-model-based deformation

Hereafter we propose an improved algorithm introducing time variance.
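To make the deformation 𝑭 ← 𝑨𝑭org concrete, the sketch below builds a diagonal matrix from an all-pole spectrum and applies it to a basis. The slides do not give the exact parameterization, so the amplitude response gain / |1 + Σ c_p e^{−jωp}|, the names `all_pole_gain` and `deform_basis`, and the choice of evaluating it on equally spaced bins from 0 to π are all assumptions made for illustration.

```python
import numpy as np


def all_pole_gain(coeffs, n_freq, gain=1.0):
    """Amplitude response gain / |1 + sum_p c_p e^{-j w p}| on n_freq bins in [0, pi].

    Assumed parameterization; coeffs = [c_1, ..., c_P] are the all-pole coefficients.
    """
    a = np.concatenate(([1.0], np.asarray(coeffs, dtype=float)))
    n_fft = 2 * (n_freq - 1)  # an rfft of n_fft points yields n_freq bins
    denom = np.abs(np.fft.rfft(a, n=n_fft))
    return gain / denom


def deform_basis(F_org, coeffs, gain=1.0):
    """F <- A F_org, where A = diag(all-pole spectrum) scales each frequency row."""
    a = all_pole_gain(coeffs, F_org.shape[0], gain)
    return a[:, None] * F_org  # same as np.diag(a) @ F_org, without the dense matrix
```

With a single coefficient `[-0.5]`, for example, the gain is 2 at the lowest bin and 2/3 at the highest, so a handful of parameters imposes a smooth spectral tilt on every basis vector at once.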


Proposed Discriminative Time-Variant Deformation

① The supervised basis is classified into two parts to capture the time-variant nature.
② Excessive deformation is avoided by discriminative training.

[Block diagram]
Training: supervision 𝑭org = [𝑭atk, 𝑭sus]
Separation: interference, estimator
① 𝑭 ← [𝑨𝑭atk, 𝑩𝑭sus]
② Discriminative basis deformation considering interference

Proposed ①: Time Variance in Instruments

• The physical mechanism differs between attack and sustain in musical instruments. [N. H. Fletcher, 1991]

Ex: Piano articulation (hammer and string): initial state → flip string (transitional) → free vibration

The basis deformation model should be changed in accordance with the difference in the physical mechanism of articulation.

Proposed ①: Basis Classification

• Bases are classified according to how frequently each basis appears in the attack part and in the sustain part.

• A different deformation model is applied to each basis group.

Truncate sustain part in training sample (attack only): spectrogram ≈ 𝑭org𝑮atk → frequency of attack part for each basis

Truncate attack part in training sample (sustain only): spectrogram ≈ 𝑭org𝑮sus → frequency of sustain part for each basis

Classify 𝑭org into 𝑭1 and 𝑭2 based on the k-means method
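The classification step can be sketched as below. The slides do not define how the "frequency" of each basis is measured, so this sketch assumes it is the share of the total activation carried by each basis in 𝑮atk and 𝑮sus, and it uses a tiny hand-written 2-cluster k-means with a deterministic initialization. All function names are hypothetical.

```python
import numpy as np


def basis_usage(G):
    """Share of the total activation carried by each basis (sums to 1). Assumed measure."""
    usage = G.sum(axis=1)
    return usage / (usage.sum() + 1e-12)


def two_means(X, n_iter=50):
    """Tiny 2-cluster k-means; returns a 0/1 label per row of X."""
    # Deterministic init: the most attack-leaning and most sustain-leaning rows.
    score = X[:, 0] - X[:, 1]
    centers = np.stack([X[np.argmax(score)], X[np.argmin(score)]])
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels


def classify_bases(F_org, G_atk, G_sus):
    """Split F_org into F1 (attack-dominant) and F2 (sustain-dominant)."""
    X = np.column_stack([basis_usage(G_atk), basis_usage(G_sus)])
    labels = two_means(X)
    return F_org[:, labels == 0], F_org[:, labels == 1]
```

Here `G_atk` and `G_sus` are the activations obtained by fitting 𝑭org to the attack-only and sustain-only training spectrograms, respectively.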

Proposed ①: Deformation Model

• We prepare different deformation models for attack and sustain.

𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐)

𝒀 : Estimated target by generalized MMSE-STSA estimator
𝑰 : Binary mask for sampling convincing components
𝑭𝟏 : Supervised basis trained using attack part only
𝑭𝟐 : Supervised basis trained using sustain part only
𝑨 : Diagonal matrix with all-pole-model spectrum to deform 𝑭𝟏
𝑩 : Diagonal matrix with all-pole-model spectrum to deform 𝑭𝟐
𝑮𝟏, 𝑮𝟐 : Activation matrices corresponding to 𝑭𝟏, 𝑭𝟐
○ : Hadamard product

Deformation parameters: 𝑨, 𝑩

Proposed ①: Parameter Update

• Cost function: based on the KL divergence

• Parameter update: by the auxiliary-function method
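The update equations themselves did not survive in this text, so the following is only an illustrative sketch of the ingredients named above. It assumes the generalized KL divergence restricted to the bins selected by the binary mask 𝑰, and shows the standard multiplicative update of the activations 𝑮𝟏, 𝑮𝟐 with the deformed bases held fixed (this update is the one obtained from the usual auxiliary function for KL-NMF). The update of the all-pole parameters in 𝑨 and 𝑩 is not reproduced here. `W1` and `W2` stand for 𝑨𝑭𝟏 and 𝑩𝑭𝟐; the function names are hypothetical.

```python
import numpy as np

EPS = 1e-12


def masked_kl(Y, X_hat, I):
    """Generalized KL divergence D(I∘Y || I∘X_hat), summed over mask-selected bins."""
    Y, X_hat = Y + EPS, X_hat + EPS
    return float(np.sum(I * (Y * np.log(Y / X_hat) - Y + X_hat)))


def update_activations(Y, I, W1, W2, G1, G2):
    """One multiplicative update of G1, G2 for I∘Y ≈ I∘(W1 G1 + W2 G2), W1 and W2 fixed."""
    W = np.hstack([W1, W2])
    G = np.vstack([G1, G2])
    ratio = I * Y / (W @ G + EPS)
    G = G * (W.T @ ratio) / (W.T @ I + EPS)
    return G[: G1.shape[0]], G[G1.shape[0]:]
```

Repeated calls to `update_activations` should not increase `masked_kl`, which is the practical benefit of the auxiliary-function approach: no step size needs to be tuned.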

Proposed ②：Discriminative Basis Deformation

• The large degree of freedom in 𝑨 and 𝑩 often allows them to represent the interference as well, which degrades the separation accuracy.

• Discriminative deformation can mitigate such side effects.

Formulation as Bilevel Optimization

𝑨, 𝑩 = arg min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))   (fitness for target 𝒀 only)

subject to 𝑮𝟏, 𝑮𝟐 = arg min_{𝑮𝟏,𝑮𝟐,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 + 𝑯𝑼))   (fitness for mixture 𝒀mix)

In the mixture model, 𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 is the target component and 𝑯𝑼 is the interference component.

Owing to this cost, the target and interference components are modeled separately.

→ It is hard for 𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 to represent the interference component in 𝒀.

Unfortunately, this problem is hard to solve, so we propose an approximate algorithm.

Proposed ②：Approximated Algorithm

• Step 1: Initialization (the same as the conventional one)
  min_{𝑨,𝑮𝟏,𝑩,𝑮𝟐} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))

• Step 2: Modeling of mixture 𝒀mix
  min_{𝑮𝟏,𝑮𝟐,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 + 𝑯𝑼))
  Fixing the basis deformation matrices, we estimate the activations.

• Step 3: Modeling of target 𝒀
  min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))
  Fixing the activation matrices, we estimate the deformation matrices.

We iteratively search for a set of deformation matrices that represent the target spectrogram in the vicinity of those that fit the mixture.
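The three steps can be sketched as an alternating loop. To keep the example short, this sketch makes a deliberate simplification that is not the method on the slides: 𝑨 and 𝑩 are treated as unconstrained nonnegative per-frequency gains `a` and `b` updated multiplicatively, instead of all-pole-model spectra. The iteration counts, the random initialization, and all function names are likewise assumptions.

```python
import numpy as np

EPS = 1e-12


def mu_activations(Y, I, W, G):
    """Multiplicative KL update of G for I∘Y ≈ I∘(W G), with W fixed."""
    return G * (W.T @ (I * Y / (W @ G + EPS))) / (W.T @ I + EPS)


def mu_gains(Y, I, a, b, P, Q):
    """Multiplicative KL update of gains a, b for I∘Y ≈ I∘(a*P + b*Q), P and Q fixed."""
    R = I * Y / (a[:, None] * P + b[:, None] * Q + EPS)
    a_new = a * (R * P).sum(axis=1) / ((I * P).sum(axis=1) + EPS)
    b_new = b * (R * Q).sum(axis=1) / ((I * Q).sum(axis=1) + EPS)
    return a_new, b_new


def discriminative_deformation(Y, Y_mix, I, F1, F2, k_interf,
                               n_outer=20, n_inner=10, seed=0):
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    k1 = F1.shape[1]
    a, b = np.ones(n_freq), np.ones(n_freq)  # diagonals of A and B (simplified)
    G = rng.random((k1 + F2.shape[1], n_time)) + EPS  # stacked [G1; G2]
    H = rng.random((n_freq, k_interf)) + EPS
    U = rng.random((k_interf, n_time)) + EPS

    def target_basis():
        return np.hstack([a[:, None] * F1, b[:, None] * F2])

    # Step 1: initialization, fit a, b, G1, G2 to the estimated target Y only.
    for _ in range(n_inner):
        G = mu_activations(Y, I, target_basis(), G)
        a, b = mu_gains(Y, I, a, b, F1 @ G[:k1], F2 @ G[k1:])

    for _ in range(n_outer):
        # Step 2: a, b fixed; fit G1, G2, H, U to the mixture Y_mix.
        for _ in range(n_inner):
            W = np.hstack([target_basis(), H])
            GU = mu_activations(Y_mix, I, W, np.vstack([G, U]))
            G, U = GU[: G.shape[0]], GU[G.shape[0]:]
            X = target_basis() @ G + H @ U
            H = H * ((I * Y_mix / (X + EPS)) @ U.T) / (I @ U.T + EPS)
        # Step 3: G1, G2 fixed; fit a, b to the estimated target Y.
        for _ in range(n_inner):
            a, b = mu_gains(Y, I, a, b, F1 @ G[:k1], F2 @ G[k1:])
    return a, b, G, H, U
```

Because Step 3 starts from activations that were fitted to the mixture together with 𝑯𝑼, the gains are steered toward explaining only the part of 𝒀 that the target bases already account for, which is the intent of the discriminative formulation.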











Experimental Evaluation: Condition

Instruments : Oboe (Ob.), Piano (Pf.), Trombone (Tb.)
Training (MIDI) : Garritan Professional Orchestra
Open target (MIDI) : Microsoft GS Wavetable SW Synth
Sampling freq. : 44100 Hz
FFT length : 4096 points (100 ms)
Shift length : 512 points (15 ms)
# of bases : Target: 100, Interference: 30
Truncation period for extraction of attack : 50 ms
Comparison : Conventional methods (SNMF, SNMF-ABD, TID) and proposed method
Evaluation score : Signal-to-Distortion Ratio (SDR) [dB] (for evaluating total quality of separated signal)

• Different MIDI generators were used for training and open data.

• Source separation for 2-sound mixture using supervised basis.

Music Score Used in Experiment

・Training samples (oboe, piano, trombone): 2-octave chromatic scale

・Open data (mixture; oboe, piano, trombone): test song for NMF research [Kitamura, 2014]

[Figure: musical scores of the training samples and the open data for each instrument]

Results 1: Example

Ex. Piano-sound extraction from mixture of oboe and piano

The proposed method achieves a better SDR than the conventional methods.

Results 2: Overall Evaluation

Task        SNMF [dB]   SNMF-ABD [dB]   TID [dB]   Proposed [dB]
Ob. & Pf.   6.7         8.1             6.7        7.0
Ob. & Tb.   2.4         2.6             2.8        2.9
Pf. & Ob.   4.1         3.6             5.2        6.1
Pf. & Tb.   3.1         3.2             4.5        4.5
Tb. & Ob.   0.7         0.2             2.4        2.8
Tb. & Pf.   2.9         2.6             3.9        4.4

“A & B” means the task of extracting A from a mixture of A and B.

SNMF-ABD: Basis deformation NMF in parallel with separation
TID: Time-invariant deformation NMF without considering interference


The proposed method matches or outperforms SNMF and TID in all combinations (it ties with TID for Pf. & Tb.).

SNMF-ABD wins in only one case (Ob. & Pf.) and loses in all the other cases.
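As a quick arithmetic check of these statements, the SDR values reported in the table can be averaged and compared per task. The numbers are copied from the table; the averaging itself is an addition for illustration and is not part of the original evaluation.

```python
# SDR [dB] per task: (SNMF, SNMF-ABD, TID, Proposed), copied from the results table.
sdr = {
    "Ob. & Pf.": (6.7, 8.1, 6.7, 7.0),
    "Ob. & Tb.": (2.4, 2.6, 2.8, 2.9),
    "Pf. & Ob.": (4.1, 3.6, 5.2, 6.1),
    "Pf. & Tb.": (3.1, 3.2, 4.5, 4.5),
    "Tb. & Ob.": (0.7, 0.2, 2.4, 2.8),
    "Tb. & Pf.": (2.9, 2.6, 3.9, 4.4),
}
methods = ("SNMF", "SNMF-ABD", "TID", "Proposed")

# Mean SDR over the six tasks for each method.
mean_sdr = {
    m: sum(row[i] for row in sdr.values()) / len(sdr)
    for i, m in enumerate(methods)
}

# Number of tasks where the proposed method is at least as good / strictly better.
at_least = {
    m: sum(row[3] >= row[i] for row in sdr.values())
    for i, m in enumerate(methods[:3])
}
strictly = {
    m: sum(row[3] > row[i] for row in sdr.values())
    for i, m in enumerate(methods[:3])
}

for m in methods:
    print(f"{m}: mean SDR {mean_sdr[m]:.2f} dB")
```

The averages come out at roughly 3.3 dB (SNMF), 3.4 dB (SNMF-ABD), 4.3 dB (TID), and 4.6 dB (proposed), with the proposed method strictly better than TID in five of six tasks and tied in one.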

Conclusion

• In this study, we propose a new advanced SNMF that includes time-variant (attack & sustain) deformation of the trained basis to make it fit the target sound.

• Also, to avoid excessive deformation, we propose a discriminative basis deformation. To solve the resulting bilevel optimization problem, we introduce an approximate algorithm.

• From the experimental results, it was confirmed that the proposed method outperforms the conventional methods in many cases.

Thank you for your attention!
