Audio Signal Separation Using Supervised NMF with Time-Variant All-Pole-Model-Based Basis Deformation

H. Nakajima (UTokyo), D. Kitamura (SOKENDAI), N. Takamune (UTokyo), S. Koyama (UTokyo), H. Saruwatari (UTokyo), Y. Takahashi (Yamaha R&D), K. Kondo (Yamaha R&D)

APSIPA2016 Organized Session on Advances in Acoustic Signal Processing



Page 2: Apsipa2016for ss

Nonnegative Matrix Factorization (NMF) [Lee, et al., 2001]

• Feature extraction based on low-rank representation

[Figure: the observed spectrogram 𝒀 (frequency × time, amplitude) is approximated by the product of a basis matrix 𝑭 and an activation matrix 𝑮]

𝒀 ≈ 𝑭𝑮

𝒀 : Observation (spectrogram)
𝑭 : Basis matrix (frequently appearing spectra)
𝑮 : Activation matrix (gain variation)
𝑓 : frequency bin
𝑡 : time frame
𝑘 : # of bases

The extracted bases can be used for informed source separation, e.g., music demixing and speech enhancement.
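The low-rank decomposition on this slide can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the generalized KL divergence as the cost and the standard multiplicative updates of Lee et al.; the function names `nmf_kl` and `kl_div`, the iteration count, and the random initialization are illustrative choices, not taken from the slides.

```python
import numpy as np

def kl_div(Y, X, eps=1e-12):
    """Generalized KL divergence D(Y || X) summed over all bins."""
    return float(np.sum(Y * np.log((Y + eps) / (X + eps)) - Y + X))

def nmf_kl(Y, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Factorize a nonnegative spectrogram Y (freq x time) as F @ G.

    F (freq x k) holds the bases, G (k x time) the activations.
    Both are updated with KL multiplicative rules, which keep them nonnegative.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    F = rng.random((n_freq, n_bases)) + eps
    G = rng.random((n_bases, n_time)) + eps
    for _ in range(n_iter):
        X = F @ G + eps
        F *= ((Y / X) @ G.T) / (G.sum(axis=1) + eps)           # basis update
        X = F @ G + eps
        G *= (F.T @ (Y / X)) / (F.sum(axis=0)[:, None] + eps)  # activation update
    return F, G
```

Calling `nmf_kl(Y, n_bases=k)` on an amplitude spectrogram returns a basis matrix whose columns play the role of the frequently appearing spectra, and `kl_div(Y, F @ G)` reports how well the product fits.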

Page 3: Apsipa2016for ss

Supervised NMF (SNMF) [Smaragdis, et al., 2007]

• Source separation using target-signal basis (supervision)

Training: the basis 𝑭 is trained using target-signal samples.

Separation: given the supervised basis 𝑭, the remaining parameters are estimated from the mixture 𝒀mix, which yields the separated spectrogram.
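The separation stage can be sketched as follows. This is a minimal sketch, assuming the signal model 𝒀mix ≈ 𝑭𝑮 + 𝑯𝑼 (the SNMF-ABD model later in this deck without the additive term 𝑫), a KL-divergence cost, and no penalty terms; `snmf_separate`, `kl_div`, and `n_interf` are hypothetical names introduced for illustration.

```python
import numpy as np

def kl_div(Y, X, eps=1e-12):
    """Generalized KL divergence D(Y || X) summed over all bins."""
    return float(np.sum(Y * np.log((Y + eps) / (X + eps)) - Y + X))

def snmf_separate(Y_mix, F, n_interf, n_iter=200, eps=1e-12, seed=0):
    """Fit Y_mix ~ F @ G + H @ U with the trained basis F kept fixed.

    G: activations of the target basis, H/U: basis and activations
    that absorb the interference. Returns (G, H, U).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y_mix.shape
    G = rng.random((F.shape[1], n_time)) + eps
    H = rng.random((n_freq, n_interf)) + eps
    U = rng.random((n_interf, n_time)) + eps
    for _ in range(n_iter):
        R = Y_mix / (F @ G + H @ U + eps)
        G *= (F.T @ R) / (F.sum(axis=0)[:, None] + eps)
        R = Y_mix / (F @ G + H @ U + eps)
        H *= (R @ U.T) / (U.sum(axis=1) + eps)
        R = Y_mix / (F @ G + H @ U + eps)
        U *= (H.T @ R) / (H.sum(axis=0)[:, None] + eps)
    return G, H, U
```

With the returned factors, `F @ G` is the estimate of the target spectrogram and `H @ U` the estimate of the interference.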

Page 4: Apsipa2016for ss

Objective of This Study

• Drawback of SNMF

→ Separation accuracy decreases when the trained basis differs from the target sound in the open data.

We propose a new algorithm that deforms the trained basis to make it fit the open data.

Page 5: Apsipa2016for ss

SNMF with Additive Basis Deformation (SNMF-ABD) [Kitamura, et al., 2013]

• Open-data adaptation by modifying supervised basis 𝑭 with additive term 𝑫

Signal model: 𝒀mix ≈ (𝑭 + 𝑫)𝑮 + 𝑯𝑼

• Many orthogonality-penalty parameters are needed, and they are hard to control.

• Strong sensitivity to initial values

Page 6: Apsipa2016for ss

SNMF with Time-Invariant Basis Deformation (TID) [Nakajima, et al., EUSIPCO2016]

[Flow diagram: Training → supervised basis 𝑭org (supervision) → Separation]

・Source separation and basis deformation are processed independently.
・Basis deformation is performed toward a target given by the generalized MMSE-STSA estimator [Breithaupt, et al., 2008].
・Iterative basis deformation

Page 7: Apsipa2016for ss

SNMF with Time-Invariant Basis Deformation (TID) [Nakajima, et al., EUSIPCO2016]

Processing flow:

・Training: supervised basis 𝑭org (supervision)
・Separation: the interference is obtained as 𝒀mix − 𝑭𝑮
・Generation of target by generalized MMSE-STSA estimator [Breithaupt, et al., 2008]: estimated target 𝒀 and binary mask 𝑰 (to extract convincing components of 𝒀)
・Basis deformation: 𝑰 ∘ 𝒀 ≈ 𝑰 ∘ (𝑨𝑭org𝑮), then 𝑭 ← 𝑨𝑭org
  𝑨 : Diagonal matrix with all-pole-model-based deformation

・Source separation and basis deformation are processed independently.
・Basis deformation is performed toward a target given by the generalized MMSE-STSA estimator.
・Iterative basis deformation

Hereafter, we propose an improved algorithm that introduces time variance.

Page 8: Apsipa2016for ss

Proposed Discriminative Time-Variant Deformation

① The supervised basis is classified into two parts to capture its time-variant nature.
② Excessive deformation is avoided by discriminative training.

[Flow diagram: Training (supervision 𝑭org) → Separation (interference 𝒀mix − 𝑭𝑮) → Generation of target by generalized MMSE-STSA estimator (estimated target 𝒀, binary mask 𝑰) → Basis deformation (𝑰 ∘ 𝒀 ≈ 𝑰 ∘ (𝑨𝑭org𝑮), 𝑭 ← 𝑨𝑭org)]

Page 9: Apsipa2016for ss

Proposed Discriminative Time-Variant Deformation

① The supervised basis is classified into two parts to capture its time-variant nature.
   Supervision: 𝑭org = [𝑭atk, 𝑭sus]
② Excessive deformation is avoided by discriminative training.
   Discriminative basis deformation considering interference: 𝑭 ← [𝑨𝑭atk, 𝑩𝑭sus]

[Flow diagram: Training → Separation (interference 𝒀mix − 𝑭𝑮) → Generation of target by generalized MMSE-STSA estimator (estimated target 𝒀, binary mask 𝑰, 𝑰 ∘ 𝒀 ≈ 𝑰 ∘ (𝑨𝑭org𝑮)) → Discriminative basis deformation considering interference]

Page 10: Apsipa2016for ss

Proposed ①: Time Variance in Instruments

• The physical mechanism differs between attack and sustain in musical instruments. [N. H. Fletcher, 1991]

Ex: Piano articulation (string and hammer): initial state → flip string (transitional) → free vibration

The basis deformation model should therefore change according to the physical mechanism of articulation.

Page 11: Apsipa2016for ss

Proposed ①: Basis Classification

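• Bases are classified according to how frequently each basis is used to generate the attack part and the sustain part.

• A different deformation model is applied to each basis group.

Procedure:

・Truncate the sustain part in the training sample and decompose the remaining attack part: (attack-part spectrogram) ≈ 𝑭org𝑮atk
・Truncate the attack part in the training sample and decompose the remaining sustain part: (sustain-part spectrogram) ≈ 𝑭org𝑮sus
・From 𝑮atk and 𝑮sus, obtain the frequency of the attack part and of the sustain part for each basis.
・Classify 𝑭org into 𝑭1 and 𝑭2 based on the k-means method.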
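The classification step can be sketched as follows. This is only a sketch under stated assumptions: the slide does not define how the "frequency" of each basis is measured, so the code assumes it is each basis's share of the total activation, uses the attack share ratio as a one-dimensional feature, and runs a plain two-cluster k-means on it. All function names are hypothetical.

```python
import numpy as np

def usage_per_basis(G):
    """Share of the total activation carried by each basis (rows of G)."""
    total = G.sum()
    return G.sum(axis=1) / total if total > 0 else np.zeros(G.shape[0])

def two_means(x, n_iter=50):
    """2-cluster k-means on a 1-D feature; True marks the higher-valued cluster."""
    centers = np.array([x.min(), x.max()], dtype=float)
    labels = np.zeros(x.shape[0], dtype=bool)
    for _ in range(n_iter):
        new_labels = np.abs(x - centers[1]) < np.abs(x - centers[0])
        if new_labels.any():
            centers[1] = x[new_labels].mean()
        if (~new_labels).any():
            centers[0] = x[~new_labels].mean()
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def classify_bases(F_org, G_atk, G_sus, eps=1e-12):
    """Split F_org into an attack group F1 and a sustain group F2.

    G_atk / G_sus are activations of F_org on the attack-only and
    sustain-only training data (assumed to be estimated beforehand).
    """
    u_atk = usage_per_basis(G_atk)
    u_sus = usage_per_basis(G_sus)
    attack_ratio = u_atk / (u_atk + u_sus + eps)
    is_attack = two_means(attack_ratio)
    return F_org[:, is_attack], F_org[:, ~is_attack], is_attack
```

Using a ratio rather than the raw usage keeps the feature independent of how loud a given basis is overall; other features (for example the two usages as a 2-D point) would work with the same clustering step.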

Page 12: Apsipa2016for ss

Proposed ①: Deformation Model

• We prepare different deformation models for attack and sustain.

𝑰 ∘ 𝒀 ≈ 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐)

𝒀 : Estimated target by generalized MMSE-STSA estimator
𝑰 : Binary mask for sampling convincing components
𝑭𝟏 : Supervised basis trained using attack part only
𝑭𝟐 : Supervised basis trained using sustain part only
𝑨 : Diagonal matrix with all-pole-model spectrum to deform 𝑭𝟏 (deformation parameter)
𝑩 : Diagonal matrix with all-pole-model spectrum to deform 𝑭𝟐 (deformation parameter)
𝑮𝟏, 𝑮𝟐 : Activation matrices corresponding to 𝑭𝟏, 𝑭𝟐
∘ : Hadamard product
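To make the role of 𝑨 and 𝑩 concrete, the sketch below builds the diagonal of such a matrix from all-pole coefficients and evaluates the deformed model. It assumes the textbook amplitude response of an all-pole filter, gain / |1 + Σ_p α_p e^{−jωp}|, sampled on equally spaced bins from 0 to π; the slides do not state the model order, the parameterization, or whether amplitude or power is used, so these are illustrative assumptions, as are the function names.

```python
import numpy as np

def all_pole_spectrum(alpha, n_freq, gain=1.0):
    """Amplitude response gain / |1 + sum_p alpha[p-1] * exp(-j*w*p)| on n_freq bins."""
    alpha = np.asarray(alpha, dtype=float)
    w = np.linspace(0.0, np.pi, n_freq)       # normalized angular frequency per bin
    p = np.arange(1, alpha.size + 1)          # filter taps 1..P
    denom = 1.0 + np.exp(-1j * np.outer(w, p)) @ alpha
    return gain / np.abs(denom)

def deformed_model(a, b, F1, G1, F2, G2):
    """A F1 G1 + B F2 G2 with A = diag(a), B = diag(b).

    Multiplying by a diagonal matrix from the left scales each frequency row,
    so the diagonal is applied by broadcasting instead of forming A and B.
    """
    return a[:, None] * (F1 @ G1) + b[:, None] * (F2 @ G2)
```

Because a few coefficients in `alpha` describe the whole diagonal, the deformation is a smooth spectral envelope rather than an independent gain per bin.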

Page 13: Apsipa2016for ss

Proposed ①: Parameter Update

• Cost function based on KL divergence

• Parameter update by auxiliary-function method
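The update equations themselves did not survive on this slide, so the following is a sketch under stated assumptions rather than the authors' derivation. It shows the masked generalized KL divergence 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ 𝑿) and the standard multiplicative update of the activations that an auxiliary-function argument yields for this cost when 𝑨 and 𝑩 are fixed. The update of the all-pole parameters is not reproduced here. Function names are hypothetical.

```python
import numpy as np

def masked_kl(Y, X, I, eps=1e-12):
    """Generalized KL divergence evaluated only on bins where the mask I is 1."""
    m = I.astype(bool)
    y, x = Y[m], X[m]
    return float(np.sum(y * np.log((y + eps) / (x + eps)) - y + x))

def update_activations(Y, I, a, b, F1, G1, F2, G2, eps=1e-12):
    """One multiplicative update of G1 and G2 for the masked KL cost.

    a and b are the diagonals of A and B and stay fixed. Updating G1 and G2
    from the same ratio R is equivalent to one update of the stacked
    activation matrix [G1; G2] for the stacked basis [A F1, B F2].
    """
    AF1, BF2 = a[:, None] * F1, b[:, None] * F2
    X = AF1 @ G1 + BF2 @ G2 + eps
    R = I * Y / X
    G1_new = G1 * (AF1.T @ R) / (AF1.T @ I + eps)
    G2_new = G2 * (BF2.T @ R) / (BF2.T @ I + eps)
    return G1_new, G2_new
```

Repeated calls to `update_activations` should not increase `masked_kl`, which is the property the auxiliary-function method is used to guarantee.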

Page 14: Apsipa2016for ss

Proposed ②:Discriminative Basis Deformation

• The large degree of freedom in 𝑨 and 𝑩 often allows the deformed basis to represent interference as well, which degrades separation accuracy.

• Discriminative deformation can mitigate such side effects.

Formulation as Bilevel Optimization

𝑨, 𝑩 = arg min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))   (fitness for target 𝒀 only)

subject to 𝑮𝟏, 𝑮𝟐 = arg min_{𝑮𝟏,𝑮𝟐,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 + 𝑯𝑼))   (fitness for mixture 𝒀mix)

In the mixture model, 𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 is the target component and 𝑯𝑼 is the interference component.

Owing to this cost, the target and interference components are modeled separately.

→ It becomes hard for 𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 to represent the interference component in 𝒀.

Unfortunately, this problem is hard to solve, so we propose an approximate algorithm.

Page 15: Apsipa2016for ss

Proposed ②:Approximated Algorithm

• Step 1: Initialization (the same as the conventional method)

min_{𝑨,𝑮𝟏,𝑩,𝑮𝟐} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))

• Step 2: Modeling of mixture 𝒀mix. Fixing the basis deformation matrices, we estimate the activations.

min_{𝑮𝟏,𝑮𝟐,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 + 𝑯𝑼))

• Step 3: Modeling of target 𝒀. Fixing the activation matrices, we estimate the deformation matrices.

min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))

By alternating Steps 2 and 3, we iteratively search for a set of deformation matrices that represent the target spectrogram in the vicinity of those that fit the mixture.
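The three steps can be sketched as an alternating loop. This is a simplified illustration, not the authors' implementation: the diagonals of 𝑨 and 𝑩 are treated as free nonnegative gains per frequency bin (`a`, `b`) instead of being constrained to an all-pole spectrum, all updates are plain KL multiplicative updates, and the iteration counts and function names are arbitrary assumptions.

```python
import numpy as np

EPS = 1e-12

def step_mixture(Ymix, I, a, b, F1, G1, F2, G2, H, U, n_iter=20):
    """Step 2: with a, b fixed, fit G1, G2, H, U to the masked mixture."""
    AF1, BF2 = a[:, None] * F1, b[:, None] * F2
    for _ in range(n_iter):
        R = I * Ymix / (AF1 @ G1 + BF2 @ G2 + H @ U + EPS)
        G1 = G1 * (AF1.T @ R) / (AF1.T @ I + EPS)
        G2 = G2 * (BF2.T @ R) / (BF2.T @ I + EPS)
        U = U * (H.T @ R) / (H.T @ I + EPS)
        R = I * Ymix / (AF1 @ G1 + BF2 @ G2 + H @ U + EPS)
        H = H * (R @ U.T) / (I @ U.T + EPS)
    return G1, G2, H, U

def step_target(Y, I, a, b, F1, G1, F2, G2, n_iter=20):
    """Step 3: with G1, G2 fixed, fit the per-bin gains a, b to the masked target."""
    P1, P2 = F1 @ G1, F2 @ G2
    for _ in range(n_iter):
        R = I * Y / (a[:, None] * P1 + b[:, None] * P2 + EPS)
        den_a, den_b = (I * P1).sum(axis=1), (I * P2).sum(axis=1)
        # Bins with no unmasked data keep their previous gain.
        a_new = np.where(den_a > 0, a * (R * P1).sum(axis=1) / (den_a + EPS), a)
        b_new = np.where(den_b > 0, b * (R * P2).sum(axis=1) / (den_b + EPS), b)
        a, b = a_new, b_new
    return a, b

def discriminative_deformation(Y, Ymix, I, F1, F2, n_interf, n_outer=10, seed=0):
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    a, b = np.ones(n_freq), np.ones(n_freq)
    G1 = rng.random((F1.shape[1], n_time)) + EPS
    G2 = rng.random((F2.shape[1], n_time)) + EPS
    H = rng.random((n_freq, n_interf)) + EPS
    U = rng.random((n_interf, n_time)) + EPS
    # Step 1: initialization on the target estimate Y only. Passing an all-zero
    # interference model makes step_mixture fit just G1 and G2.
    zero_H, zero_U = np.zeros((n_freq, 1)), np.zeros((1, n_time))
    for _ in range(n_outer):
        G1, G2, _, _ = step_mixture(Y, I, a, b, F1, G1, F2, G2, zero_H, zero_U)
        a, b = step_target(Y, I, a, b, F1, G1, F2, G2)
    # Steps 2 and 3, alternated.
    for _ in range(n_outer):
        G1, G2, H, U = step_mixture(Ymix, I, a, b, F1, G1, F2, G2, H, U)
        a, b = step_target(Y, I, a, b, F1, G1, F2, G2)
    return a, b, G1, G2, H, U
```

The design point the sketch tries to capture is that the activations are always estimated together with an explicit interference model 𝑯𝑼, so the deformation gains in Step 3 are fitted around activations that have not been asked to explain the interference.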


Page 19: Apsipa2016for ss

Experimental Evaluation: Condition

Instruments        : Oboe (Ob.), Piano (Pf.), Trombone (Tb.)
Training (MIDI)    : Garritan Professional Orchestra
Open target (MIDI) : Microsoft GS Wavetable SW Synth
Sampling freq.     : 44100 Hz
FFT length         : 4096 points (100 ms)
Shift length       : 512 points (15 ms)
# of bases         : Target: 100, Interference: 30
Truncation period for extraction of attack : 50 ms
Comparison         : Conventional methods (SNMF, SNMF-ABD, TID) and the proposed method
Evaluation score   : Signal-to-Distortion Ratio (SDR) [dB] (for evaluating total quality of separated signal)

• Different MIDI generators were used for training and open data.

• Source separation was performed on 2-sound mixtures using the supervised basis.
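As a rough illustration of what the evaluation score measures, the sketch below computes a simplified SDR as the energy ratio between the reference signal and the estimation error. This is an assumption made for illustration: the slides do not say which SDR implementation was used, and toolkit versions such as BSS Eval first decompose the error into interference and artifact terms, so their values will generally differ. `simple_sdr` is a hypothetical name.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: 10*log10(||reference||^2 / ||reference - estimate||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return float(10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(error**2) + eps)))
```

For example, an estimate equal to 0.9 times the reference leaves an error with 1% of the reference energy, which corresponds to 20 dB.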

Page 20: Apsipa2016for ss

Music Score Used in Experiment

・Training samples (Oboe, Piano, Trombone)

• 2-octave chromatic scale

・Open data (mixture) (Oboe, Piano, Trombone)

• Test song for NMF research [Kitamura, 2014]

Page 21: Apsipa2016for ss

Results 1: Example

Ex. Piano-sound extraction from mixture of oboe and piano

The proposed method achieves a better SDR than the conventional methods.

Page 22: Apsipa2016for ss

Results 2: Overall Evaluation

Task         SNMF [dB]   SNMF-ABD [dB]   TID [dB]   Proposed [dB]
Ob. & Pf.    6.7         8.1             6.7        7.0
Ob. & Tb.    2.4         2.6             2.8        2.9
Pf. & Ob.    4.1         3.6             5.2        6.1
Pf. & Tb.    3.1         3.2             4.5        4.5
Tb. & Ob.    0.7         0.2             2.4        2.8
Tb. & Pf.    2.9         2.6             3.9        4.4

“A & B” means the task of extracting A from a mixture of A and B.

SNMF-ABD: Basis deformation NMF in parallel with separation
TID: Time-invariant deformation NMF without considering interference

Page 23: Apsipa2016for ss

Results 2: Overall Evaluation


The proposed method scores at least as high as SNMF and TID in all combinations (it ties with TID for Pf. & Tb.).

SNMF-ABD wins in only one case (Ob. & Pf.) and loses in all the others.
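The observations above can be checked directly against the SDR table from the preceding slide. The snippet below only re-enters the published values and compares them; the helper names are illustrative.

```python
# SDR values [dB] copied from the "Results 2: Overall Evaluation" table.
SDR_DB = {
    "Ob. & Pf.": {"SNMF": 6.7, "SNMF-ABD": 8.1, "TID": 6.7, "Proposed": 7.0},
    "Ob. & Tb.": {"SNMF": 2.4, "SNMF-ABD": 2.6, "TID": 2.8, "Proposed": 2.9},
    "Pf. & Ob.": {"SNMF": 4.1, "SNMF-ABD": 3.6, "TID": 5.2, "Proposed": 6.1},
    "Pf. & Tb.": {"SNMF": 3.1, "SNMF-ABD": 3.2, "TID": 4.5, "Proposed": 4.5},
    "Tb. & Ob.": {"SNMF": 0.7, "SNMF-ABD": 0.2, "TID": 2.4, "Proposed": 2.8},
    "Tb. & Pf.": {"SNMF": 2.9, "SNMF-ABD": 2.6, "TID": 3.9, "Proposed": 4.4},
}

def tasks_where(method_a, method_b, strict=True):
    """Tasks in which method_a scores higher than (or, if not strict, at least) method_b."""
    return [
        task for task, row in SDR_DB.items()
        if (row[method_a] > row[method_b] if strict else row[method_a] >= row[method_b])
    ]

def mean_sdr(method):
    """Average SDR of one method over the six tasks."""
    return sum(row[method] for row in SDR_DB.values()) / len(SDR_DB)
```

`tasks_where("SNMF-ABD", "Proposed")` lists the single task where SNMF-ABD is ahead, and `mean_sdr` gives a one-number summary per method.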

Page 24: Apsipa2016for ss

Conclusion

• In this study, we propose a new advanced SNMF that includes time-variant (attack & sustain) deformation of the trained basis to make it fit the target sound.

• Also, to avoid excessive deformation, we propose a discriminative basis deformation. To solve the resulting bilevel optimization problem, we introduce an approximate algorithm.

• From the experimental results, it was confirmed that the proposed method outperforms the conventional methods in many cases.

Thank you for your attention!