H. Nakajima (UTokyo), D. Kitamura (SOKENDAI),
N. Takamune (UTokyo), S. Koyama (UTokyo), H. Saruwatari (UTokyo),
Y. Takahashi (Yamaha R&D), K. Kondo (Yamaha R&D)
Audio Signal Separation Using Supervised NMF with
Time-Variant All-Pole-Model-Based Basis Deformation
APSIPA2016 Organized Session on Advances in Acoustic Signal Processing
Nonnegative Matrix Factorization (NMF) [Lee, et al., 2001]
• Feature extraction based on low-rank representation
[Figure: the observation spectrogram 𝒀 (frequency × time, amplitude) is approximated as 𝒀 ≈ 𝑭𝑮, where 𝑭 is the basis matrix (frequently appearing spectra) and 𝑮 is the activation matrix (gain variation over time). 𝑓: frequency bin, 𝑡: time frame, 𝑘: number of bases]
The extracted bases can be used for informed source separation, e.g., music demixing and speech enhancement.
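As a minimal sketch of the low-rank decomposition 𝒀 ≈ 𝑭𝑮, the following uses the well-known multiplicative update rules for the generalized KL divergence. The slide does not state which divergence is used at this point, so the KL choice, the function names `nmf_kl` and `kl_div`, and the toy data are assumptions for illustration only.

```python
import numpy as np

def nmf_kl(Y, k, n_iter=200, eps=1e-12, seed=0):
    """Factorize a nonnegative matrix Y (freq x time) as Y ~= F @ G."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    F = rng.random((n_freq, k)) + eps   # basis matrix (spectral patterns)
    G = rng.random((k, n_time)) + eps   # activation matrix (time-varying gains)
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Multiplicative updates keep F and G nonnegative.
        F *= ((Y / (F @ G + eps)) @ G.T) / (ones @ G.T + eps)
        G *= (F.T @ (Y / (F @ G + eps))) / (F.T @ ones + eps)
    return F, G

def kl_div(Y, X, eps=1e-12):
    """Generalized KL divergence between nonnegative matrices Y and X."""
    return float(np.sum(Y * np.log((Y + eps) / (X + eps)) - Y + X))

# Toy data (hypothetical): a rank-3 nonnegative "spectrogram".
rng = np.random.default_rng(1)
Y_toy = rng.random((16, 3)) @ rng.random((3, 40))
F_toy, G_toy = nmf_kl(Y_toy, k=3)
F_one, G_one = nmf_kl(Y_toy, k=3, n_iter=1)
```

Comparing `kl_div(Y_toy, F_toy @ G_toy)` with `kl_div(Y_toy, F_one @ G_one)` shows how the fit improves with more iterations from the same initialization.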
Supervised NMF (SNMF) [Smaragdis, et al., 2007]
• Source separation using target-signal basis (supervision)
Training: the basis is trained using target-signal samples.
Separation: the decomposition of the mixture 𝒀mix is estimated given the supervised basis, yielding the separated spectrogram.
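The separation stage can be sketched as follows, assuming the mixture model 𝒀mix ≈ 𝑭𝑮 + 𝑯𝑼 with the supervised basis 𝑭 held fixed and KL-based multiplicative updates. The function name `snmf_separate`, the use of 𝑭𝑮 directly as the separated spectrogram, and the toy data are assumptions, not details given on the slide.

```python
import numpy as np

def snmf_separate(Y_mix, F, n_other=30, n_iter=200, eps=1e-12, seed=0):
    """Fit Y_mix ~= F @ G + H @ U with F fixed; return target and interference parts."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y_mix.shape
    G = rng.random((F.shape[1], n_time)) + eps   # activation of the supervised basis
    H = rng.random((n_freq, n_other)) + eps      # basis for the other (interference) sound
    U = rng.random((n_other, n_time)) + eps      # activation of the other sound
    ones = np.ones_like(Y_mix)
    for _ in range(n_iter):
        R = Y_mix / (F @ G + H @ U + eps)
        G *= (F.T @ R) / (F.T @ ones + eps)
        R = Y_mix / (F @ G + H @ U + eps)
        H *= (R @ U.T) / (ones @ U.T + eps)
        R = Y_mix / (F @ G + H @ U + eps)
        U *= (H.T @ R) / (H.T @ ones + eps)
    return F @ G, H @ U

# Toy data (hypothetical): target built from a known basis plus a rank-2 interferer.
rng = np.random.default_rng(1)
F_true = rng.random((32, 4))
Y_mix_toy = F_true @ rng.random((4, 50)) + rng.random((32, 2)) @ rng.random((2, 50))
target_est, interf_est = snmf_separate(Y_mix_toy, F_true, n_other=2)
```

Because 𝑭 is never updated, any mismatch between the training sound and the target sound in the mixture cannot be absorbed by this model.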
Objective of This Study
• Drawback of SNMF
→ Accuracy decreases when the trained basis does not match the target sound in the mixture.
We propose a new algorithm that deforms the trained basis to fit the open data.
SNMF with Additive Basis Deformation (SNMF-ABD) [Kitamura, et al., 2013]
• Open-data adaptation by modifying supervised basis 𝑭 with additive term 𝑫
Signal model: 𝒀mix ≈ (𝑭 + 𝑫)𝑮 + 𝑯𝑼
• Many orthogonality penalty parameters are needed, but they are difficult to control.
• Strong sensitivity to initial values
SNMF with Time-Invariant Basis Deformation (TID) [Nakajima, et al., EUSIPCO2016]
・Source separation and basis deformation are processed independently.
・Basis deformation is performed toward a target given by the generalized MMSE-STSA estimator [Breithaupt, et al., 2008].
・Iterative basis deformation, to extract a convincing 𝒀
[Figure: block diagram of TID. Training: supervision 𝑭org. Separation: the interference 𝒀mix − 𝑭𝑮 is used by the generalized MMSE-STSA estimator to generate the estimated target 𝒀 and the binary mask 𝑰; basis deformation fits 𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭org𝑮) and updates the basis as 𝑭 ← 𝑨𝑭org]
𝑨 : Diagonal matrix with all-pole-model-based deformation
Hereafter we propose an improved algorithm introducing time variance.
Proposed Discriminative Time-Variant Deformation
① The supervised basis is classified into 2 parts to capture the time-variant nature.
② Excessive deformation is avoided by discriminative training.
[Figure: block diagram of the proposed method. Training: supervision 𝑭org = [𝑭atk, 𝑭sus]. Separation: the interference 𝒀mix − 𝑭𝑮 is used by the generalized MMSE-STSA estimator to generate the estimated target 𝒀 and the binary mask 𝑰 (shown with 𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭org𝑮)); ① the basis is updated as 𝑭 ← [𝑨𝑭atk, 𝑩𝑭sus]; ② discriminative basis deformation considering interference]
Proposed ①: Time Variance in Instruments
• The physical mechanism differs between attack and sustain in musical instruments. [N. H. Fletcher, 1991]
Ex: Piano articulation (hammer and string): initial state → flip string (transitional) → free vibration
→ The basis deformation model should be changed in accordance with the difference in the physical mechanism of articulation.
Proposed ①: Basis Classification
• Bases are classified according to how frequently each basis is used in the attack part and in the sustain part.
• In each basis group, we apply a different deformation model.
[Figure: the training sample with the sustain part truncated is decomposed as ≈ 𝑭org𝑮atk, giving the frequency of the attack part for each basis; the training sample with the attack part truncated is decomposed as ≈ 𝑭org𝑮sus, giving the frequency of the sustain part for each basis]
Classify 𝑭org into 𝑭1 and 𝑭2 based on the k-means method.
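The classification step can be sketched as a two-cluster k-means over one feature per basis. The slide does not define how "frequency" is measured, so the feature used here (each basis's share of total activation in the attack part), the function name `classify_bases`, and the toy activations are all assumptions.

```python
import numpy as np

def classify_bases(G_atk, G_sus, n_iter=50):
    """Split basis indices into an attack-like group and a sustain-like group."""
    # Hypothetical feature: share of each basis's activation that falls in the attack part.
    a = G_atk.sum(axis=1)
    s = G_sus.sum(axis=1)
    share = a / (a + s + 1e-12)                      # values in [0, 1]
    # 1-D k-means with k = 2, initialized at the extreme feature values.
    centers = np.array([share.min(), share.max()])
    labels = np.zeros(len(share), dtype=int)
    for _ in range(n_iter):
        labels = np.abs(share[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = share[labels == c].mean()
    atk_idx = np.flatnonzero(labels == 1)            # cluster with the larger attack share
    sus_idx = np.flatnonzero(labels == 0)
    return atk_idx, sus_idx

# Toy activations (hypothetical): bases 0-1 are active in the attack, bases 2-3 in the sustain.
G_atk_toy = np.array([[5.0, 4.0], [6.0, 5.0], [0.1, 0.2], [0.2, 0.1]])
G_sus_toy = np.array([[0.1, 0.1, 0.2], [0.2, 0.1, 0.1], [4.0, 5.0, 6.0], [5.0, 5.0, 4.0]])
atk_idx_toy, sus_idx_toy = classify_bases(G_atk_toy, G_sus_toy)
```

With index sets like these, 𝑭1 and 𝑭2 would be taken as the corresponding columns of 𝑭org.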
Proposed ①: Deformation Model
𝑰 ○ 𝒀 ≈ 𝑰 ○ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐)
𝒀 : Estimated target by generalized MMSE-STSA estimator
𝑰 : Binary mask for sampling convincing components
𝑭𝟏 : Supervised basis trained using attack part only
𝑭𝟐 : Supervised basis trained using sustain part only
𝑨 : Diagonal matrix with all-pole-model spectrum to deform 𝑭𝟏 (deformation parameter)
𝑩 : Diagonal matrix with all-pole-model spectrum to deform 𝑭𝟐 (deformation parameter)
𝑮𝟏, 𝑮𝟐 : Activation matrices corresponding to 𝑭𝟏, 𝑭𝟐
○ : Hadamard product
• We prepare different deformation models for attack and sustain.
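To illustrate what a diagonal all-pole deformation looks like, the sketch below builds the diagonals of 𝑨 and 𝑩 from autoregressive coefficients and evaluates the deformed model 𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐. The slide does not give the exact parameterization, so the amplitude-domain response 1 / |1 + Σ a_p e^{−jωp}|, the names `all_pole_gain` and `deformed_model`, and the coefficient values are assumptions.

```python
import numpy as np

def all_pole_gain(ar_coefs, n_freq, gain=1.0):
    """Amplitude response gain / |1 + sum_p a_p exp(-j w p)| on n_freq bins over [0, pi]."""
    w = np.linspace(0.0, np.pi, n_freq)
    p = np.arange(1, len(ar_coefs) + 1)
    denom = 1.0 + np.exp(-1j * np.outer(w, p)) @ np.asarray(ar_coefs, dtype=float)
    return gain / np.abs(denom)

def deformed_model(F1, F2, G1, G2, a_coefs, b_coefs):
    """Evaluate A F1 G1 + B F2 G2 with diagonal all-pole matrices A and B."""
    A = np.diag(all_pole_gain(a_coefs, F1.shape[0]))   # deforms the attack basis
    B = np.diag(all_pole_gain(b_coefs, F2.shape[0]))   # deforms the sustain basis
    return A @ F1 @ G1 + B @ F2 @ G2

# Toy example (hypothetical coefficients): a single real pole at 0.9 boosts low frequencies.
g_toy = all_pole_gain([-0.9], n_freq=8)
rng = np.random.default_rng(0)
X_toy = deformed_model(rng.random((8, 2)), rng.random((8, 3)),
                       rng.random((2, 5)), rng.random((3, 5)), [-0.9], [0.5])
```

A low-order all-pole model has few parameters per group, so it reshapes the spectral envelope of every basis in that group smoothly rather than bin by bin.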
Proposed ①: Parameter Update
• Cost function based on KL divergence
• Parameter update by the auxiliary-function method
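The update equations themselves did not survive extraction. As a hedged illustration of the general technique only, the sketch below shows a masked generalized KL cost and the standard multiplicative update for an activation matrix with a fixed basis, which is the kind of rule an auxiliary-function derivation yields. The names `masked_kl` and `update_activation` and the toy data are assumptions, and this is not the authors' update for 𝑨 and 𝑩.

```python
import numpy as np

def masked_kl(Y, X, I, eps=1e-12):
    """Generalized KL divergence between I*Y and I*X (I is a 0/1 mask)."""
    return float(np.sum(I * (Y * np.log((Y + eps) / (X + eps)) - Y + X)))

def update_activation(Y, I, W, G, eps=1e-12):
    """One multiplicative update of G for I*Y ~= I*(W @ G) with W fixed."""
    R = I * Y / (W @ G + eps)
    return G * (W.T @ R + eps) / (W.T @ I + eps)

# Toy data (hypothetical): random nonnegative target, mask, basis and activation.
rng = np.random.default_rng(0)
Y_toy = rng.random((10, 20))
I_toy = (rng.random((10, 20)) > 0.3).astype(float)
W_toy = rng.random((10, 4))
G_toy = rng.random((4, 20))
cost_before = masked_kl(Y_toy, W_toy @ G_toy, I_toy)
G_new = update_activation(Y_toy, I_toy, W_toy, G_toy)
cost_after = masked_kl(Y_toy, W_toy @ G_new, I_toy)
```

The masked entries (𝑰 = 0) contribute nothing to either the cost or the update, so only the components judged convincing drive the fit.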
Proposed ②: Discriminative Basis Deformation
• The large degree of freedom in 𝑨 and 𝑩 often allows the deformed basis to represent interference, which degrades separation accuracy.
• Discriminative deformation can mitigate such side effects.
Formulation as Bilevel Optimization
(𝑨, 𝑩) = arg min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐)) [fitness for target 𝒀 only]
subject to (𝑮𝟏, 𝑮𝟐) = arg min_{𝑮𝟏,𝑮𝟐,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 + 𝑯𝑼)) [fitness for mixture 𝒀mix]
In the mixture cost, 𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 is the target component and 𝑯𝑼 is the interference component.
Owing to this cost, the target and interference components are modeled separately.
→ It becomes hard for 𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 to represent the interference component in 𝒀.
Unfortunately, this problem is hard to solve, so we propose an approximate solver algorithm.
Proposed ②: Approximated Algorithm
• Step 1: Initialization (the same as the conventional one)
min_{𝑨,𝑮𝟏,𝑩,𝑮𝟐} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))
• Step 2: Modeling of mixture 𝒀mix. Fixing the basis deformation matrices, we estimate the activations.
min_{𝑮𝟏,𝑮𝟐,𝑯,𝑼} 𝐷(𝑰 ∘ 𝒀mix || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐 + 𝑯𝑼))
• Step 3: Modeling of target 𝒀. Fixing the activation matrices, we estimate the deformation matrices.
min_{𝑨,𝑩} 𝐷(𝑰 ∘ 𝒀 || 𝑰 ∘ (𝑨𝑭𝟏𝑮𝟏 + 𝑩𝑭𝟐𝑮𝟐))
We iteratively search for a set of deformation matrices that represents the target spectrogram in the vicinity of those that fit the mixture.
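The three steps above can be sketched as an alternating scheme. This is a simplified illustration, not the authors' solver: 𝑨 and 𝑩 are treated as free nonnegative diagonals (vectors `a`, `b`) instead of all-pole spectra, KL multiplicative updates are used throughout, and Steps 2 and 3 are assumed to alternate a fixed number of times. All function names and the toy data are hypothetical.

```python
import numpy as np

EPS = 1e-12

def _ratio(num, den):
    # Returns 1 where both terms vanish, so fully masked entries stay unchanged.
    return (num + EPS) / (den + EPS)

def step2_fit_mixture(Y_mix, I, a, b, F1, F2, G1, G2, H, U, n_iter=20):
    """Fix a, b; update G1, G2, H, U so that I*Y_mix ~= I*(A F1 G1 + B F2 G2 + H U)."""
    W1, W2 = a[:, None] * F1, b[:, None] * F2
    for _ in range(n_iter):
        R = I * Y_mix / (W1 @ G1 + W2 @ G2 + H @ U + EPS)
        G1 = G1 * _ratio(W1.T @ R, W1.T @ I)
        G2 = G2 * _ratio(W2.T @ R, W2.T @ I)
        U = U * _ratio(H.T @ R, H.T @ I)
        R = I * Y_mix / (W1 @ G1 + W2 @ G2 + H @ U + EPS)
        H = H * _ratio(R @ U.T, I @ U.T)
    return G1, G2, H, U

def step3_fit_target(Y, I, a, b, F1, F2, G1, G2, n_iter=20):
    """Fix G1, G2; update the diagonals a, b so that I*Y ~= I*(A F1 G1 + B F2 G2)."""
    P1, P2 = F1 @ G1, F2 @ G2
    for _ in range(n_iter):
        R = I * Y / (a[:, None] * P1 + b[:, None] * P2 + EPS)
        a, b = (a * _ratio((R * P1).sum(axis=1), (I * P1).sum(axis=1)),
                b * _ratio((R * P2).sum(axis=1), (I * P2).sum(axis=1)))
    return a, b

def discriminative_deformation(Y, Y_mix, I, F1, F2, n_interf=2, n_outer=10, seed=0):
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    a, b = np.ones(n_freq), np.ones(n_freq)
    G1 = rng.random((F1.shape[1], n_time)) + EPS
    G2 = rng.random((F2.shape[1], n_time)) + EPS
    H = rng.random((n_freq, n_interf)) + EPS
    U = rng.random((n_interf, n_time)) + EPS
    # Step 1: fit the masked target only (a zero interference basis disables the H U term).
    H0 = np.zeros_like(H)
    for _ in range(n_outer):
        G1, G2, _, _ = step2_fit_mixture(Y, I, a, b, F1, F2, G1, G2, H0, U, n_iter=5)
        a, b = step3_fit_target(Y, I, a, b, F1, F2, G1, G2, n_iter=5)
    # Steps 2 and 3, alternated (assumed schedule).
    for _ in range(n_outer):
        G1, G2, H, U = step2_fit_mixture(Y_mix, I, a, b, F1, F2, G1, G2, H, U)
        a, b = step3_fit_target(Y, I, a, b, F1, F2, G1, G2)
    return a, b, G1, G2, H, U

# Toy data (hypothetical): target from two basis groups plus a rank-2 interferer.
rng = np.random.default_rng(1)
F1_toy, F2_toy = rng.random((24, 3)), rng.random((24, 3))
Y_toy = F1_toy @ rng.random((3, 30)) + F2_toy @ rng.random((3, 30))
Y_mix_toy = Y_toy + rng.random((24, 2)) @ rng.random((2, 30))
I_toy = (rng.random((24, 30)) > 0.2).astype(float)
a_toy, b_toy, G1_toy, G2_toy, H_toy, U_toy = discriminative_deformation(
    Y_toy, Y_mix_toy, I_toy, F1_toy, F2_toy)
```

The point of the alternation is that the activations are always estimated with an explicit interference term 𝑯𝑼 present, so the deformation in Step 3 is fitted around activations that already leave the interference to 𝑯𝑼.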
Experimental Evaluation: Condition
Instruments: Oboe (Ob.), Piano (Pf.), Trombone (Tb.)
Training (MIDI): Garritan Professional Orchestra
Open target (MIDI): Microsoft GS Wavetable SW Synth
Sampling freq.: 44100 Hz
FFT length: 4096 points (100 ms)
Shift length: 512 points (15 ms)
# of bases: Target: 100, Interference: 30
Truncation period for extraction of attack: 50 ms
Comparison: Conventional methods (SNMF, SNMF-ABD, TID) and the proposed method
Evaluation score: Signal-to-Distortion Ratio (SDR) [dB] (for evaluating total quality of separated signal)
• Different MIDI generators were used for training and open data.
• Source separation for 2-sound mixture using supervised basis.
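For intuition about the evaluation score, a simplified energy-ratio form of SDR can be computed as below. This is an assumption for illustration: the slides do not say which SDR implementation was used, and standard toolkits compute SDR with an additional projection step that this sketch omits. The function name `simple_sdr_db` and the test signal are hypothetical.

```python
import numpy as np

def simple_sdr_db(reference, estimate, eps=1e-12):
    """10*log10(||reference||^2 / ||estimate - reference||^2), in dB."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    distortion = estimate - reference
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps))

# Toy signal (hypothetical): a 440 Hz tone at 44100 Hz and an estimate with a 10 % amplitude error.
t = np.arange(44100) / 44100.0
ref_toy = np.sin(2.0 * np.pi * 440.0 * t)
sdr_toy = simple_sdr_db(ref_toy, 1.1 * ref_toy)
```

A higher value means the separated signal is closer to the reference; a 10 % amplitude error alone corresponds to about 20 dB under this simplified definition.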
Music Score Used in Experiment
・Open data (mixture): scores for Oboe, Piano, Trombone
・Training samples: scores for Oboe, Piano, Trombone
• 2-octave chromatic scale
• Test song for NMF research [Kitamura, 2014]
Results 1: Example
Ex. Piano-sound extraction from mixture of oboe and piano
The proposed method achieves a better SDR than the conventional methods.
Results 2: Overall Evaluation
Task         SNMF [dB]   SNMF-ABD [dB]   TID [dB]   Proposed [dB]
Ob. & Pf.    6.7         8.1             6.7        7.0
Ob. & Tb.    2.4         2.6             2.8        2.9
Pf. & Ob.    4.1         3.6             5.2        6.1
Pf. & Tb.    3.1         3.2             4.5        4.5
Tb. & Ob.    0.7         0.2             2.4        2.8
Tb. & Pf.    2.9         2.6             3.9        4.4
“A & B” means the task of extracting “A” from a mixture of A and B.
SNMF-ABD: Basis deformation NMF in parallel with separation
TID: Time-invariant deformation NMF without considering interference
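The proposed method outperforms SNMF in all combinations and matches or outperforms TID in all combinations (equal to TID at 4.5 dB for Pf. & Tb.).
SNMF-ABD wins in only one case (Ob. & Pf.) and loses to the proposed method in the other cases.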
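The comparison can be checked directly from the SDR values in the table. The snippet below only re-uses the numbers reported on the slide; the variable names and the per-method average are additions for illustration.

```python
# SDR values [dB] copied from the results table: (SNMF, SNMF-ABD, TID, Proposed).
methods = ("SNMF", "SNMF-ABD", "TID", "Proposed")
scores = {
    "Ob. & Pf.": (6.7, 8.1, 6.7, 7.0),
    "Ob. & Tb.": (2.4, 2.6, 2.8, 2.9),
    "Pf. & Ob.": (4.1, 3.6, 5.2, 6.1),
    "Pf. & Tb.": (3.1, 3.2, 4.5, 4.5),
    "Tb. & Ob.": (0.7, 0.2, 2.4, 2.8),
    "Tb. & Pf.": (2.9, 2.6, 3.9, 4.4),
}

# Average SDR per method over the six tasks.
mean_sdr = {m: sum(row[i] for row in scores.values()) / len(scores)
            for i, m in enumerate(methods)}

# Tasks where the proposed method only ties with TID, and tasks where SNMF-ABD is better.
ties_with_tid = [task for task, row in scores.items() if row[3] == row[2]]
abd_wins = [task for task, row in scores.items() if row[1] > row[3]]
beats_snmf_everywhere = all(row[3] > row[0] for row in scores.values())

for m in methods:
    print(f"{m}: {mean_sdr[m]:.2f} dB")
```

Averaged over the six tasks, the ordering of the methods is the same as in the per-task comparison, with the proposed method highest.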
Conclusion
• In this study, we propose a new advanced SNMF that includes time-variant (attack & sustain) deformation of the trained basis to make it fit the target sound.
• Also, to avoid excessive deformation, we propose a discriminative basis deformation. To solve the resulting bilevel optimization problem, we introduce an approximate algorithm.
• From the experimental results, it was confirmed that the proposed method outperforms the conventional methods in many cases.
Thank you for your attention!