Combining Different Levels of Features for Emotion Recognition in Speech

Tsing Hua University, Institute of Computer Science and Information Engineering · 9762511 李芝宇 · Professor 張智星


  • Slide 1
  • Slide 2
  • Slide 3
  • Introduction
  • Related Work
  • The Proposed Approaches
    ◦ Likelihood Combination (LC)
    ◦ Weighted Likelihood Combination (WLC)
    ◦ Raw Feature Combination (RFC)
    ◦ Partial Raw Feature Combination (PRFC)
  • Experiments
  • Conclusions
  • Future Work
  • Slide 4
  • Emotion Recognition
    ◦ We talk with different emotions, such as sadness, happiness, anger, etc.
    ◦ Goal: design an emotion recognition system with promising recognition performance.
    ◦ Applications: mostly used in human-computer interfaces and robot-human interaction.
  • Slide 5
  • D. N. Jiang and L. H. Cai, "Speech emotion classification with the combination of statistic features and temporal features," in IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1967-1970, June 2004.
  • B. Vlasenko et al., "Combining frame and turn-level information for robust recognition of emotions within speech," in Interspeech, 2007, pp. 2225-2228.
  • B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, Wendemuth, "Acoustic emotion recognition: A benchmark comparison of performances," in IEEE ASRU, 2009.
  • B. Schuller and G. Rigoll, "Timing levels in segment-based speech emotion recognition," in Interspeech, ICSLP, ISCA, pp. 1818-1821, 2006.
  • Author: database; timing level; classifier
    ◦ Jiang et al. (2004 ICME): not public (Chinese); utterance, syllable; GMM, HMM, MLP, WBC
    ◦ Chetouani et al. (2009 Cogn. Computing): EMODB (German), Aholab (Spanish); vowel, consonant; GMM, SVM
    ◦ Schuller et al. (2006 ICSLP): EMODB (German); utterance, 3-segment; SVM
    ◦ Vlasenko et al. (2007 INTER): EMODB (German), SUSAS (English); utterance, frame; GMM, SVM
  • Slide 6
  • [Figure: system block diagram. Blocks: corpus, training set, test set, feature extraction, feature selection, frame-level / segment-level / utterance-level features, features combination methods (LC / WLC / RFC / PRFC), classifier, model training, classified result.]
  • Slide 7
  • Preprocessing
    ◦ End-point detection
  • Feature extraction
    ◦ Frame-level (FL): 39 MFCCs, pitch, 12 LFPC
    ◦ Segment-level (SL): low-level descriptors (LLD) with functionals, $16 \times 2 \times 12 \times 3 = 384 \times 3 = 1152$
    ◦ Utterance-level (UL): $16 \times 2 \times 12 = 384$
    ◦ LLD (16 × 2): ZCR, RMS energy, F0, HNR, MFCC 1-12
    ◦ Functionals (12): mean, standard deviation, kurtosis, skewness; extremes: value, relative position, range; linear regression: offset, slope, MSE
    ◦ * OpenEAR feature extraction toolkit
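
The dimension counts above can be checked with a short sketch that computes the 12 listed functionals for one LLD contour. This is only an illustration: the function names are hypothetical, the ×2 factor is assumed here to mean each LLD plus its first-order difference (the slide does not say what the factor is), and the toolkit's exact definitions of each functional may differ.

```python
import math

def functionals(x):
    """Return 12 statistics of one low-level-descriptor contour x (list of floats)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    # Standardized third and fourth moments (set to 0.0 for a constant contour).
    skew = sum((v - mean) ** 3 for v in x) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum((v - mean) ** 4 for v in x) / (n * std ** 4) if std > 0 else 0.0
    x_min, x_max = min(x), max(x)
    # Relative positions of the extremes, as a fraction of the contour length.
    pos_min = x.index(x_min) / n
    pos_max = x.index(x_max) / n
    # Least-squares line x[t] ~ offset + slope * t.
    t_mean = (n - 1) / 2
    t_var = sum((t - t_mean) ** 2 for t in range(n))
    slope = (sum((t - t_mean) * (v - mean) for t, v in enumerate(x)) / t_var
             if t_var > 0 else 0.0)
    offset = mean - slope * t_mean
    mse = sum((v - (offset + slope * t)) ** 2 for t, v in enumerate(x)) / n
    return [mean, std, kurt, skew, x_min, x_max, pos_min, pos_max,
            x_max - x_min, offset, slope, mse]

def delta(x):
    """First-order difference of a contour (assumed source of the x2 factor)."""
    return [b - a for a, b in zip(x, x[1:])]

N_LLD = 16                              # ZCR, RMS energy, F0, HNR, MFCC 1-12
contour = [0.0, 1.0, 2.0, 3.0, 4.0]     # toy contour for a single LLD
per_lld = functionals(contour) + functionals(delta(contour))
utterance_dim = N_LLD * len(per_lld)    # 16 * 2 * 12
segment_dim = utterance_dim * 3         # three segments per utterance
print(utterance_dim, segment_dim)
```
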
  • Slide 8
  • Likelihood Combination (LC): use the likelihoods of each emotion as the input for the final classifier.
    ◦ Jiang et al. approach: utterance-level features → utterance-level GMMs → utterance-level likelihood of each emotion; syllable-level features → syllable-level HMMs → syllable-level likelihood of each emotion; both sets of likelihoods are the input to the final classifier.
    ◦ My proposed LC method: frame-level features → frame-level HMMs → frame-level likelihood of each emotion; segment-level features → segment-level GMMs → segment-level likelihood of each emotion; utterance-level features → utterance-level GMMs → utterance-level likelihood of each emotion; all three sets of likelihoods are the input to the final classifier.
  • Slide 9
  • Assume there are 4 kinds of emotion (an utterance with 3 segments).
    ◦ $a_i$ = the FL likelihood (LH) of the $i$-th emotion
    ◦ $b_{ij}$ = the SL LH of the $j$-th emotion for the $i$-th segment
    ◦ $c_i$ = the UL LH of the $i$-th emotion
    ◦ Input to the final classifier: $[a_1\ a_2\ a_3\ a_4\ \ b_{11}\ b_{12}\ b_{13}\ b_{14}\ \ b_{21}\ b_{22}\ b_{23}\ b_{24}\ \ b_{31}\ b_{32}\ b_{33}\ b_{34}\ \ c_1\ c_2\ c_3\ c_4]$
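
As a minimal sketch of how the LC input vector could be assembled from the $a_i$, $b_{ij}$ and $c_i$ values, the snippet below simply concatenates the three groups of likelihoods. The function name `lc_input` and the numeric values are made up for illustration; the final classifier itself is not shown.

```python
def lc_input(frame_lh, segment_lhs, utterance_lh):
    """Concatenate per-emotion likelihoods from all timing levels into one vector."""
    vector = list(frame_lh)
    for seg in segment_lhs:          # one list of per-emotion likelihoods per segment
        vector.extend(seg)
    vector.extend(utterance_lh)
    return vector

# Toy values for 4 emotions and 3 segments (illustrative only).
a = [0.2, 0.4, 0.1, 0.6]
b = [[0.8, 0.6, 0.3, 0.9],
     [0.7, 0.5, 0.2, 0.8],
     [0.6, 0.4, 0.1, 0.7]]
c = [0.1, 0.6, 0.4, 0.2]

x = lc_input(a, b, c)
print(len(x))  # 4 frame-level + 3 * 4 segment-level + 4 utterance-level values
```
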
  • Slide 10
  • Weighted Likelihood Combination (WLC): sum up different timing-level likelihoods with weights.
    ◦ Chetouani et al. approach: vowel-level (V-L) features → V-L GMMs → V-L LHs of each emotion; consonant-level (C-L) features → C-L GMMs → C-L LHs of each emotion; result = $w_1 \cdot L_{V} + w_2 \cdot L_{C}$.
    ◦ The proposed WLC method: frame-level features → frame-level HMMs → frame-level likelihoods of each emotion; segment-level features → segment-level GMMs → segment-level likelihoods of each emotion; utterance-level features → utterance-level GMMs → utterance-level likelihoods of each emotion.
    ◦ Result = $w_1 \cdot L_{FL} + w_2 \cdot L_{SL} + w_3 \cdot L_{UL}$, with $w_1 + w_2 + w_3 = 1$ and $w_1, w_2, w_3 \ge 0$.
  • Slide 11
  • Assume there are 4 kinds of emotion and 3 segments; the optimized weight set is $[w_1\ w_2\ w_3] = [0.1\ 0.6\ 0.3]$.
    ◦ Frame-level likelihood of each emotion: [0.2 0.4 0.1 0.6]
    ◦ Segment-level likelihood of each emotion: [0.8 0.6 0.3 0.9], [0.7 0.5 0.2 0.8], [0.6 0.4 0.1 0.7]; Avg. = [0.7 0.5 0.2 0.8]
    ◦ Utterance-level likelihood of each emotion: [0.1 0.6 0.4 0.2]
    ◦ $0.1 \times [0.2\ 0.4\ 0.1\ 0.6] + 0.6 \times [0.7\ 0.5\ 0.2\ 0.8] + 0.3 \times [0.1\ 0.6\ 0.4\ 0.2] = [0.47\ 0.52\ 0.25\ 0.6]$
    ◦ Since 0.6 is the max of the result, this utterance belongs to emotion #4.
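
The worked example can be reproduced with a few lines of code. This is a sketch rather than the author's implementation: the function name `wlc` is hypothetical, and it assumes the segment-level likelihoods are averaged over segments before weighting, as the "Avg." label on the slide suggests.

```python
def wlc(frame_lh, segment_lhs, utterance_lh, weights):
    """Weighted sum of per-emotion likelihoods; returns (scores, 1-based emotion index)."""
    n_seg = len(segment_lhs)
    # Average the segment-level likelihoods emotion by emotion.
    seg_avg = [sum(col) / n_seg for col in zip(*segment_lhs)]
    w1, w2, w3 = weights
    scores = [w1 * f + w2 * s + w3 * u
              for f, s, u in zip(frame_lh, seg_avg, utterance_lh)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores, best + 1

frame_lh = [0.2, 0.4, 0.1, 0.6]
segment_lhs = [[0.8, 0.6, 0.3, 0.9],
               [0.7, 0.5, 0.2, 0.8],
               [0.6, 0.4, 0.1, 0.7]]
utterance_lh = [0.1, 0.6, 0.4, 0.2]

scores, emotion = wlc(frame_lh, segment_lhs, utterance_lh, (0.1, 0.6, 0.3))
print([round(s, 2) for s in scores], emotion)
```
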
  • Slide 12
  • Raw Feature Combination (RFC): duplicate the segment-level and utterance-level features and concatenate them to the frame-level features.
    ◦ [Figure: the Schuller et al. approach compared with the proposed RFC method. Labels: frame-level feature, segment-level feature, utterance-level feature, input to the final classifier.]
  • Slide 13
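  • Assume there are 9 frames for an utterance (shown here with 3 segments of 3 frames each). Each column is one frame; the rows are stacked along the feature dimension and form the input to the GMM classifier.
    ◦ Frame-level: a1 a2 a3 a4 a5 a6 a7 a8 a9
    ◦ Segment-level: b1 b1 b1 b2 b2 b2 b3 b3 b3
    ◦ Utterance-level: c c c c c c c c c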
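
A minimal sketch of the duplication step, assuming equal-length segments (the slide does not say how frames are assigned to segments) and a hypothetical function name `rfc_frames`:

```python
def rfc_frames(frame_feats, segment_feats, utterance_feat):
    """Build one combined vector per frame by duplicating SL and UL features."""
    n_frames, n_seg = len(frame_feats), len(segment_feats)
    combined = []
    for t, a in enumerate(frame_feats):
        k = t * n_seg // n_frames    # segment containing frame t (equal-length split)
        combined.append(list(a) + list(segment_feats[k]) + list(utterance_feat))
    return combined

# Toy 1-dimensional features: 9 frames, 3 segments, 1 utterance vector.
frame_feats = [[float(t)] for t in range(9)]
segment_feats = [[10.0], [20.0], [30.0]]
utterance_feat = [99.0]

rows = rfc_frames(frame_feats, segment_feats, utterance_feat)
for row in rows:
    print(row)
```

Every frame carries the full segment-level and utterance-level vectors, so the classifier sees one high-dimensional vector per frame.
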
  • Slide 14
  • Partial Raw Feature Combination (PRFC): combine frame-level likelihoods with the segment/utterance-level features.
    ◦ [Figure: the Vlasenko et al. approach compared with the proposed PRFC method. Frame-level features pass through frame-level HMMs to give frame-level likelihoods of each emotion, which join the segment-level and utterance-level features as input to the final classifier.]
  • Slide 15
  • Frame-level features are transformed to utterance-level features: FL features → FL HMMs → FL likelihood.
    ◦ Input to the GMM classifier: the FL likelihood combined with the SL features (Segment 1, Segment 2, Segment 3 features) and the UL feature.
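
One way to picture the PRFC input is sketched below. The exact arrangement cannot be recovered from the slide, so this assumes one vector per segment, each made of the frame-level likelihoods, that segment's feature, and the utterance-level feature; `prfc_vectors` and all values are hypothetical.

```python
def prfc_vectors(frame_lh, segment_feats, utterance_feat):
    """Assumed layout: [FL likelihoods | segment feature | UL feature], one per segment."""
    return [list(frame_lh) + list(seg) + list(utterance_feat)
            for seg in segment_feats]

frame_lh = [0.2, 0.4, 0.1, 0.6]               # per-emotion likelihoods from the FL HMMs
segment_feats = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
utterance_feat = [9.0, 9.5]

vectors = prfc_vectors(frame_lh, segment_feats, utterance_feat)
print(len(vectors), len(vectors[0]))
```

Under this layout an utterance yields one vector per segment instead of one per frame, which is consistent with PRFC needing less computation than RFC.
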
  • Slide 16
  • Proposed method: timing level (columns: Frame-level (FL), Segment-level (SL), Utterance-level (UL))
    ◦ LC: UL, SL, UL
    ◦ WLC: UL, UL
    ◦ RFC: FL, FL
    ◦ PRFC: UL, SL; SL; SL
  • Slide 17
  • Emotional DB: EMODB (Berlin) / eNTERFACE05
    ◦ Language: German / English
    ◦ Emotions (# of sentences), EMODB: Anger (126), Boredom (80), Disgust (44), Anxiety (66), Happiness (70), Sadness (62), Neutral (78)
    ◦ Emotions (# of sentences), eNTERFACE05: Anger (210), Disgust (210), Anxiety (210), Happiness (210), Sadness (210), Surprise (210)
    ◦ # Subject (nation): 10 (5f/5m) (German) / 42 (8f/34m) (14 nations)
    ◦ Recorded situation: Acted
    ◦ Judge (#): Pass 80% (30) / Pass 100% (2)
    ◦ # All: 526 sentences / 1277 sentences
    ◦ Total length: 22 mins / 1 hr
  • Slide 18
  • Sequential Forward Feature Selection (SFFS)
    ◦ Start from an empty feature set and sequentially add one feature at a time to the selected set.
    ◦ Evaluate the performance of each set with different classifiers, then pick the best feature set.
  • Frame-level features (HMM classifier, 3 states, 8 components)
    ◦ Running SFFS on FL features is time-consuming.
    ◦ Adding any feature other than MFCC decreased accuracy.
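
The selection loop described on this slide can be sketched as a plain greedy forward search. The function name and the toy `score` function are hypothetical; in the real setting `score` would train and evaluate a classifier (HMM or GMM) on the given feature subset, which is what makes the search slow.

```python
def forward_selection(n_features, score):
    """Greedy forward selection; score(subset) returns accuracy for a tuple of indices."""
    selected, best_subset, best_score = [], (), float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Try adding each remaining feature and keep the one that scores best.
        cand, cand_score = max(
            ((f, score(tuple(selected + [f]))) for f in sorted(remaining)),
            key=lambda pair: pair[1])
        selected.append(cand)
        remaining.remove(cand)
        # Remember the best subset seen over the whole path.
        if cand_score > best_score:
            best_subset, best_score = tuple(selected), cand_score
    return best_subset, best_score

# Toy scoring: features 0 and 2 help, features 1 and 3 hurt (made-up gains).
gain = [0.3, -0.1, 0.2, -0.05]

def toy_score(subset):
    return sum(gain[f] for f in subset)

best_subset, best_score = forward_selection(len(gain), toy_score)
print(best_subset, best_score)
```
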
  • Slide 19
  • Segment-level features (GMM classifier, 32 components)
    ◦ 1-segment: reduce the dimension of the feature space from 384 (16 × 2 × 12) to 189
    ◦ 2-segment: reduce the dimension of the feature space from 384 to 253
    ◦ 3-segment: reduce the dimension of the feature space from 384 to 221
  • Utterance-level features (GMM classifier, 32 components)
    ◦ Reduce the dimension of the feature space from 384 to 230
  • Slide 20
  • Leave-one-speaker-out (LOSO) for EMODB
  • Four-fold cross-validation for eNTERFACE
    ◦ Folds 1 and 2: 2 females + 8 males each
    ◦ Folds 3 and 4: 2 females + 9 males each
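
A minimal sketch of leave-one-speaker-out splitting, where each speaker is held out once as the test set. The function name and the speaker labels are illustrative, not taken from the corpus.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices) for each speaker."""
    for spk in sorted(set(speaker_ids)):
        train = [i for i, s in enumerate(speaker_ids) if s != spk]
        test = [i for i, s in enumerate(speaker_ids) if s == spk]
        yield spk, train, test

# One speaker label per utterance (toy data with 3 speakers).
speakers = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_speaker_out(speakers))
for spk, train, test in folds:
    print(spk, train, test)
```

Because no speaker appears in both the training and test sets of a fold, the measured accuracy reflects speaker-independent performance.
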
  • Slide 21
  • Single-timing-level: FL, SL, UL
  • Dual-timing-level: FL+SL, FL+UL, SL+UL
  • Trio-timing-level: FL+SL+UL
  • Timing level: classifier; components (EMODB / eNTER)
    ◦ FL: HMM (3-state); 64 / 64
    ◦ SL, UL: GMM; 32 / 16
  • Slide 22
  • Likelihood Combination (final classifier: libSVM, linear kernel, c=1)
  • Slide 23
  • Likelihood Combination
    ◦ In EMODB, frame-level features have the best recognition rate among the three individual feature sets.
    ◦ Combining different timing-level features shows improvement.
    ◦ Segment-level features do not perform well in eNTERFACE.
  • Slide 24
  • Find the optimized weighting factors by brute force
    ◦ Dual-timing-level likelihoods: $w_1 + w_2 = 1$, tuning gap = 0.01, 100 iterations
    ◦ Trio-timing-level likelihoods: $w_1 + w_2 + w_3 = 1$, tuning gap = 0.01, 10000 iterations (100 × 100)
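
The brute-force search can be sketched as a grid enumeration over the weights with a step of 0.01. This is an assumption-laden illustration: `weight_grid` and `best_weights` are hypothetical names, the data are toy values, and the sketch visits only weight tuples that satisfy the sum-to-one constraint (including both endpoints), so its number of grid points differs from the iteration counts quoted on the slide.

```python
def weight_grid(n_levels, steps=100):
    """Yield all weight tuples on a 1/steps grid that are >= 0 and sum to 1."""
    def rec(k, left):
        if k == 1:
            yield (left,)
            return
        for i in range(left + 1):
            for rest in rec(k - 1, left - i):
                yield (i,) + rest
    for combo in rec(n_levels, steps):
        yield tuple(i / steps for i in combo)

def best_weights(level_lhs, labels, steps=100):
    """level_lhs[l][u] holds the per-emotion likelihoods of utterance u at level l."""
    best_w, best_acc = None, -1.0
    for w in weight_grid(len(level_lhs), steps):
        correct = 0
        for u, label in enumerate(labels):
            n_emotions = len(level_lhs[0][u])
            scores = [sum(w[l] * level_lhs[l][u][e] for l in range(len(w)))
                      for e in range(n_emotions)]
            correct += scores.index(max(scores)) == label
        acc = correct / len(labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy data: level 0 is reliable, level 1 is confidently wrong (2 utterances, 2 emotions).
level0 = [[0.6, 0.4], [0.4, 0.6]]
level1 = [[0.1, 0.9], [0.9, 0.1]]
labels = [0, 1]
w, acc = best_weights([level0, level1], labels)
print(w, acc)
```
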
  • Slide 25
  • Weighted Likelihood Combination (first classifier: HMM, GMM)
  • Slide 26
  • Weighting factor by timing-level combination
    ◦ FL+SL: EMODB 75.1%, weight set 0.73, 0.27; eNTER 64.2%, weight set 0.84, 0.16
    ◦ FL+UL: EMODB 74.2%, weight set 0.89, 0.11; eNTER 62.5%, weight set 0.63, 0.47
    ◦ SL+UL: EMODB 67.3%, weight set 0.62, 0.28; eNTER 60.9%, weight set 0.42, 0.58
    ◦ FL+SL+UL: EMODB 76.4%, weight set 0.61, 0.25, 0.14; eNTER 66.4%, weight set 0.86, 0.05, 0.09
  • Slide 27
  • Weighted Likelihood Combination
    ◦ Combining the 3 timing-level likelihoods did not show much improvement.
    ◦ Combining different timing levels with weighting factors did show improvement, but not much.
    ◦ The weighting factor for the frame-level likelihood is always the largest.
  • Slide 28
  • Find the optimal number of Gaussian mixture components for the final classifier (GMM, FL+SL+UL): 64 (EMODB) / 32 (eNTER)
  • Slide 29
  • Raw Feature Combination (final classifier: GMM, mix. comp. = 64/32)
  • Slide 30
  • Raw Feature Combination
    ◦ Preserves the most information from the different timing-level features.
    ◦ Too time-consuming (approx. 5 hr; LOSO, GMM, EMODB), so not efficient.
  • Slide 31
  • Find the optimal number of Gaussian mixture components for the final classifier (GMM, FL+SL+UL): 256 (EMODB) / 32 (eNTER)
  • Slide 32
  • Partial Raw Feature Combination (final classifier: GMM, mix. comp. = 256/32)
  • Slide 33
  • Partial Raw Feature Combination
    ◦ The feature dimension is lower than in RFC, so computation takes less time, but it is still long (approx. 3-4 hrs; LOSO, EMODB, GMM).
    ◦ Trio-timing-level features still perform better than single-timing-level features and dual-timing-level combinations.
  • Slide 34
  • Noise in eNTERFACE
    ◦ EMODB: "Der Lappen liegt auf dem Eisschrank." (The tablecloth is lying on the fridge.)
    ◦ eNTERFACE: "No, no, no! I need this money!"
  • Slide 35
  • The likelihood combination method shows the best recognition result.
  • Trio-timing-level features achieve the best recognition rate.
  • Different timing-level features can compensate for each other.
  • Frame-level features play the most important role.
  • Slide 36
  • Variable segment size
    ◦ Word? Syllable?
    ◦ An automatic segmentation method
  • Learning-to-rank approach
    ◦ Map the emotions to 2 dimensions
    ◦ Give a ranking label to each emotion
  • Slide 37
  • Thank you!