Institute of Computer Science, National Tsing Hua University
9762511 李芝宇
Prof. 張智星


2 / 35 Outline
Introduction
Related Work
The Proposed Approaches
◦ Likelihood Combination (LC)
◦ Weighted Likelihood Combination (WLC)
◦ Raw Feature Combination (RFC)
◦ Partial Raw Feature Combination (PRFC)
Experiments
Conclusions
Future Work

3 / 35 Emotion Recognition
We talk with different emotions, such as sadness, happiness, anger, etc.
Goal
◦ Design an emotion recognition system with a promising recognition performance
Applications
◦ Mostly used in human-computer interfaces and human-robot interaction

4 / 35
Author | Database | Timing level | Classifier
Jiang et al. (2004 ICME) | Not public (Chinese) | Utterance, Syllable | GMM, HMM, MLP, WBC
Chetouani et al. (2009 Cogn. Computing) | EMODB (German), Aholab (Spanish) | Vowel, Consonant | GMM, SVM
Schuller et al. (2006 ICSLP) | EMODB (German) | Utterance, 3-Segment | SVM
Vlasenko et al. (2007 Interspeech) | EMODB (German), SUSAS (English) | Utterance, Frame | GMM, SVM
References
◦ D. N. Jiang and L. H. Cai, "Speech emotion classification with the combination of statistic features and temporal features," in IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1967-1970, June 2004.
◦ B. Vlasenko et al., "Combining frame and turn-level information for robust recognition of emotions within speech," in Interspeech, 2007, pp. 2225-2228.
◦ B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, and A. Wendemuth, "Acoustic emotion recognition: A benchmark comparison of performances," in IEEE ASRU, 2009.
◦ B. Schuller and G. Rigoll, "Timing levels in segment-based speech emotion recognition," in Interspeech (ICSLP), ISCA, pp. 1818-1821, 2006.

5 / 35
[Figure: system flowchart. Blocks: corpus (training set, test set); feature extraction (frame-level, segment-level, and utterance-level features); feature selection; feature combination methods (LC / WLC / RFC / PRFC); classifier (model training); classified result.]

6 / 35 Preprocessing
◦ End-point detection
Feature extraction
◦ Frame-level (FL): 39 MFCCs, pitch, 12 LFPCs
◦ Segment-level (SL): low-level descriptors (LLDs) with functionals, 16 × 2 × 12 × 3 = 384 × 3 = 1152
◦ Utterance-level (UL): 16 × 2 × 12 = 384
LLDs (16 × 2: each descriptor and its delta)
◦ ZCR
◦ RMS energy
◦ F0
◦ HNR
◦ MFCC 1-12
Functionals (12)
◦ mean
◦ standard deviation
◦ kurtosis, skewness
◦ extremes: value, relative position, range
◦ linear regression: offset, slope, MSE
* Features are extracted with the OpenEAR feature extraction toolkit
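The dimension count 16 × 2 × 12 = 384 can be illustrated with a small sketch that applies 12 functionals to one LLD contour. The function name `functionals` and the exact definitions used here (population moments, positions relative to the contour length, an ordinary least-squares line) are assumptions for illustration and are not taken from OpenEAR.

```python
import math

def functionals(contour):
    """Return 12 statistics for one LLD contour (a list of floats, length >= 2)."""
    n = len(contour)
    mean = sum(contour) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in contour) / n)
    # Standardized 3rd and 4th moments; set to 0 for a constant contour.
    skew = sum((x - mean) ** 3 for x in contour) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum((x - mean) ** 4 for x in contour) / (n * std ** 4) if std > 0 else 0.0
    lo, hi = min(contour), max(contour)
    lo_pos = contour.index(lo) / (n - 1)  # relative position in [0, 1]
    hi_pos = contour.index(hi) / (n - 1)
    # Least-squares line y = offset + slope * t over t = 0 .. n-1.
    t_mean = (n - 1) / 2
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (x - mean) for t, x in enumerate(contour)) / s_tt
    offset = mean - slope * t_mean
    mse = sum((x - (offset + slope * t)) ** 2 for t, x in enumerate(contour)) / n
    return [mean, std, kurt, skew, lo, hi, lo_pos, hi_pos, hi - lo, offset, slope, mse]

NUM_LLD = 16 * 2  # 16 descriptors plus their deltas
stats = functionals([0.0, 1.0, 2.0, 3.0])  # toy contour: a straight ramp
print(len(stats) * NUM_LLD)  # 384, the utterance-level dimension
```

Applying the same function to each of the 3 segments of an utterance would give the 384 × 3 = 1152 segment-level values.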

7 / 35 Use the likelihoods of each emotion as the input for the final classifier
[Figure: two block diagrams.
Jiang et al. approach: syllable-level features → syllable-level HMMs → syllable-level likelihood of each emotion.
The proposed LC method: frame-level features → frame-level HMMs → frame-level likelihood of each emotion; segment-level features → segment-level GMMs → segment-level likelihood of each emotion; utterance-level features → utterance-level GMMs → utterance-level likelihood of each emotion. All likelihoods together form the input to the final classifier.]

8 / 35 Assume there are 4 kinds of emotion
Input to the final classifier for an utterance:
[ $a_1\ a_2\ a_3\ a_4$ | $b_{11}\ b_{12}\ b_{13}\ b_{14}$ | $b_{21}\ b_{22}\ b_{23}\ b_{24}$ | $b_{31}\ b_{32}\ b_{33}\ b_{34}$ | $c_1\ c_2\ c_3\ c_4$ ]
◦ $a_i$ = the frame-level (FL) likelihood of the $i$-th emotion
◦ $b_{ij}$ = the segment-level (SL) likelihood of the $j$-th emotion for the $i$-th segment
◦ $c_i$ = the utterance-level (UL) likelihood of the $i$-th emotion
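A minimal sketch of how the LC input vector could be assembled, assuming 4 emotions and 3 segments as in the example above. The helper name `lc_input` and the likelihood values are illustrative placeholders, not results from the experiments.

```python
def lc_input(fl, sl_segments, ul):
    """Concatenate per-emotion likelihoods from all timing levels into one vector."""
    vector = list(fl)
    for seg in sl_segments:  # one likelihood vector per segment
        vector.extend(seg)
    vector.extend(ul)
    return vector

# Illustrative likelihoods for 4 emotions and 3 segments (made-up values).
a = [0.2, 0.4, 0.1, 0.6]
b = [[0.8, 0.6, 0.3, 0.9], [0.7, 0.5, 0.2, 0.8], [0.6, 0.4, 0.1, 0.7]]
c = [0.1, 0.6, 0.4, 0.2]
x = lc_input(a, b, c)
print(len(x))  # 4 + 3 * 4 + 4 = 20
```

The resulting 20-dimensional vector is what a final classifier would be trained on, one vector per utterance.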

9 / 35 Sum up different timing-level likelihoods with weights
Result = $w_1 L_{FL} + w_2 L_{SL} + w_3 L_{UL}$, where $w_1 + w_2 + w_3 = 1$ and $w_1, w_2, w_3 \ge 0$
[Figure: two block diagrams.
Chetouani et al. approach: vowel-level (V-L) features → V-L GMMs → V-L likelihoods (LHs) of each emotion; consonant-level (C-L) features → C-L GMMs → C-L LHs of each emotion; the two are combined with weights $w_1$ and $w_2$.
The proposed WLC method: frame-level features → frame-level HMMs → frame-level likelihoods of each emotion; segment-level features → segment-level GMMs → segment-level likelihoods of each emotion; utterance-level features → utterance-level GMMs → utterance-level likelihoods of each emotion; the three are summed with weights $w_1$, $w_2$, $w_3$ to give the result.]

10 / 35 Assume there are 4 kinds of emotion and the optimized weight set is $[w_1\ w_2\ w_3] = [0.1\ 0.6\ 0.3]$
◦ Frame-level likelihood of each emotion: [0.2 0.4 0.1 0.6]
◦ Segment-level likelihood of each emotion: [0.8 0.6 0.3 0.9], [0.7 0.5 0.2 0.8], [0.6 0.4 0.1 0.7]; average: [0.7 0.5 0.2 0.8]
◦ Utterance-level likelihood of each emotion: [0.1 0.6 0.4 0.2]
$w_1$ × [0.2 0.4 0.1 0.6] + $w_2$ × [0.7 0.5 0.2 0.8] + $w_3$ × [0.1 0.6 0.4 0.2] = [0.47 0.52 0.25 0.6]
Since 0.6 is the maximum of the result, this utterance belongs to emotion #4
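The arithmetic of this example can be checked with a short sketch. The function name `weighted_combination` is hypothetical; the numbers are the ones on the slide, with the three segment-level vectors averaged before weighting.

```python
def weighted_combination(fl, sl_segments, ul, weights):
    """Weighted sum of FL, averaged SL, and UL likelihoods, per emotion."""
    n_emotions = len(fl)
    sl_avg = [sum(seg[j] for seg in sl_segments) / len(sl_segments)
              for j in range(n_emotions)]
    w1, w2, w3 = weights
    return [w1 * fl[j] + w2 * sl_avg[j] + w3 * ul[j] for j in range(n_emotions)]

fl = [0.2, 0.4, 0.1, 0.6]
sl = [[0.8, 0.6, 0.3, 0.9], [0.7, 0.5, 0.2, 0.8], [0.6, 0.4, 0.1, 0.7]]
ul = [0.1, 0.6, 0.4, 0.2]
scores = weighted_combination(fl, sl, ul, (0.1, 0.6, 0.3))
# 1-based index of the emotion with the highest combined score.
best = max(range(len(scores)), key=scores.__getitem__) + 1
print([round(s, 2) for s in scores], best)  # [0.47, 0.52, 0.25, 0.6] 4
```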

11 / 35 Duplicate segment-level and utterance-level features and concatenate them to the frame-level features
[Figure: two block diagrams.
Schuller et al. approach: segment-level feature → input to the final classifier.
The proposed RFC method: frame-level feature + segment-level feature + utterance-level feature → input to the final classifier.]

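12 / 35 Assume there are 9 frames for an utterance
Input to the GMM classifier (one column per frame; the rows are stacked along the feature dimension):
Frame | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
FL feature | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9
SL feature | b1 | b1 | b1 | b2 | b2 | b2 | b3 | b3 | b3
UL feature | c | c | c | c | c | c | c | c | c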
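A minimal sketch of the duplication step for the 9-frame example. The function name `rfc_matrix` is hypothetical, and the even split of frames across segments is an assumption, since the slide does not state the segment boundaries. Strings stand in for real feature values so the layout is easy to read.

```python
def rfc_matrix(frames, segments, utterance):
    """Append the repeated segment and utterance features to every frame vector.

    frames: list of frame-level vectors; segments: list of segment-level
    vectors; utterance: one utterance-level vector. Frames are assigned to
    segments by an even split (an assumption made for this sketch).
    """
    n, k = len(frames), len(segments)
    rows = []
    for i, frame in enumerate(frames):
        seg = segments[i * k // n]  # index of the segment that covers frame i
        rows.append(list(frame) + list(seg) + list(utterance))
    return rows

frames = [[f"a{i}"] for i in range(1, 10)]  # 9 frames
segments = [["b1"], ["b2"], ["b3"]]         # 3 segments
utterance = ["c"]
rows = rfc_matrix(frames, segments, utterance)
for row in rows:
    print(row)
```

Each printed row is one input vector for the GMM classifier, so an utterance with 9 frames yields 9 vectors.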

13 / 35 Combine frame-level likelihoods with the segment-level and utterance-level features
[Figure: two block diagrams.
Vlasenko et al. approach: frame-level feature → frame-level HMMs → frame-level likelihoods of each emotion; these are combined with the utterance-level feature as the input to the final classifier.
The proposed PRFC method: frame-level feature → frame-level HMMs → frame-level likelihoods of each emotion; these are combined with the segment-level feature and the utterance-level feature as the input to the final classifier.]

14 / 35 Frame-level features are transformed to utterance-level features
FL features → FL HMMs → FL likelihood
Input to the GMM classifier (one vector per segment):
◦ [ FL likelihood | Segment 1 feature | UL feature ]
◦ [ FL likelihood | Segment 2 feature | UL feature ]
◦ [ FL likelihood | Segment 3 feature | UL feature ]
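A minimal sketch of how the PRFC inputs could be assembled, assuming the frame-level HMM likelihoods have already been computed. The function name `prfc_vectors` and the toy dimensions (4 emotions, 2-dimensional segment and utterance features) are assumptions for illustration only.

```python
def prfc_vectors(fl_likelihoods, segments, utterance):
    """One input vector per segment: [FL likelihoods | segment feature | UL feature]."""
    return [list(fl_likelihoods) + list(seg) + list(utterance) for seg in segments]

fl_lh = [0.2, 0.4, 0.1, 0.6]                         # one likelihood per emotion
seg_feats = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]     # 3 segments, toy 2-dim features
utt_feat = [9.0, 9.1]                                # toy 2-dim utterance feature
vectors = prfc_vectors(fl_lh, seg_feats, utt_feat)
print(len(vectors), len(vectors[0]))  # 3 8
```

Compared with duplicating features for every frame, this produces only one vector per segment, which is why the feature matrix is much smaller.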

15 / 35 Timing level at which each proposed method uses the features (feature types: frame-level (FL), segment-level (SL), utterance-level (UL))
◦ LC: UL, SL, UL
◦ WLC: UL, UL
◦ RFC: FL, FL
◦ PRFC: UL, SL, SL, SL

16 / 35
Emotional DB | EMODB (Berlin) | eNTERFACE05
Language | German | English
Emotions (# of sentences) | Anger (126), Boredom (80), Disgust (44), Anxiety (66), Happiness (70), Sadness (62), Neutral (78) | Anger (210), Disgust (210), Anxiety (210), Happiness (210), Sadness (210), Surprise (210)
# Subjects (nation) | 10 (5f/5m) (German) | 42 (8f/34m) (14 nations)
Recorded situation | Acted
Judge (#) | Pass 80% (30) | Pass 100% (2)
# All | 526 sentences | 1277 sentences
Total length | 22 min | 1 hr

17 / 35 Sequential Forward Feature Selection (SFFS)
◦ Start from an empty feature set and sequentially add one feature at a time to the selected set
◦ Evaluate the performance of each set with different classifiers, then pick the best feature set
Frame-level features (HMM classifier, 3 states, 8 components)
◦ Running SFFS on FL features is time-consuming
◦ Adding any feature other than MFCC decreased the accuracy
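The selection loop described above can be sketched as a greedy search. This is a generic forward-selection sketch, not the exact procedure used in the experiments; `forward_selection` and `toy_score` are hypothetical names, and in practice the score would be a classifier's validation accuracy rather than the toy formula used here.

```python
def forward_selection(candidates, score):
    """Greedy sequential forward selection.

    candidates: iterable of feature names; score: callable mapping a list of
    features to a number (higher is better). Returns the best-scoring subset
    seen along the greedy path, together with its score.
    """
    remaining = list(candidates)
    selected, best_set, best_score = [], [], float("-inf")
    while remaining:
        # Try adding each remaining feature and keep the most helpful one.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        selected.append(top_feature)
        remaining.remove(top_feature)
        if top_score > best_score:
            best_score, best_set = top_score, list(selected)
    return best_set, best_score

# Toy score (hypothetical): only "mfcc" helps; every other feature costs accuracy.
def toy_score(features):
    return 0.7 * ("mfcc" in features) - 0.05 * sum(f != "mfcc" for f in features)

best_set, best_score = forward_selection(["pitch", "mfcc", "lfpc"], toy_score)
print(best_set, best_score)  # ['mfcc'] 0.7
```

The toy score mirrors the frame-level observation on the slide: once MFCC is selected, adding any other feature lowers the score, so the search returns MFCC alone.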

18 / 35 Segment-level features (GMM classifier, 32 components)
◦ 1-segment: reduce the dimension of the feature space from 384 (16 × 2 × 12) to 189
◦ 2-segment: reduce the dimension of the feature space from 384 to 253
◦ 3-segment: reduce the dimension of the feature space from 384 to 221
Utterance-level features (GMM classifier, 32 components)
◦ Reduce the dimension of the feature space from 384 to 230

19 / 35 Leave-one-speaker-out (LOSO) for EMODB
Four-fold cross-validation for eNTERFACE
Fold | Gender
1 | 2 females + 8 males
2 | 2 females + 8 males
3 | 2 females + 9 males
4 | 2 females + 9 males
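A minimal sketch of a leave-one-speaker-out split: each fold holds out every utterance of one speaker for testing and trains on the rest. The function name and the speaker labels below are hypothetical.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for speaker in sorted(set(speaker_ids)):
        test = [i for i, s in enumerate(speaker_ids) if s == speaker]
        train = [i for i, s in enumerate(speaker_ids) if s != speaker]
        yield speaker, train, test

# Hypothetical speaker label for each of six utterances.
speakers = ["s03", "s08", "s03", "s09", "s08", "s03"]
folds = list(leave_one_speaker_out(speakers))
for speaker, train, test in folds:
    print(speaker, train, test)
```

With 10 speakers, as in EMODB, this scheme gives 10 folds, and no speaker ever appears in both the training and the test set of the same fold.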

20 / 35 Single-timing-level
◦ FL, SL, UL
Dual-timing-level
◦ FL+SL, FL+UL, SL+UL
Trio-timing-level
◦ FL+SL+UL
Timing level | FL | SL | UL
Classifier | HMM (3-state) | GMM | GMM
Components (EMODB / eNTER) | 64 / 64 | 32 / 16 | 32 / 16

21 / 35 Likelihood Combination
Final classifier: libSVM (linear kernel, c=1)

22 / 35 Likelihood Combination
◦ Frame-level features have the best recognition rate among the three individual feature sets on EMODB
◦ Combining features of different timing levels improves the recognition rate
◦ Segment-level features do not perform well on eNTERFACE

23 / 35 Find the optimized weighting factors by brute force
Dual-timing-level likelihoods
◦ $w_1 + w_2 = 1$
◦ Tuning gap = 0.01
◦ 100 iterations
Trio-timing-level likelihoods
◦ $w_1 + w_2 + w_3 = 1$
◦ Tuning gap = 0.01
◦ 10000 iterations (100 × 100)
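A minimal sketch of the brute-force search for the trio-timing-level case, assuming each utterance already has one per-emotion likelihood vector per timing level. The function names and the two-utterance toy data are hypothetical. The sketch enumerates only the valid grid points (the third weight is determined by the first two), so its loop count differs from the 100 × 100 figure on the slide.

```python
def accuracy(weights, utterances):
    """utterances: list of (likelihood_vectors, true_label), one vector per timing level."""
    correct = 0
    for vectors, label in utterances:
        n = len(vectors[0])
        scores = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(n)]
        correct += max(range(n), key=scores.__getitem__) == label
    return correct / len(utterances)

def search_trio_weights(utterances, steps=100):
    """Try every (w1, w2, w3) on a 1/steps grid with w1 + w2 + w3 = 1 and w_i >= 0."""
    best_weights, best_acc = None, -1.0
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            weights = (i / steps, j / steps, (steps - i - j) / steps)
            acc = accuracy(weights, utterances)
            if acc > best_acc:  # keep the first weight set reaching the best accuracy
                best_weights, best_acc = weights, acc
    return best_weights, best_acc

# Toy data: two utterances, two emotions, three timing levels (made-up values).
data = [
    ([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], 1),
    ([[0.3, 0.7], [0.6, 0.4], [0.5, 0.5]], 0),
]
best_weights, best_acc = search_trio_weights(data)
print(best_weights, best_acc)
```

With a tuning gap of 0.01 the search is cheap for two or three timing levels, but the number of grid points grows quickly if more likelihood streams are added. In practice the weights should be tuned on data that is separate from the test set.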

24 / 35 Weighted Likelihood Combination
First classifier: HMM, GMM

25 / 35 Weighting factor
Timing-level combination | FL+SL | FL+UL | SL+UL | FL+SL+UL
EMODB | 75.1% | 74.2% | 67.3% | 76.4%
Weight set (EMODB) | 0.73, 0.27 | 0.89, 0.11 | 0.62, 0.28 | 0.61, 0.25, 0.14
eNTER | 64.2% | 62.5% | 60.9% | 66.4%
Weight set (eNTER) | 0.84, 0.16 | 0.63, 0.47 | 0.42, 0.58 | 0.86, 0.05, 0.09

26 / 35 Weighted Likelihood Combination
◦ The combination of 3 timing-level likelihoods did not show much improvement
◦ Combining different timing levels with weighting factors did show an improvement, but a small one
◦ The weighting factor for the frame-level likelihood is always the largest

27 / 35 Find the optimized number of Gaussian mixture components for the final classifier (GMM, FL+SL+UL)
◦ 64 (EMODB) / 32 (eNTER)

28 / 35 Raw Feature Combination
Final classifier: GMM (mix. comp. = 64 / 32)

29 / 35 Raw Feature Combination
◦ Preserves the most information from the features of the different timing levels
◦ Too time-consuming to be efficient (approx. 5 hr, LOSO, GMM, EMODB)

30 / 35 Find the optimized number of Gaussian mixture components for the final classifier (GMM, FL+SL+UL)
◦ 256 (EMODB) / 32 (eNTER)

31 / 35 Partial Raw Feature Combination
Final classifier: GMM (mix. comp. = 256 / 32)

32 / 35 Partial Raw Feature Combination
◦ Has a lower feature dimension than RFC, so the computation time is shorter, but still long (approx. 3-4 hr, LOSO, EMODB, GMM)
◦ Trio-timing-level features still perform better than single-timing-level features and dual-timing-level combinations

33 / 35 Noise in eNTERFACE
Example sentences
◦ EMODB: "Der Lappen liegt auf dem Eisschrank." (The tablecloth is lying on the fridge.)
◦ eNTERFACE: "No, no, no! I need this money!"

34 / 35
◦ The likelihood combination method shows the best recognition result
◦ Trio-timing-level features achieve the best recognition rate
◦ Features of different timing levels can compensate for each other
◦ Frame-level features play the most important role

35 / 35 Variable segment size
◦ Word? Syllable?
◦ An automatic segmentation method
Learning-to-rank approach
◦ Map the emotions to a 2-dimensional space
◦ Give a ranking label for each emotion

36 / 35 Thank you!
