2 / 35 Outline
- Introduction
- Related Work
- The Proposed Approaches: Likelihood Combination (LC), Weighted Likelihood Combination (WLC), Raw Feature Combination (RFC), Partial Raw Feature Combination (PRFC)
- Experiments
- Conclusions
- Future Work
3 / 35 Emotion Recognition
We speak with different emotions, such as sadness, happiness, anger, etc.
Goal: design an emotion recognition system with promising recognition performance.
Applications: mostly human-computer interfaces and robot-human interaction.
4 / 35 Related Work
- D. N. Jiang and L. H. Cai, "Speech emotion classification with the combination of statistic features and temporal features," in IEEE International Conference on Multimedia and Expo (ICME), vol. 3, pp. 1967-1970, June 2004.
- B. Vlasenko et al., "Combining frame and turn-level information for robust recognition of emotions within speech," in Interspeech, 2007, pp. 2225-2228.
- B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, and A. Wendemuth, "Acoustic emotion recognition: a benchmark comparison of performances," in IEEE ASRU, 2009.
- B. Schuller and G. Rigoll, "Timing levels in segment-based speech emotion recognition," in Interspeech/ICSLP, ISCA, pp. 1818-1821, 2006.

Author | Database | Timing level | Classifier
Jiang et al. (2004 ICME) | Not public (Chinese) | Utterance, Syllable | GMM, HMM, MLP, WBC
Chetouani et al. (2009 Cogn. Computing) | EMODB (German), Aholab (Spanish) | Vowel, Consonant | GMM, SVM
Schuller et al. (2006 ICSLP) | EMODB (German) | Utterance, 3-Segment | SVM
Vlasenko et al. (2007 Interspeech) | EMODB (German), SUSAS (English) | Utterance, Frame | GMM, SVM
5 / 35 System Overview
Corpus → training set / test set → feature extraction (frame-level, segment-level, and utterance-level features) → feature selection → combination method (LC / WLC / RFC / PRFC) → classifier → trained model → classified result.
7 / 35 Likelihood Combination (LC)
Use the likelihoods of each emotion as the input to the final classifier:
- Frame-level features → frame-level HMMs → frame-level likelihood of each emotion
- Segment-level features → segment-level GMMs → segment-level likelihood of each emotion
- Utterance-level features → utterance-level GMMs → utterance-level likelihood of each emotion
All likelihoods are fed to the final classifier. (Jiang et al. combine utterance-level GMMs with syllable-level features → syllable-level HMMs → syllable-level likelihoods of each emotion; the proposed LC method uses the three timing levels above.)
8 / 35 LC notation
Assume there are 4 emotion classes. For an utterance:
- a_i = the frame-level (FL) likelihood of the i-th emotion (a_1 ... a_4)
- b_ij = the segment-level (SL) likelihood of the j-th emotion in the i-th segment (b_11 ... b_34)
- c_i = the utterance-level (UL) likelihood of the i-th emotion (c_1 ... c_4)
The concatenation [a_1 ... a_4, b_11 ... b_34, c_1 ... c_4] is the input to the final classifier.
9 / 35 Weighted Likelihood Combination (WLC)
Sum the likelihoods from the different timing levels with weights:
Result = w_1 * (frame-level likelihoods) + w_2 * (segment-level likelihoods) + w_3 * (utterance-level likelihoods),
where w_1 + w_2 + w_3 = 1 and w_1, w_2, w_3 >= 0.
Frame-level features → frame-level HMMs; segment-level features → segment-level GMMs; utterance-level features → utterance-level GMMs.
(Chetouani et al. instead combine vowel-level (V-L) and consonant-level (C-L) features → V-L/C-L GMMs → V-L/C-L likelihoods of each emotion, weighted as w_1 * (V-L likelihoods) + w_2 * (C-L likelihoods).)
10 / 35 WLC example
Assume there are 4 emotion classes and the optimized weight set is [w_1 w_2 w_3] = [0.1 0.6 0.3].
- Frame-level likelihoods: [0.2 0.4 0.1 0.6]
- Segment-level likelihoods: the average of [0.8 0.6 0.3 0.9] and [0.6 0.4 0.1 0.7] = [0.7 0.5 0.2 0.8]
- Utterance-level likelihoods: [0.1 0.6 0.4 0.2]
Result = 0.1 * [0.2 0.4 0.1 0.6] + 0.6 * [0.7 0.5 0.2 0.8] + 0.3 * [0.1 0.6 0.4 0.2] = [0.47 0.52 0.25 0.60].
Since 0.60 is the maximum entry of the result, this utterance is classified as the 4th emotion.
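The weighted combination above can be sketched in a few lines of Python (a minimal sketch using the slide's example numbers; the function and variable names are illustrative, not from the thesis code):

```python
# WLC sketch: combine per-timing-level likelihood vectors with weights
# and pick the emotion with the highest combined likelihood.
def wlc_predict(level_likelihoods, weights):
    """level_likelihoods: one likelihood vector per timing level;
    weights: one weight per timing level (summing to 1).
    Returns the combined vector and the 0-based index of the winner."""
    n_emotions = len(level_likelihoods[0])
    combined = [
        sum(w * lh[e] for w, lh in zip(weights, level_likelihoods))
        for e in range(n_emotions)
    ]
    return combined, max(range(n_emotions), key=lambda e: combined[e])

frame_lh = [0.2, 0.4, 0.1, 0.6]
# segment-level likelihoods are averaged over the segments first
seg_lh = [(a + b) / 2 for a, b in zip([0.8, 0.6, 0.3, 0.9], [0.6, 0.4, 0.1, 0.7])]
utt_lh = [0.1, 0.6, 0.4, 0.2]

combined, winner = wlc_predict([frame_lh, seg_lh, utt_lh], [0.1, 0.6, 0.3])
# combined ≈ [0.47, 0.52, 0.25, 0.60]; winner = 3, i.e. the 4th emotion
```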
11 / 35 Raw Feature Combination (RFC)
Duplicate the segment-level and utterance-level features and concatenate them to the frame-level features as input to the final classifier. (Schuller et al. combine segment-level and utterance-level features; the proposed RFC method adds the frame-level features.)
12 / 35 RFC example
Assume an utterance has 9 frames with frame-level features a1 ... a9, 3 segments with segment-level features b1, b2, b3, and one utterance-level feature c. Along the frame axis, each frame vector is extended (along the dimension axis) with the feature of the segment it belongs to (b1 for a1-a3, b2 for a4-a6, b3 for a7-a9) and with the duplicated utterance-level feature c. The resulting matrix is the input to the GMM classifier.
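A minimal sketch of this duplication-and-concatenation step (the names and the even frames-per-segment split are illustrative assumptions, not the thesis implementation):

```python
# RFC sketch: duplicate segment- and utterance-level feature vectors
# and concatenate them onto each frame-level feature vector.
def rfc_stack(frame_feats, seg_feats, utt_feat):
    """frame_feats: list of per-frame vectors; seg_feats: one vector per
    segment (frames split evenly across segments); utt_feat: one vector."""
    frames_per_seg = len(frame_feats) // len(seg_feats)
    stacked = []
    for i, f in enumerate(frame_feats):
        seg = seg_feats[min(i // frames_per_seg, len(seg_feats) - 1)]
        stacked.append(f + seg + utt_feat)  # concatenate along the dimension axis
    return stacked

frames = [[float(i)] for i in range(1, 10)]  # a1..a9, 1-D for brevity
segments = [[10.0], [20.0], [30.0]]          # b1, b2, b3
utterance = [99.0]                           # c
X = rfc_stack(frames, segments, utterance)
# X[0] == [1.0, 10.0, 99.0] and X[8] == [9.0, 30.0, 99.0]
```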
13 / 35 Partial Raw Feature Combination (PRFC)
Combine the frame-level likelihoods with the segment- and utterance-level features: frame-level features → frame-level HMMs → frame-level likelihoods of each emotion, which are then concatenated with the segment-level and utterance-level features as input to the final classifier. (Vlasenko et al. combine the frame-level likelihoods with the utterance-level features only.)
14 / 35 PRFC example
Frame-level features → frame-level HMMs → frame-level likelihoods (one likelihood vector per segment: segment 1, 2, 3). Each segment's frame-level likelihoods are paired with that segment's feature, and the utterance-level feature is appended; in this way the frame-level features are transformed to utterance-level features, and the result is the input to the GMM classifier.
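The concatenation step can be sketched as follows (a simplified sketch: real frame-level likelihoods would come from the trained HMMs, and all names here are illustrative):

```python
# PRFC sketch: frame-level features are reduced to one likelihood vector
# via the frame-level models, then concatenated with the raw segment- and
# utterance-level features to form a single utterance-level vector.
def prfc_vector(frame_likelihoods, seg_feats, utt_feat):
    """frame_likelihoods: per-emotion likelihoods from the frame-level HMMs;
    seg_feats: list of segment-level feature vectors; utt_feat: one vector."""
    flat_segs = [x for seg in seg_feats for x in seg]
    return frame_likelihoods + flat_segs + utt_feat

vec = prfc_vector([0.2, 0.4, 0.1, 0.6], [[1.0], [2.0], [3.0]], [9.0])
# vec == [0.2, 0.4, 0.1, 0.6, 1.0, 2.0, 3.0, 9.0]
```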
17 / 35 Sequential Forward Feature Selection (SFFS)
Start from an empty feature set and sequentially add one feature at a time to the selected set; evaluate the performance of each candidate set with different classifiers, then keep the best feature set.
Frame-level features (HMM classifier, 3 states, 8 components): SFFS is too time-consuming for frame-level features, and any feature other than MFCC decreased the accuracy.
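The greedy selection loop can be sketched as below (a toy sketch: `score` stands in for the classifier evaluation on a candidate feature subset, and the feature names are made up for illustration):

```python
# Sequential forward selection sketch: grow the selected set greedily,
# adding at each step the feature that maximizes the subset score.
def sequential_forward_selection(features, score, k):
    """features: candidate feature names; score: evaluates a subset;
    k: target number of selected features."""
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: reward overlap with a "useful" set, lightly penalize size.
useful = {"mfcc", "pitch"}
score = lambda subset: len(useful & set(subset)) - 0.01 * len(subset)
chosen = sequential_forward_selection(["energy", "mfcc", "pitch", "zcr"], score, 2)
# chosen == ["mfcc", "pitch"]
```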
18 / 35 Feature selection results
Segment-level features (GMM classifier, 32 components):
- 1-segment: feature dimension reduced from 384 (16 x 2 x 12) to 189
- 2-segment: from 384 to 253
- 3-segment: from 384 to 221
Utterance-level features (GMM classifier, 32 components): feature dimension reduced from 384 to 230.
19 / 35 Evaluation protocol
- Leave-one-speaker-out (LOSO) for EMODB
- Four-fold cross-validation for eNTERFACE; each fold mixes genders (e.g. 2 females + 8 males in one fold, 2 females + 9 males in another)
22 / 35 Likelihood Combination: results
- Frame-level features have the best recognition rate among the three individual feature sets on EMODB
- Combining features from different timing levels shows improvement
- Segment-level features do not perform well on eNTERFACE
23 / 35 Find the optimized weighting factors by brute force
- Dual-timing-level likelihoods: w_1 + w_2 = 1, tuning gap = 0.01 → 100 iterations
- Trio-timing-level likelihoods: w_1 + w_2 + w_3 = 1, tuning gap = 0.01 → 10,000 iterations (100 x 100)
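The brute-force scan can be sketched as below (a sketch only: `evaluate` stands in for the recognition-rate evaluation of a weight set, replaced here by a toy objective peaking at the slide's optimum [0.1 0.6 0.3]):

```python
# Brute-force weight search with a 0.01 tuning gap, as on the slide.
def best_trio_weights(evaluate, gap=0.01):
    """Scan w1 and w2 on a grid with step `gap`; w3 = 1 - w1 - w2.
    Returns the weight triple with the highest score.
    Note: the slide counts 100*100 iterations; restricting j so that
    w1 + w2 <= 1 roughly halves that without changing the result."""
    steps = int(round(1 / gap))
    best, best_score = None, float("-inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w1, w2 = i * gap, j * gap
            w3 = 1.0 - w1 - w2
            score = evaluate((w1, w2, w3))
            if score > best_score:
                best, best_score = (w1, w2, w3), score
    return best

# Toy objective with its maximum at (0.1, 0.6, 0.3):
evaluate = lambda w: -((w[0] - 0.1) ** 2 + (w[1] - 0.6) ** 2 + (w[2] - 0.3) ** 2)
w = best_trio_weights(evaluate)
# w is (up to float rounding) (0.1, 0.6, 0.3)
```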
24 / 35 Weighted Likelihood Combination (first classifier: HMM, GMM)
26 / 35 Weighted Likelihood Combination: observations
- Combining the three timing-level likelihoods didn't show much improvement
- Combining different timing levels with weighting factors did show improvement, but not much
- The weighting factor for the frame-level likelihood is always the largest
27 / 35 Find the optimized number of Gaussian mixture components for the final classifier (GMM, FL+SL+UL): 64 for EMODB, 32 for eNTERFACE.
28 / 35 Raw Feature Combination (final classifier: GMM, mixture components = 64/32)
29 / 35 Raw Feature Combination: observations
- Preserves the most information from the different timing-level features
- Too time-consuming (approx. 5 hr for LOSO with GMM on EMODB), so it is not efficient
30 / 35 Find the optimized number of Gaussian mixture components for the final classifier (GMM, FL+SL+UL): 256 for EMODB, 32 for eNTERFACE.
31 / 35 Partial Raw Feature Combination (final classifier: GMM, mixture components = 256/32)
32 / 35 Partial Raw Feature Combination: observations
- Has a lower feature dimension than RFC, so the computation time is shorter, but still long (approx. 3-4 hr for LOSO with GMM on EMODB)
- Trio-timing-level features still perform better than single-timing-level features and dual-timing-level combinations
33 / 35 Noise in eNTERFACE
- EMODB example utterance: "Der Lappen liegt auf dem Eisschrank." (The tablecloth is lying on the fridge.)
- eNTERFACE example utterance: "No, no, no! I need this money!"
34 / 35 Conclusions
- The likelihood combination method shows the best recognition result
- Trio-timing-level features achieve the best recognition rate
- Features from different timing levels can compensate for each other
- Frame-level features play the most important role
35 / 35 Future Work
- Variable segment size: word? syllable? An automatic segmentation method
- A learning-to-rank approach: map the emotions to a 2-dimensional space and give a ranking label to each emotion