1
Emotion and Speech
Techniques, models and results
Facts, fiction and opinions
Past, present and future
Acted, spontaneous, recollected
In Asia, Europe, America and the Middle East
HUMAINE Workshop on Signals and Signs (WP4), Santorini, September 2004
2
Overview
  A short introduction to speech science
  … and speech analysis tools
  Speech and emotion: models, problems ... and results
  A review of open issues
  Deliverables within the HUMAINE framework
4
A short introduction to SPEECH:
  Most of those present here are familiar with various aspects of signal processing
  For the benefit of those who aren't acquainted with the speech signal in particular:
    We'll start with an overview of speech production models and analysis techniques
    The rest of you can sleep for a few minutes
5
The speech signal
  A 1-D signal
  Does that make it a simple one? NO…
  There are many analysis techniques
  As with many types of systems, parametric models are very useful here…
  A simple and very useful speech production model: the source/filter model
  (in case you're worried, we'll see that this is directly related to emotions also)
6
The source/filter model
Components:
  The lungs (create air pressure)
  Two elements that turn this into a "raw" signal:
    The vocal folds (periodic signals)
    Constrictions that make the airflow turbulent (noise)
  The vocal tract
    Partly immobile: upper jaw, teeth
    Partly mobile: soft palate, tongue, lips, lower jaw – also called "articulators"
    Its influence on the raw signal can be modeled very well with a low-order (~10) digital filter
[diagram: source → filter]
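The source/filter idea can be sketched in a few lines: drive a resonant filter with an impulse train. This is a toy illustration only (one resonance instead of a full ~10th-order vocal-tract model; all parameter values are invented for the example):

```python
import math

def impulse_train(f0, fs, n):
    """Glottal 'source': unit impulses spaced one pitch period apart."""
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(x, fc, bw, fs):
    """Vocal-tract 'filter': a single two-pole resonance at fc Hz."""
    r = math.exp(-math.pi * bw / fs)
    theta = 2 * math.pi * fc / fs
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        v = s + a1 * y1 + a2 * y2
        y.append(v)
        y1, y2 = v, y1
    return y

# a 100 Hz source shaped by one 700 Hz "formant", at 8 kHz sampling
speech = resonator(impulse_train(100, 8000, 800), 700, 100, 8000)
```

Swapping the impulse train for noise gives the unvoiced case; the filter stays the same.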
7
The net result:
  A complex signal that changes its properties constantly:
    Sometimes periodic
    Sometimes colored noise
    Approximately stationary over time windows of ~20 milliseconds
  And of course – contains a great deal of information
    Text – linguistic information
    Other stuff – paralinguistic information:
      Speaker identity
      Gender
      Socioeconomic background
      Stress, accent
      Emotional state
      Etc. …
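Because the signal is only quasi-stationary, the analysis techniques below all operate on short overlapping windows. A minimal framing helper (the 20 ms / 10 ms values are the conventional defaults, not something the slides mandate):

```python
def frames(signal, fs, frame_ms=20, hop_ms=10):
    """Slice a signal into overlapping ~20 ms analysis frames."""
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    return [signal[i:i + flen]
            for i in range(0, len(signal) - flen + 1, hop)]

# one second at 8 kHz: 160-sample frames, one every 80 samples
fr = frames([0.0] * 8000, 8000)
```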
8
How is this information coded?
  Textual information – mainly in the filter and the way it changes its properties over time
    Filter "snapshots" are called segments
  Paralinguistic information – mainly in the source parameters
    Lung pressure – determines the intensity
    Vocal fold periodicity – determines instantaneous frequency or "pitch"
    Configuration of the glottis – determines overall spectral tilt, or "voice quality"
9
Prosody:
Prosody is another name for part of the paralinguistic information, composed of:
  Intonation – the way in which pitch changes over time
  Intensity – changes in intensity over time
    Problem: some segments are inherently weaker than others
  Rhythm – segment durations vs. time
Prosody does not include voice quality, but voice quality is also part of the paralinguistic information
10
To summarize:
  Speech science is at a mature stage
  The source/filter model is very useful in understanding speech production
  Many applications (speech recognition, speaker verification, emotion recognition, etc.) require extraction of the model parameters from the speech signal (an inverse problem)
  This is the domain of: speech analysis techniques
12
The large picture: speech analysis in the HUMAINE framework
Speech analysis is just one component in the context of speech and emotion:
[diagram: theory of emotion and training data feed a speech analysis engine, which processes real data for a high-level application]
Its overall objectives:
  Calculate raw speech parameters
  Extract features salient to emotional content
  Discard irrelevant features
  Use them to characterize and maybe classify emotional speech
13
Signals to Signs - The process
[diagram: the knowledge-discovery pipeline – databases/files → data cleaning and integration → data warehouse → selection and transformation → data representation → data mining → patterns → evaluation and presentation → knowledge]
14
S2S (SOS…?) - The tools
A combination of techniques that belong to different disciplines:
  Data warehouse technologies (data storage, information retrieval, query answering, etc.)
  Data preprocessing and handling
  Data modeling / visualization
  Machine learning (statistical data analysis, pattern recognition, information retrieval, etc.)
15
The objective of speech analysis techniques
1. To extract the raw model parameters from the speech signal
   Interfering factors:
     Reality never exactly fits the model
     Background noise
     Speaker overlap
2. To extract features
3. To interpret them in meaningful ways (pattern recognition)
   Really hard!
16
It remains that -
  Useful models and techniques exist for extracting the various information types from the speech signal
Yet …
  Many applications, such as speech recognition, speaker identification, speech synthesis, etc., are far from being perfected
… So what about emotion?
17
For the moment – let's focus on the small picture
The consensus is that emotions are coded in:
  Prosody
  Voice quality
  And sometimes in the textual information
Let's discuss the purely technical aspects of evaluating all of these …
18
Extracting features from the speech signal
Stage 1 – Extracting raw features:
  Pitch
  Intensity
  Voice quality
  Pauses
  Segmental information – phones and their duration
  Text
(by the way … who extracts them – man, machine or both?)
19
Pitch
  Pitch: the instantaneous frequency
  Sounds deceptively simple to find – but it isn't!
  Lots of research has been devoted to pitch detection
  Composed of two sub-problems:
    For a given signal – is there periodicity at all?
    If so – what's the fundamental frequency?
  Complicating factors:
    Speaker-related factors – hoarseness, diplophony, etc.
    Background-related factors – noise, overlapping speakers, filters (as in telephony)
  In the context of emotions:
    Small errors are acceptable
    Large errors (octave jumps, false positives) are catastrophic
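The two sub-problems map directly onto a classic autocorrelation pitch detector. A pure-Python sketch (the 0.3 voicing threshold and the 60–400 Hz search range are illustrative choices, not recommended settings):

```python
import math

def autocorr_pitch(frame, fs, fmin=60, fmax=400, threshold=0.3):
    """Normalized-autocorrelation pitch estimate for one frame.
    Returns None if no clear periodicity is found (sub-problem 1),
    otherwise the fundamental frequency in Hz (sub-problem 2)."""
    energy = sum(s * s for s in frame)
    if energy == 0:
        return None
    best_lag, best_r = None, 0.0
    for lag in range(int(fs / fmax), int(fs / fmin) + 1):
        r = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        r /= energy  # ~1.0 for perfect periodicity at this lag
        if r > best_r:
            best_lag, best_r = lag, r
    if best_r < threshold:  # weak peak: call the frame unvoiced
        return None
    return fs / best_lag

# a clean 100 Hz tone at 8 kHz should come out very close to 100 Hz
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(640)]
f0 = autocorr_pitch(tone, 8000)
```

On clean tones this is easy; the octave jumps mentioned above appear exactly when a sub-multiple lag scores almost as well as the true period.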
21
Intensity
  Appears to be even simpler than pitch!
  Intensity is quite easy to measure …
  Yet it is the most influenced by unrelated factors!
  Aside from the speaker, intensity is gravely affected by:
    Distance from the microphone
    Gain settings in the recording equipment
    Clipping
    AGC
    Background noise
    Recording environment
  Without normalization – intensity is almost useless!
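One common remedy is to work on a dB scale and mean-normalize per utterance, which cancels any fixed gain or distance factor. A sketch (per-utterance mean subtraction is just one of several possible normalization schemes):

```python
import math

def intensity_db(frame, eps=1e-10):
    """Short-time intensity: RMS energy on a dB scale."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(rms + eps)

def normalize(contour):
    """Subtract the utterance mean so only relative changes remain;
    fixed offsets (mic distance, gain setting) cancel out."""
    mean = sum(contour) / len(contour)
    return [v - mean for v in contour]

quiet_mic = [intensity_db(f) for f in ([0.5] * 160, [0.1] * 160)]
# doubling the gain shifts every frame by the same ~6 dB ...
hot_mic = [intensity_db([2 * s for s in f]) for f in ([0.5] * 160, [0.1] * 160)]
# ... and mean-normalization removes that shift entirely
```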
22
Voice quality
  Several measures are used to quantify it:
    Local irregularity in pitch and intensity
    Ratio between harmonic components and noise components
    Distribution of energy in the spectrum
  Affected by a multitude of factors other than emotions
  Some standardized measures are often used in clinical applications
  A large factor in emotional speech!
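"Local irregularity in pitch" is usually quantified as jitter. A minimal version of the standard local-jitter formula (the period values below are made up for illustration):

```python
def local_jitter(periods):
    """Local jitter: mean absolute cycle-to-cycle period difference,
    relative to the mean period (a standard clinical voice measure)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

steady = [10.0, 10.0, 10.0, 10.0]   # perfectly regular phonation
shaky = [10.0, 10.6, 9.5, 10.4]     # irregular vocal-fold cycles
```

The analogous intensity measure (shimmer) replaces periods with per-cycle amplitudes in the same formula.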
23
Segments
  There are different ways of defining precisely what these are
  Automatic segmentation is difficult, though not as difficult as speech recognition
  Even the segment boundaries can give important timing information, related to rhythm – an important component of prosody
24
Text
  Is this "raw" data or not? Is it data … at all?
  Some studies on emotion specifically eliminated this factor (filtered speech, uniform texts)
  Other studies are interested mainly in text
  If we want to deal with text, we must keep in mind:
    Automated speech recognition is HARD!
      Especially with strong background noise
      Especially when strong emotions are present, modifying the speakers' normal voices and mannerisms
      Especially when dealing with multiple speakers
25
Some complicating factors in raw feature extraction:
  Background noise
  Speaker overlap
  Speaker variability
  Variability in recording equipment
26
In the general context of speech analysis -
  The raw features we discussed are not specific only to the study of emotion
  Yet – issues related to calculating them reliably crop up again and again in emotion-related studies
  Some standard and reliable tools would be very helpful
27
Two opposing approaches to computing raw features:
  Assume we have perfect algorithms for extracting all this information
    If we don't – help out manually
    This can be carried out only over small databases
    Useful in purely theoretical studies
  Acknowledge we only have imperfect algorithms
    Find how to deal automatically with imperfect data
    Very important for large databases
[diagram: a scale from "ideal" through "error-prone" to "real life"]
28
Next - what do we do with it all?
  Reminder: we have large amounts of raw data
  Now we have to make some meaning from it
29
Feature extraction …
Stage 2 – data reduction:
  Take a sea of numbers
  Reduce it to a small number of meaningful measures
  Prove they're meaningful
An interesting way to look at it:
  Separating the "signal" (e.g. emotion) from the "noise" (anything else)
30
An example of "noise":
Here pitch and intensity have totally unemotional (but important) roles:
[figure from Deller et al.]
31
Examples of high-level features
  Pitch fitting – stylization
  MoMel
  Parametric modeling
  Statistics
34
One way to extract the essential information:
Pitch stylization – the IPO method
[figure: two pitch contours, 0–500 Hz over ~3.4 s – raw vs. stylized]
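Stylization replaces the raw contour with a few straight-line segments. The sketch below uses plain Douglas-Peucker-style simplification; the actual IPO method adds perceptual constraints (keeping only perceptually relevant movements) that this toy version ignores:

```python
def stylize(contour, tol):
    """Piecewise-linear contour simplification (Douglas-Peucker style).
    Keeps a point only where the contour strays more than tol from a
    straight line; returns the indices of the retained breakpoints."""
    def simplify(lo, hi):
        worst, dist = None, 0.0
        for i in range(lo + 1, hi):
            t = (i - lo) / (hi - lo)
            line = contour[lo] + t * (contour[hi] - contour[lo])
            if abs(contour[i] - line) > dist:
                worst, dist = i, abs(contour[i] - line)
        if worst is None or dist <= tol:
            return [lo, hi]
        return simplify(lo, worst)[:-1] + simplify(worst, hi)
    return simplify(0, len(contour) - 1)

# a rise-then-fall pitch contour reduces to three breakpoints
knots = stylize([100, 120, 140, 160, 180, 160, 140, 120, 100], tol=5)
```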
37
Some observations:
  Different parameterizations give different curves → different features
  Yet: perceptually – they are all very similar
38
Questions:
  We can ask what is the minimal or most representative information to capture the pitch contour
  More importantly, though: what aspects of the pitch contour are most relevant to emotion?
39
Several answers appear in the literature:
  Statistical features taken from the raw contour:
    Mean, variance, max, min, range, etc.
  Features taken from parameterized contours:
    Slopes, "main" peaks and dips, etc.
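The first family of features is trivial to compute once a pitch contour exists. A sketch (using None to mark unvoiced frames, a common convention):

```python
def contour_stats(f0):
    """Global statistics of a raw pitch contour (None = unvoiced)."""
    voiced = [v for v in f0 if v is not None]
    mean = sum(voiced) / len(voiced)
    var = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    return {"mean": mean, "var": var, "max": max(voiced),
            "min": min(voiced), "range": max(voiced) - min(voiced)}

stats = contour_stats([110, None, 120, 180, None, 150])
```

The second family needs a parameterization first (e.g. the stylization above), after which slopes and peak heights fall out of the breakpoints.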
40
There's not much time to go into:
  Intensity contours
  Spectra
  Duration
But the problems are very similar
41
The importance of time frames
  We have several measures that vary over time
  Over what time frame should we consider them?
  The meaning we attribute to speech parameters depends on the time frame over which they're considered:
    Fixed-length windows
    Phones
    Words
    "Intonation units"
    "Tunes"
42
Which time frame is best?
  Fixed time frames of several seconds – simple to implement, but naïve
    Very arbitrary
  Words
    Need a recognizer to be marked
    Probably the shortest meaningful frame
  "Intonation units"
    Nobody knows exactly what they are (one "idea" per unit?)
    Hard to measure
    Correlate best with coherent stretches of speech
  "Tunes" – from one pause to the next
    Feasible to implement
    Correlate to some extent with coherent stretches of speech
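Of the options above, "tunes" are the easiest to automate: all that is needed is an intensity contour and a pause rule. A sketch (the silence floor and minimum pause length are illustrative, not calibrated values):

```python
def tunes(intensity_db, floor_db=-40, min_pause=3):
    """Split an utterance into 'tunes' at pauses, where a pause is a
    run of at least min_pause consecutive frames below floor_db.
    Returns (start, end) frame indices for each tune."""
    segments, start, silent = [], None, 0
    for i, v in enumerate(intensity_db):
        if v > floor_db:          # speech frame
            if start is None:
                start = i
            silent = 0
        elif start is not None:   # silent frame inside a candidate tune
            silent += 1
            if silent >= min_pause:
                segments.append((start, i - silent + 1))
                start, silent = None, 0
    if start is not None:
        segments.append((start, len(intensity_db)))
    return segments

# five speech frames, a four-frame pause, three more speech frames
seg = tunes([-10] * 5 + [-50] * 4 + [-12] * 3)
```

Note that short dips below the floor (shorter than min_pause) do not split a tune, which is the point of the minimum-pause rule.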
43
Why is this such an important decision?
  It might help us interpret our data correctly!
44
Therefore … the problem of feature extraction:
  Is NOT a general one
  We want features that are specifically relevant to emotional content …
But before we get to that - we have:
45
The Data Mining part
Stage 3: to extract knowledge = previously unknown information (rules, constraints, regularities, patterns, etc.) from the features database
46
What are we mining?
We look for patterns that either describe the stored data or infer from it (predictions):
  Summarization and characterization (of the class of data that interests us)
  [chart: a feature profile – slope 20, pause 30, accent 1: 15, accent 2: 30, duration 5]
  Discrimination and comparison of features of different classes
  [chart: a per-speaker before/after comparison]

                 Eran  Rafi  Haim  Yuval
  before gamble    20    25    20    15
  after gamble     10    18    15    15
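Both mining modes can be run on the toy table above. (The slide does not say what the numbers measure; the code only illustrates characterization vs. discrimination on that data.)

```python
# the before/after table from the slide
data = {"Eran":  {"before": 20, "after": 10},
        "Rafi":  {"before": 25, "after": 18},
        "Haim":  {"before": 20, "after": 15},
        "Yuval": {"before": 15, "after": 15}}

def characterize(cls):
    """Summarization: describe one class of data by its mean."""
    vals = [d[cls] for d in data.values()]
    return sum(vals) / len(vals)

# discrimination: compare the summaries of the two classes
drop = characterize("before") - characterize("after")
```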
47
Types of Analysis
  Association analysis – of rules of the form X => Y (DB tuples that satisfy X are likely to satisfy Y), where X and Y are pairs of attribute and value/set of values
  Classification and class prediction – find a set of functions to describe and distinguish data classes/concepts, that can be used to predict the class of unlabeled data
  Cluster analysis (unsupervised clustering) – analyze the data when there are no class labels, to deal with new types of data and help group similar events together
48
Association Rules
We search for interesting relationships among items in the data
Interestingness measures:
  Support = (# tuples that contain both A and B) / (# tuples)
    Support measures usefulness: P(A ∪ B)
  Confidence = (# tuples that contain both A and B) / (# tuples that contain A)
    Confidence measures certainty: P(B | A)
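The two measures in code, on a toy transaction list (the item names are invented for illustration):

```python
def support_confidence(transactions, a, b):
    """Support and confidence of the rule a => b over a list of
    transactions, each represented as a set of items."""
    n = len(transactions)
    both = sum(1 for t in transactions if a in t and b in t)
    has_a = sum(1 for t in transactions if a in t)
    return both / n, both / has_a

rows = [{"high_pitch", "angry"}, {"high_pitch", "angry"},
        {"high_pitch"}, {"low_pitch"}]
sup, conf = support_confidence(rows, "high_pitch", "angry")
```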
49
Classification
A two-step process:
  1. Use data tuples with known labels to construct a model
  2. Use the learned model to classify (assign labels to) new data
Since the class label of each training sample is known, this is supervised learning
Data is divided into two groups: training data and test data
Test data is used to estimate the predictive accuracy of the learned model
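The two-step process, sketched with about the simplest possible model (per-class centroids; the features and labels below are invented):

```python
def train_centroids(samples):
    """Step 1: build a model (one mean feature vector per class)
    from labelled training tuples (features, label)."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(model, x):
    """Step 2: assign the label of the nearest class centroid."""
    return min(model, key=lambda y: sum((a - b) ** 2
                                        for a, b in zip(x, model[y])))

# toy features: (pitch mean in Hz, intensity range in dB)
train = [((220, 30), "excited"), ((240, 35), "excited"),
         ((150, 8), "calm"), ((140, 10), "calm")]
test_set = [((230, 28), "excited"), ((145, 12), "calm")]
model = train_centroids(train)
accuracy = sum(classify(model, x) == y for x, y in test_set) / len(test_set)
```

The held-out test_set plays exactly the role described above: it never touches training and only estimates predictive accuracy.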
50
Assets
  No need to know the rules in advance
  Some rules are not easily formulated as mathematical or logical expressions
  Similar to one of the ways humans learn
  Could be more robust to noise and incomplete data
  May require a lot of samples
  Learning depends on existing data only!
51
Algorithms:
  Machine learning (statistical learning)
  Expert systems
  Computational neuroscience
Dangers:
  The model might not be able to learn
  There might not be enough data
  Over-fitting the model to the training data
52
Prediction
  Classification predicts categorical labels
  Prediction models continuous-valued functions
  It is usually used to predict the value, or a range of values, of an attribute of a given sample
    Regression
    Neural networks
53
Clustering
  Constructing models for assigning class labels to data that is unlabeled
  Unsupervised learning
  Clustering is an ill-defined task
  Once clusters are discovered, the clustering model can be used for predicting labels of new data
  Alternatively, the clusters can be used as labels to train a supervised classification algorithm
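A minimal unsupervised example: 1-D k-means, which takes unlabeled values and returns cluster labels that could then seed a supervised classifier, as the slide suggests (the initialization is deliberately naïve, for illustration only):

```python
def kmeans_1d(values, k, iters=20):
    """Minimal k-means on scalars: unlabeled values in,
    cluster centers and per-value labels out."""
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, [min(range(k), key=lambda i: abs(v - centers[i]))
                     for v in values]

centers, labels = kmeans_1d([1, 2, 3, 10, 11, 12], 2)
```

The "ill-defined" caveat shows up in practice: the result depends on k, the initialization, and the distance measure, none of which the data itself dictates.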
54
So how does this technical mumbo jumbo tie into -
56
Speech and emotion
Emotion can affect speech in many ways:
  Consciously
  Unconsciously
  Through the autonomic nervous system
Examples:
  Textual content is usually consciously chosen, except maybe sudden interjections, which may stem from sudden or strong emotions
  Many speech patterns related to emotions are strongly ingrained – therefore, though they can be controlled by the speaker, most often they are not, unless the speaker tries to modify them consciously
  Certain speech characteristics are affected by the degree of arousal, and are therefore nearly impossible to inhibit (e.g. vocal tremor due to grief)
57
Speech analysis: the big picture - again
Speech analysis is just one component in the context of speech and emotion:
[diagram: theories of emotion, databases and real data feed speech analysis, which feeds the application]
58
Is this just another way to spread the blame?
  Us speech analysis guys are just poor little engineers
  The methods we can supply can be no better than the theory and the data that drive them
  … and unfortunately, the jury is still out on both of those points … or not?
    Ask the WP3 and WP5 people – they're here somewhere
Actually –
  One of the difficulties HUMAINE is intended to ease is that researchers in the field often find themselves having to address all of the above! (guilty)
59
The most fundamental problem:
What are the features that signify emotion? To paraphrase – what signals are signs of emotion?
60
The most common solutions:
  Calculate as many as you can think of
  Intuition
  Theory-based answers
  Data-driven answers
Ha! Once more – it's not our fault!
61
What seems to be the most plausible approach -
The data-driven approach
Requiring:
  Emotional speech databases ("corpora")
  Perceptual evaluation of these databases
This is then correlated with speech features
  Which takes us back to a previous square
62
So tell us already – how does emotion influence speech?
… It seems that the answer depends on how you look for it
As hinted before – the answer cannot really be separated from:
  The theories of emotion
  The databases we have of emotional speech -
    Who the subjects are
    How emotion was elicited
63
A short digression -
Will all the speech clinicians in the audience please stand up?
Hmm…. We don’t seem to have so many
Let’s look at what one of them has to say
64
Emotions in the speech clinic
Some speakers have speech/voice problems that modify their “signal”, thus misleading the listener
VOICE – People with vocal instability (high jitter/shimmer/tremor) are clinically perceived as nervous (although the problems reflect irregularity in the vocal folds).
- Breathy voice (in women) is sometimes perceived as "sexy" (while it actually reflects incomplete adduction of the vocal folds).
- Higher excitation level leads to vocal instability (high jitter/shimmer/tremor)
65
Clinical Examples:
STUTTERING – listeners judge people who stutter as nervous, tense, and less confident (identification of stuttering depends on pause duration within the "repetition units", and on the rate of repetitions).
CLUTTERING – listeners judge people who clutter as nervous and less intelligent
66
So - though this is a WP4 meeting …
  It's impossible to avoid talking about WP3 (theory of emotion) and WP5 (databases) issues
  The signs we're looking for can never be separated from the questions:
    Signs of what (emotions)?
    Signs in what (data)?
  May God and Phillipe Gelin forgive me …
67
A not-so-old example: (Murray and Arnott, 1993)
  Very qualitative
  Presupposes dealing with primary emotions
68
BUT …
  If you expect more recent results to give more detailed descriptive outlines – then you're wrong
  The data-driven approaches use a large number of features, and let the computer sort them out:
    32 significant features found by ASSESS, from the initial 375 used
    5 emotions, acted
    55% recognition
69
Some remarks:
  Some features are indicative, even though we probably don't use them perceptually
    e.g. pitch mean: usually this is raised with higher activation
    But we don't have to know the speaker's neutral mean to perceive heightened activation
    My guess: voice quality is what we perceive in such cases
  How "simple" can characterization of emotions become?
    How many features do we listen for?
    Can this be verified?
70
Time intervals
  This issue becomes more and more important as we move towards "natural" data
  Emotion production: how long do emotions last?
    Full-blown emotions are usually short (but not always! Look at Peguy in the LIMSI interview database)
    Moods, or pervasive emotions, are subtle but long-lasting
  Emotion analysis: over what span of speech are they easiest to detect?
71

From the analysis viewpoint:
- Current efforts seem to focus on methods that use time spans with some inherent meaning:
  - Acoustically (ASSESS – Cowie et al.)
  - Linguistically (Batliner et al.)
- We mentioned that prosody carries
  - emotional information (our "signal")
  - other information ("noise"): phrasing, various types of prominence

BUT …
72

Why I like intonation units
- Spontaneous speech is organized differently from written language
  - "Sentences" and "paragraphs" don't really exist there
  - Phrasing is a loose phrase for … "intonation units"
    - Theoretical linguists love to discuss what they are
    - An exact definition is as hard to find as it is to parse spontaneous speech
  - Prosodic markers help replace various written markers
- Maybe emotion is not an "orthogonal" bit of information on top of these (the signal+noise model)
  - If emotion modifies these, it would be very useful if we could identify the prosodic markers we use and the ways we modify them when we're emotional
- Problem: engineers don't like ill-defined concepts!
  - But emotion is one of them too, isn't it?
73

Just to provoke some thought:

From a paper on animation (think of it – these guys have to integrate speech and image to make them fit naturally):

"… speech consists of a sequence of intonation phrases. Each intonation phrase is realized with fluid, continuous articulation and a single point of maximum emphasis. Boundaries between successive phrases are associated with perceived disjuncture and are marked in English with cues such as pitch movements … Gestures are performed in units that coincide with these intonation phrases, and points of prominence in gestures also coincide with the emphasis in the concurrent speech …" [Stone et al., SIGGRAPH 2004]
74

We haven't even discussed WP3 issues –
- What are the scales/categories?
  - Possibility 1: emotional labeling
  - Possibility 2: psychological scales (such as valence/activation – e.g. Feeltrace)
- QUESTION: which is more directly related to speech features?
  - Hopefully we'll hammer out a tentative answer by Tuesday…
76

Evaluating results
- Results often demonstrate how elusive the solution is …
- Consider a similar problem: speech recognition. To evaluate results:
  - Make recordings
  - Submit them to an algorithm
  - Measure the recognition rate!
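That evaluation loop reduces to one line of arithmetic, which is exactly what makes speech recognition easy to score (toy labels, for illustration only):

```python
# Recognition rate: fraction of reference labels the system got right.
def recognition_rate(reference, hypothesis):
    correct = sum(r == h for r, h in zip(reference, hypothesis))
    return correct / len(reference)

reference  = ["yes", "no", "stop", "go", "yes"]
hypothesis = ["yes", "no", "stop", "no", "yes"]
print(recognition_rate(reference, hypothesis))  # → 0.8
```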
- Emotion recognition results are far more difficult to quantify
  - Heavily dependent on induction techniques and labeling methods
77

Several popular contexts:
- Acted prototypical emotions
- Call center data
  - Real
  - WoZ type
- Media (radio, TV) based data
- Narrative speech (event recollection)
- Synthesized speech (Montero, Gobl)

Most of these methods can be placed on the spectrum between:
- Acted, full-blown bursts of stereotypical emotions
- Fully natural mixtures of mood, affect and bursts of difficult-to-label emotions, recorded in noisy environments
78

Call centers
- A real-life scenario (with commercial interests…)!
- Sparse emotional content:
  - Controlled (usually)
  - Negative (usually)
- Lends itself easily to WOZ scenarios
79

Ang et al., 2002
- Standardized call-center data from 3 different sources
- Uninvolved users, true HMI interaction
- Detects neutral/annoyance/frustration
- Mostly automatic extraction, with some additional human labeling
- Defines human "accuracy" as 75%
  - But this is actually the percentage of human consensus
- Machine accuracy is comparable
- A possible measure: maybe "accuracy" is where users wanted human intervention
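The accuracy-versus-consensus distinction matters: a figure like that 75% is reproducible only as an agreement rate over labelers, roughly as sketched below (the labeler counts and labels are invented for illustration):

```python
# Majority-consensus rate among labelers: the fraction of items on which
# more than half the labelers chose the same category. This is what a
# "human accuracy" figure without ground truth actually measures.
from collections import Counter

def consensus_rate(label_sets):
    """label_sets: one list of labeler judgments per utterance."""
    agreed = 0
    for labels in label_sets:
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count > len(labels) / 2:
            agreed += 1
    return agreed / len(label_sets)

# Three labelers, four utterances:
labels_per_utterance = [
    ["neutral", "neutral", "neutral"],     # full agreement
    ["annoyed", "annoyed", "frustrated"],  # majority agreement
    ["annoyed", "neutral", "frustrated"],  # no majority
    ["neutral", "neutral", "annoyed"],     # majority agreement
]
print(consensus_rate(labels_per_utterance))  # → 0.75
```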
80

Batliner et al.
- Professional acting, amateur acting, WOZ scenario
  - The latter with uninvolved users, true HMI interaction
- Detects trouble in communication
  - Much thought was given to this definition!
- Combines prosodic features with others:
  - POS labels
  - Syntactic boundaries
- Overall, shows a typical result: the closer we get to "real" scenarios, the more difficult the problem becomes!
  - Up to 95% on acted speech
  - Up to 79% on read speech
  - Up to 73% on WOZ data
81

Devillers et al.
- Real call center data
  - Also contains fear (of losing money!)
- Human–human interaction, involved users
- Human accuracy of 75% is reported
  - Is this, as in Ang, the degree of human agreement?
- Uses a small number of intonation features
  - Treats pauses and filled pauses separately
- Some results: different behavior between clients and agents, males and females
  - Was classification attempted as well?
82

Games and simulators
- These provide an extremely interesting setting
- Participants can often be found to experience real emotions
- The experimenter can sometimes control these to a certain extent
  - Such as driving conditions or additional tasks in a driving simulator
83

Fernandez & Picard (2000)
- Subjects did math problems while driving a simulator
  - This was supposed to induce stress
- Spectral features were used
  - No prosody at all!
- Advanced classifiers were applied
- Results were inconsistent across users, raising a familiar question: is it the classifier, or is it the data?
84

Kehrein (2002)
- 2 subjects in 2 separate rooms:
  - One had instructions
  - One had a set of Lego building blocks
  - The first had to explain to the other what to construct
- A wide range of "natural" emotions was reported
- His thesis is in German
- No classification was attempted
85

Acted speech
- Widely used
- An ever-recurring question: does it reflect the way emotions are expressed in spontaneous speech?
86

McGilloway et al.
- ASSESS used for feature extraction
- Speech read by non-professionals
- Emotion-evoking texts
- Categories: sadness, happiness, fear, anger, neutral
- Up to 55% recognition
87

Recalled emotions
- Subjects are asked to recall emotional episodes and describe them
- Data is composed of long narratives
- It isn't clear whether subjects actually re-experience these emotions or just recount them as "observers"
- Can contain good instances of low-key pervasive emotions
90

Robust raw feature extraction
- Pitch and VAD (voice activity detection)
- Intensity (normalization)
- Vocal quality
- Duration – is this still an open problem?
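As a minimal illustration of the first item on that list, a frame-energy VAD can be sketched in a few lines. Real extractors are far more robust; the frame size, threshold, and toy signal below are arbitrary choices of mine.

```python
# Toy energy-based voice activity detection: a frame is "voiced" when
# its mean squared amplitude exceeds a fixed threshold.
def frame_energies(signal, frame_len):
    """Mean squared amplitude per non-overlapping frame."""
    return [sum(s * s for s in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def simple_vad(signal, frame_len=4, threshold=0.01):
    """Mark each frame True (speech) or False (silence)."""
    return [e > threshold for e in frame_energies(signal, frame_len)]

silence = [0.001, -0.002, 0.001, 0.0]
speech  = [0.5, -0.4, 0.6, -0.5]
print(simple_vad(silence + speech + silence))  # → [False, True, False]
```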
91

Determination of time intervals
- This might have to be addressed on a theoretical vs. practical level:
  - Phones?
  - Words?
  - Tunes?
  - Intonation units?
  - Fixed-length intervals?
92

Feature extraction
- Which features are most relevant to emotion?
- How do we separate noise (speaker mannerisms, culture, language, etc.) from the signals of emotion?
94

Tangible results we are expected to deliver:
- Tools
- Exemplars
95

Tools:
Something along the lines of: solutions to parts of the problem that people can actually download and use right off
96

Exemplars:
These should cover a wide scope –
- Concepts
- Methodologies
- Knowledge pools – tutorials, reviews, etc.
- Complete solutions to "reduced" problems
- Test-bed systems
- Designs for future systems/applications
97

Tools – suggestions:
- Useful feature extractors:
  - Robust pitch detection and smoothing methods
- Public-domain segment/speech recognizers
- Synthesis engines or parts thereof
  - E.g. emotional prosody generators
- Classifying engines
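For the pitch-smoothing item, a 3-point median filter is the classic first defense against isolated octave errors in a raw F0 track; the track values below are illustrative assumptions, not measurements.

```python
# 3-point median smoothing of a pitch (F0) track: each interior value is
# replaced by the median of itself and its two neighbors, which removes
# single-frame spikes such as octave-doubling errors.
from statistics import median

def median_smooth(f0_track):
    """3-point median filter; endpoints are left unchanged."""
    if len(f0_track) < 3:
        return list(f0_track)
    out = [f0_track[0]]
    for i in range(1, len(f0_track) - 1):
        out.append(median(f0_track[i - 1:i + 2]))
    out.append(f0_track[-1])
    return out

# A spurious octave jump at index 2 is removed:
track = [120.0, 122.0, 240.0, 121.0, 119.0]
print(median_smooth(track))  # → [120.0, 122.0, 122.0, 121.0, 119.0]
```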
98

Exemplars – suggestions:
- Knowledge bases:
  - A taxonomy of speech features
    - Papers (especially short ones) say what we used
    - What about why? And what we didn't use?
    - What about what we wished we had?
- Test-bed systems:
  - A working modular SAL (credit to Marc Schroeder)
    - Embodies analysis, classification, synthesis, emotion induction/data collection … like a breeder nuclear reactor!
    - Parts of it already exist
    - Human parts can be replaced by automated ones as they develop
99

Exemplars – suggestions (cont.):
- More focused systems:
  - Call center systems
    - Deal with sparse emotional content
    - Emotions vary over a relatively small range
- Standardized (provocative?) data
  - Exemplifying difficulties on different levels: feature extraction, emotion classification
  - Maybe in conjunction with WP5
- Integration
  - Demonstrations of how different modalities can complement/enhance each other
100

How do we get useful info from WP3 and WP5?
- Categories
- Scales
- Models (pervasive, burst, etc.)
101

What is it realistic to expect?
- Useful info from other workgroups
  - WP3:
    - Models of emotional behavior in different contexts
    - Definite scales and categories for measuring it
  - WP5:
    - Databases embodying the above
    - Data which exemplifies the scale from clearly identifiable … to … difficult to identify
102

What is it realistic to expect?
- Exemplars that show
  - Some of the problems that are easier to solve
  - The many problems that are difficult to solve
  - Directions for useful further research
  - How not to repeat previous errors
103

Some personal thoughts
- Oversimplification is a common pitfall to be avoided
- Looking at real data, one finds that emotion is often
  - Difficult to describe in simple terms
  - Jumping between modalities (text might be considered a separate modality)
  - Extremely dependent on context, character, settings, personality
- A task so complex for humans cannot be easy for machines!
104

Summary
- Speech is a major channel for signaling emotional information
  - And lots of other information too
- HUMAINE will not solve all the issues involved
  - We should focus on those that can benefit most from the expertise and collaboration of its members
- Examining multiple modalities can prove extremely interesting