Emotion and Speech: Techniques, models and results. Facts, fiction and opinions. Past, present and future. Acted, spontaneous, recollected. In Asia, Europe and America. And the Middle East. HUMAINE Workshop on Signals and Signs (WP4), Santorini, September 2004


1

Emotion and Speech

Techniques, models and results
Facts, fiction and opinions
Past, present and future
Acted, spontaneous, recollected
In Asia, Europe and America
And the Middle East

HUMAINE Workshop on Signals and Signs (WP4), Santorini, September 2004

2

Overview

A short introduction to speech science ... and speech analysis tools
Speech and emotion: models, problems ... and results
A review of open issues
Deliverables within the HUMAINE framework

3

Part 1: Speech science in a nutshell

4

A short introduction to SPEECH:

Most of those present here are familiar with various aspects of signal processing
For the benefit of those who aren’t acquainted with the speech signal in particular:
  We’ll start with an overview of speech production models and analysis techniques
  The rest of you can sleep for a few minutes

5

The speech signal

A 1-D signal
  Does that make it a simple one? NO...
There are many analysis techniques
As with many types of systems, parametric models are very useful here
A simple and very useful speech production model: the source/filter model
(In case you’re worried: we’ll see that this is directly related to emotions too)

6

The source/filter model

Components:
  The lungs (create air pressure)
  Two elements that turn this into a “raw” signal:
    The vocal folds (periodic signals)
    Constrictions that make the airflow turbulent (noise)
  The vocal tract
    Partly immobile: upper jaw, teeth
    Partly mobile: soft palate, tongue, lips, lower jaw (also called “articulators”)
    Its influence on the raw signal can be modeled very well with a low-order (~10) digital filter

[Figure labels: source; filter]
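A minimal sketch of the source/filter idea, assuming an idealized impulse-train source and a cascade of five two-pole resonators (which together form a 10th-order all-pole filter, in line with the "~10" above). The function names and the formant values are illustrative assumptions, not taken from the talk.

```python
import math

def impulse_train(f0_hz, sample_rate, n_samples):
    """Idealized voiced source: one unit pulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients (a1, a2) of a two-pole resonator modeling one formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    return (-2.0 * r * math.cos(theta), r * r)

def all_pole_filter(x, a):
    """Apply y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

def synthesize(f0_hz, formants, sample_rate, n_samples):
    """Source (impulse train) passed through the filter (formant cascade)."""
    signal = impulse_train(f0_hz, sample_rate, n_samples)
    for freq_hz, bandwidth_hz in formants:
        signal = all_pole_filter(signal, resonator(freq_hz, bandwidth_hz, sample_rate))
    return signal

# Made-up formant frequencies and bandwidths in Hz, for illustration only.
FORMANTS = [(700, 80), (1200, 90), (2500, 120), (3500, 150), (4500, 200)]
vowel = synthesize(120, FORMANTS, 16000, 320)
```

Changing `f0_hz` alters only the source (pitch), while changing `FORMANTS` alters only the filter, which is the separation the model relies on.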

7

The net result:

A complex signal that changes its properties constantly:
  Sometimes periodic
  Sometimes colored noise
  Approximately stationary over time windows of ~20 milliseconds
And of course, it contains a great deal of information:
  Text: linguistic information
  Other stuff: paralinguistic information
    Speaker identity
    Gender
    Socioeconomic background
    Stress, accent
    Emotional state
    Etc. ...
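Because the signal is only approximately stationary over ~20 ms, analysis is normally done frame by frame. A minimal sketch of that framing step follows; the 10 ms hop between frames is an assumption chosen for illustration.

```python
def split_into_frames(samples, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Cut a signal into overlapping short-time frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    # Only full-length frames are kept; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# One second of placeholder samples at 16 kHz.
frames = split_into_frames(list(range(16000)), 16000)
```

Each frame can then be handed to a pitch, intensity or spectral estimator as if it were a stationary signal.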

8

How is this information coded?

Textual information: mainly in the filter and the way it changes its properties over time
  Filter “snapshots” are called segments
Paralinguistic information: mainly in the source parameters
  Lung pressure determines the intensity
  Vocal fold periodicity determines instantaneous frequency, or “pitch”
  Configuration of the glottis determines overall spectral tilt, or “voice quality”

9

Prosody:

Prosody is another name for part of the paralinguistic information, composed of:
  Intonation: the way in which pitch changes over time
  Intensity: changes in intensity over time
    Problem: some segments are inherently weaker than others
  Rhythm: segment durations vs. time
Prosody does not include voice quality, but voice quality is also part of the paralinguistic information

10

To summarize:

Speech science is at a mature stage
The source/filter model is very useful in understanding speech production
Many applications (speech recognition, speaker verification, emotion recognition, etc.) require extraction of the model parameters from the speech signal (an inverse problem)
This is the domain of: speech analysis techniques

11

Part 2: Speech analysis and classification

12

The large picture: speech analysis in the HUMAINE framework

Speech analysis is just one component in the context of speech and emotion:

[Diagram boxes: Theory of emotion; Training data; Speech analysis engine; Real data; High-level application]

Its overall objectives:
  Calculate raw speech parameters
  Extract features salient to emotional content
  Discard irrelevant features
  Use them to characterize and maybe classify emotional speech

13

Signals to Signs - The process

[Diagram labels: Databases; Files; Data Cleaning and Integration; Data Warehouse; Selection and Transformation; Data Representation; Data Mining; Patterns; Evaluation and Presentation; Knowledge]

14

S2S (SOS...?) - The tools

A combination of techniques that belong to different disciplines:
  Data warehouse technologies (data storage, information retrieval, query answering, etc.)
  Data preprocessing and handling
  Data modeling / visualization
  Machine learning (statistical data analysis, pattern recognition, information retrieval, etc.)

15

The objective of speech analysis techniques

1. To extract the raw model parameters from the speech signal
   Interfering factors:
     Reality never exactly fits the model
     Background noise
     Speaker overlap
2. To extract features
3. To interpret them in meaningful ways (pattern recognition)
   Really hard!

16

It remains that -

Useful models and techniques exist for extracting the various information types from the speech signal
Yet ... many applications, such as speech recognition, speaker identification, speech synthesis, etc., are far from being perfected
... So what about emotion?

17

For the moment, let’s focus on the small picture

The consensus is that emotions are coded in:
  Prosody
  Voice quality
  And sometimes in the textual information
Let’s discuss the purely technical aspects of evaluating all of these ...

18

Extracting features from the speech signal

Stage 1 - Extracting raw features:
  Pitch
  Intensity
  Voice quality
  Pauses
  Segmental information: phones and their duration
  Text
(By the way ... who extracts them: man, machine or both?)

19

Pitch

Pitch: the instantaneous frequency
  Sounds deceptively simple to find, but it isn’t!
  Lots of research has been devoted to pitch detection
Composed of two sub-problems:
  For a given signal, is there periodicity at all?
  If so, what’s the fundamental frequency?
Complicating factors:
  Speaker-related factors: hoarseness, diplophony, etc.
  Background-related factors: noise, overlapping speakers, filters (as in telephony)
In the context of emotions:
  Small errors are acceptable
  Large errors (octave jumps, false positives) are catastrophic
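The two sub-problems (is the frame periodic at all, and if so at what frequency) can be illustrated with a plain autocorrelation detector. This is a textbook sketch, not the algorithm PRAAT uses; the search range of 75 to 500 Hz and the voicing threshold of 0.5 are assumed values.

```python
import math

def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=500.0, voicing_threshold=0.5):
    """Return an F0 estimate in Hz, or None if the frame looks unvoiced."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return None  # silence: no periodicity to speak of
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Normalized autocorrelation; equals 1.0 at lag 0.
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # Sub-problem 1: voicing decision. Sub-problem 2: lag -> frequency.
    if best_lag is None or best_corr < voicing_threshold:
        return None
    return sample_rate / best_lag

# A 40 ms frame of a pure 200 Hz tone sampled at 8 kHz.
tone = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(320)]
f0 = estimate_pitch(tone, 8000)
```

On real speech a detector this simple is prone to exactly the octave jumps and false voicing decisions listed above, which is why production tools add candidate tracking and post-processing.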

20

An example:

The raw pitch contour in PRAAT:

[Figure: raw pitch contour, with errors marked]

21

Intensity

Appears to be even simpler than pitch!
Intensity is quite easy to measure ...
  Yet it is the feature most influenced by unrelated factors!
Aside from the speaker, intensity is gravely affected by:
  Distance from the microphone
  Gain settings in the recording equipment
    Clipping
    AGC (automatic gain control)
  Background noise
  Recording environment
Without normalization, intensity is almost useless!
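One simple form of normalization is to z-score the intensity contour within each recording, which removes a constant gain or microphone-distance offset. This is a sketch of that single choice, not a method prescribed in the talk, and it does nothing for clipping, AGC or background noise.

```python
import math
import statistics

def frame_intensity_db(frame, floor=1e-10):
    """Mean-square energy of one frame in dB (floored to avoid log of zero)."""
    mean_square = sum(v * v for v in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))

def normalize_intensity(contour_db):
    """Z-score a per-frame dB contour within one recording."""
    mu = statistics.mean(contour_db)
    sigma = statistics.pstdev(contour_db)
    if sigma == 0.0:
        return [0.0 for _ in contour_db]
    return [(v - mu) / sigma for v in contour_db]

# The same toy utterance recorded at two gain settings (a factor of 10 apart).
loud = [frame_intensity_db(f) for f in [[0.1, -0.1], [0.5, -0.5], [1.0, -1.0]]]
quiet = [frame_intensity_db(f) for f in [[0.01, -0.01], [0.05, -0.05], [0.1, -0.1]]]
```

The raw contours `loud` and `quiet` differ by 20 dB throughout, but their normalized versions coincide.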

22

Voice quality

Several measures are used to quantify it:
  Local irregularity in pitch and intensity
  Ratio between harmonic components and noise components
  Distribution of energy in the spectrum
Affected by a multitude of factors other than emotions
Some standardized measures are often used in clinical applications
A large factor in emotional speech!
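"Local irregularity in pitch and intensity" is commonly quantified as jitter and shimmer. The sketch below uses the usual "local" definition (mean absolute difference between consecutive cycles, divided by the mean); it assumes the glottal cycle periods and peak amplitudes have already been extracted, and the numbers are made up.

```python
def local_perturbation(values):
    """Mean absolute cycle-to-cycle difference, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical per-cycle measurements from one sustained vowel.
periods_s = [0.010, 0.011, 0.010, 0.011]   # cycle durations in seconds
peak_amplitudes = [0.50, 0.48, 0.51, 0.49]  # peak amplitude per cycle

jitter = local_perturbation(periods_s)         # irregularity in pitch
shimmer = local_perturbation(peak_amplitudes)  # irregularity in intensity
```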

23

Segments

There are different ways of defining precisely what these are
Automatic segmentation is difficult, though not as difficult as speech recognition
Even the segment boundaries can give important timing information related to rhythm, an important component of prosody

24

Text

Is this “raw” data or not? Is it data ... at all?
  Some studies on emotion specifically eliminated this factor (filtered speech, uniform texts)
  Other studies are interested mainly in text
If we want to deal with text, we must keep in mind:
  Automated speech recognition is HARD!
    Especially with strong background noise
    Especially when strong emotions are present, modifying the speakers’ normal voices and mannerisms
    Especially when dealing with multiple speakers

25

Some complicating factors in raw feature extraction:

Background noise
Speaker overlap
Speaker variability
Variability in recording equipment

26

In the general context of speech analysis -

The raw features we discussed are not specific to the study of emotion
Yet issues related to calculating them reliably crop up again and again in emotion-related studies
Some standard and reliable tools would be very helpful

27

Two opposing approaches to computing raw features:

Assume we have perfect algorithms for extracting all this information
  If we don’t, help out manually
  This can be carried out only over small databases
  Useful in purely theoretical studies
Acknowledge we only have imperfect algorithms
  Find how to deal automatically with imperfect data
  Very important for large databases

[Diagram labels: Ideal; Error-prone; Real life]

28

Next - what do we do with it all?

Reminder: we have large amounts of raw data
Now we have to make some meaning from it

29

Feature extraction ...

Stage 2 - data reduction:
  Take a sea of numbers
  Reduce it to a small number of meaningful measures
  Prove they’re meaningful
An interesting way to look at it:
  Separating the “signal” (e.g. emotion) from the “noise” (anything else)

30

An example of “Noise”:

Here pitch and intensity have totally unemotional (but important) roles: [Deller et al.]

31

Examples of high level features

Pitch fitting - stylization
MoMel
Parametric modeling
Statistics

32

An example:

The raw pitch contour in PRAAT:

Errors:

33

Patching it up a bit:

[Figure: two pitch contour panels; horizontal axis Time (s), 0 to 3.39769; vertical axis 0 to 500]

34

One way to extract the essential information:

[Figure: two pitch contour panels; horizontal axis Time (s), 0 to 3.39769; vertical axis 0 to 500]

Pitch stylization – IPO method

35

Another way to extract the essential information:

MoMel

36

Yet another way to extract the essential information:

MoMel

37

Some observations:

Different parameterizations give different curves, and so different features
Yet perceptually, they are all very similar

38

Questions:

We can ask: what is the minimal or most representative information needed to capture the pitch contour?
More importantly, though: what aspects of the pitch contour are most relevant to emotion?

39

Several answers appear in the literature:

Statistical features taken from the raw contour:
  Mean, variance, max, min, range, etc.
Features taken from parameterized contours:
  Slopes, “main” peaks and dips, etc.
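The statistical features named above are straightforward to compute once unvoiced frames are set aside. A minimal sketch, assuming the contour is a list of per-frame F0 values in Hz with `None` marking unvoiced frames (that representation and the `voiced_fraction` extra are assumptions):

```python
import statistics

def pitch_statistics(contour_hz):
    """Global statistics of a pitch contour, ignoring unvoiced (None) frames."""
    voiced = [f for f in contour_hz if f is not None]
    if not voiced:
        return None
    return {
        "mean": statistics.mean(voiced),
        "variance": statistics.pvariance(voiced),
        "min": min(voiced),
        "max": max(voiced),
        "range": max(voiced) - min(voiced),
        "voiced_fraction": len(voiced) / len(contour_hz),
    }

# A made-up five-frame contour with two unvoiced frames.
stats = pitch_statistics([100.0, None, 120.0, 140.0, None])
```

Note that a single octave error in the raw contour shifts max, range and variance badly, which is one reason the talk calls such errors catastrophic.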

40

There’s not much time to go into:

Intensity contours
Spectra
Duration

But the problems are very similar

41

The importance of time frames

We have several measures that vary over time
  Over what time frame should we consider them?
The meaning we attribute to speech parameters is dependent on the time frame over which they’re considered:
  Fixed length windows
  Phones
  Words
  “Intonation units”
  “Tunes”

42

Which time frame is best?

Fixed time frames of several seconds: simple to implement, but naïve
  Very arbitrary
Words
  Need a recognizer to be marked
  Probably the shortest meaningful frame
“Intonation units”
  Nobody knows exactly what they are (one “idea” per unit?)
  Hard to measure
  Correlate best with coherent stretches of speech
“Tunes”: from one pause to the next
  Feasible to implement
  Correlate to some extent with coherent stretches of speech
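To show why pause-to-pause "tunes" are feasible to implement, here is a minimal sketch that splits a per-frame intensity contour at pauses. The silence threshold of -40 dB and the minimum pause length of 3 frames are assumed values that would need tuning per recording.

```python
def split_into_tunes(intensity_db, silence_db=-40.0, min_pause_frames=3):
    """Return (start, end) frame index pairs, end exclusive, one per tune.

    A pause is a run of at least min_pause_frames frames at or below
    silence_db; shorter dips stay inside the current tune.
    """
    tunes = []
    start = None      # first frame of the tune currently being built
    silent_run = 0    # consecutive silent frames seen inside that tune
    for i, level in enumerate(intensity_db):
        if level > silence_db:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                tunes.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Close the last tune, trimming any trailing short silence.
        tunes.append((start, len(intensity_db) - silent_run))
    return tunes

# Toy contour: speech, a one-frame dip, speech, a three-frame pause, speech.
levels = [-60, -20, -20, -60, -20, -60, -60, -60, -20, -20]
tunes = split_into_tunes(levels)
```

Features such as pitch mean or slope can then be computed per tune rather than per fixed window.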

43

Why is this such an important decision?

It might help us interpret our data correctly!

44

Therefore ... the problem of feature extraction:

Is NOT a general one
We want features that are specifically relevant to emotional content ...

But before we get to that, we have:

45

The Data Mining part

Stage 3: To extract knowledge = previously unknown information (rules, constraints, regularities, patterns, etc.) from the features database

46

What are we mining?

We look for patterns that either describe the stored data or infer from it (predictions)

Summarization and characterization (of the class of data that interests us):

slope       20
pause       30
accent 1    15
accent 2    30
duration    5

Discrimination and comparison of features of different classes:

                 Eran   Rafi   Haim   Yuval
before gamble    20     25     20     15
after gamble     10     18     15     15

[Charts: the two tables above, plotted]

47

Types of Analysis

Association analysis: rules of the form X => Y (DB tuples that satisfy X are likely to satisfy Y), where X and Y are pairs of an attribute and a value or set of values
Classification and class prediction: find a set of functions that describe and distinguish data classes/concepts and that can be used to predict the class of unlabeled data
Cluster analysis (unsupervised clustering): analyze the data when there are no class labels, to deal with new types of data and help group similar events together

48

Association Rules

We search for interesting relationships among items in the data
Interestingness measures for a rule A => B:
  Support = (# tuples that contain both A and B) / (# tuples)
  Confidence = (# tuples that contain both A and B) / (# tuples that contain A)
Support measures usefulness: $P(A \cap B)$
Confidence measures certainty: $P(B \mid A)$
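Support and confidence follow directly from the two counting definitions above. A small sketch, assuming each database tuple is represented as a set of "attribute=value" strings; the four tuples are invented purely for illustration.

```python
def support(tuples, a, b):
    """Fraction of all tuples that contain both itemsets a and b."""
    both = sum(1 for t in tuples if a <= t and b <= t)
    return both / len(tuples)

def confidence(tuples, a, b):
    """Among tuples that contain a, the fraction that also contain b."""
    with_a = [t for t in tuples if a <= t]
    if not with_a:
        return 0.0
    return sum(1 for t in with_a if b <= t) / len(with_a)

# Hypothetical feature database: one tuple per utterance.
DB = [
    {"pitch=high", "emotion=anger"},
    {"pitch=high", "emotion=anger"},
    {"pitch=high", "emotion=joy"},
    {"pitch=low", "emotion=sadness"},
]

rule_support = support(DB, {"pitch=high"}, {"emotion=anger"})
rule_confidence = confidence(DB, {"pitch=high"}, {"emotion=anger"})
```

Here the rule pitch=high => emotion=anger holds in 2 of 4 tuples (support) and in 2 of the 3 tuples with high pitch (confidence).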

49

Classification

A two-step process:
1. Use data tuples with known labels to construct a model
2. Use the learned model to classify (assign labels to) new data

Since the class label of each training sample is known, this is supervised learning
Data is divided into two groups: training data and test data
Test data is used to estimate the predictive accuracy of the learned model
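The two-step process and the training/test division can be sketched with a nearest-centroid classifier, chosen here only because it is short; the talk does not name a specific classifier. The two-dimensional feature vectors and the labels are toy values.

```python
import math

def train_nearest_centroid(features, labels):
    """Step 1: build a model (one mean vector per class) from labeled data."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(model, x):
    """Step 2: label new data with the class of the closest centroid."""
    return min(model, key=lambda y: math.dist(model[y], x))

def accuracy(model, features, labels):
    """Estimate predictive accuracy on held-out test data."""
    hits = sum(1 for x, y in zip(features, labels) if classify(model, x) == y)
    return hits / len(labels)

# Toy data, split by hand into a training group and a test group.
train_x = [(1.0, 1.0), (1.0, 2.0), (8.0, 9.0), (9.0, 8.0)]
train_y = ["calm", "calm", "aroused", "aroused"]
test_x = [(2.0, 1.0), (9.0, 9.0)]
test_y = ["calm", "aroused"]

model = train_nearest_centroid(train_x, train_y)
test_accuracy = accuracy(model, test_x, test_y)
```

Accuracy measured on the training group itself would be optimistic, which is why the test group is kept apart.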

50

Assets

No need to know the rules in advance
Some rules are not easily formulated as mathematical or logical expressions
Similar to one of the ways humans learn
Could be more robust to noise and incomplete data
May require a lot of samples
Learning depends on existing data only!

51

Algorithms:
  Machine learning (statistical learning)
  Expert systems
  Computational neuroscience

Dangers:
  The model might not be able to learn
  There might not be enough data
  Over-fitting the model to the training data

52

Prediction

Classification predicts categorical labels
Prediction models a continuous-valued function
  It is usually used to predict the value, or a range of values, of an attribute of a given sample
  Regression
  Neural networks
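As a minimal instance of prediction by regression, the sketch below fits a straight line by ordinary least squares and uses it to predict a continuous value. The pairing of a speech feature with a rated value is a hypothetical example, and the numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(line, x):
    slope, intercept = line
    return slope * x + intercept

# Hypothetical training pairs: (feature value, continuous rating).
line = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
predicted = predict(line, 4.0)
```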

53

Clustering

Constructing models for assigning class labels to data that is unlabeled
  Unsupervised learning
Clustering is an ill-defined task
Once clusters are discovered, the clustering model can be used for predicting labels of new data
Alternatively, the clusters can be used as labels to train a supervised classification algorithm
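A minimal k-means sketch illustrates unsupervised clustering; k-means is one common choice, not one named in the talk. The initial centroids are passed in by the caller so that the result is reproducible, and the points are toy data.

```python
import math

def kmeans(points, initial_centroids, n_iter=20):
    """Plain k-means: alternate between assigning points and moving centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its members (if it has any).
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, cluster_ids = kmeans(points, [(0, 0), (10, 10)])
```

The resulting `cluster_ids` can label new data by nearest centroid, or serve as training labels for a supervised classifier, as the slide suggests. The outcome depends on the number of clusters and the starting centroids, which is part of why the task is called ill-defined.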

54

So how does this technical Mumbo Jumbo tie into -

55

Part 3: Speech and emotion

56

Speech and emotion

Emotion can affect speech in many ways:
  Consciously
  Unconsciously
  Through the autonomic nervous system
Examples:
  Textual content is usually consciously chosen, except perhaps for sudden interjections, which may stem from sudden or strong emotions
  Many speech patterns related to emotions are strongly ingrained; therefore, though they can be controlled by the speaker, most often they are not, unless the speaker tries to modify them consciously
  Certain speech characteristics are affected by the degree of arousal, and are therefore nearly impossible to inhibit (e.g. vocal tremor due to grief)

57

Speech analysis: the big picture - again

Speech analysis is just one component in the context of speech and emotion:

[Diagram boxes: Real data; Speech analysis; Application; Theories of emotion; Databases]

58

Is this just another way to spread the blame?

We speech analysis guys are just poor little engineers
  The methods we can supply can be no better than the theory and the data that drive them
... and unfortunately, the jury is still out on both of those points ... or not?
  Ask the WP3 and WP5 people
  They’re here somewhere
Actually -
  One of the difficulties HUMAINE is intended to ease is that researchers in the field often find themselves having to address all of the above! (guilty)

59

The most fundamental problem:

What are the features that signify emotion? To paraphrase: what signals are signs of emotion?

60

The most common solutions:

Calculate as many as you can think of
Intuition
Theory-based answers
Data-driven answers

Ha! Once more, it’s not our fault!

61

What seems to be the most plausible approach -

The data-driven approach
Requiring:
  Emotional speech databases (“corpora”)
  Perceptual evaluation of these databases
This is then correlated with speech features
  Which takes us back to a previous square

62

So tell us already: how does emotion influence speech?

... It seems that the answer depends on how you look for it
As hinted before, the answer cannot really be separated from:
  The theories of emotion
  The databases we have of emotional speech -
    Who the subjects are
    How emotion was elicited

63

A short digression -

Will all the speech clinicians in the audience please stand up?

Hmm…. We don’t seem to have so many

Let’s look at what one of them has to say

64

Emotions in the speech Clinic

Some speakers have speech/voice problems that modify their “signal”, thus misleading the listener

VOICE – People with vocal instability (high jitter/shimmer/tremor) are clinically perceived as nervous (although the problems reflect irregularity in the vocal folds).

- Breathy voice (in women) is, sometimes, perceived as “sexy” (while it actually reflects incomplete adduction of the vocal folds).

- Higher excitation level leads to vocal instability (high jitter/shimmer/ tremor)

65

Clinical Examples:

STUTTERING – listeners judge people who stutter as nervous, tense, and less confident (identification of stuttering depends on pause duration within the “repetition units”, and on rate of repetitions).

CLUTTERING – listeners judge cluttering people as nervous and less intelligent

66

So - though this is a WP4 meeting ...

It’s impossible to avoid talking about WP3 (theory of emotion) and WP5 (databases) issues
The signs we’re looking for can never be separated from the questions:
  Signs of what (emotions)?
  Signs in what (data)?
May God and Phillipe Gelin forgive me ...

67

A not-so-old example: (Murray and Arnott, 1993)

Very qualitative
Presupposes dealing with primary emotions

68

BUT … If you expect more recent results to give more detailed descriptive outlines, then you’re wrong

The data-driven approaches use a large number of features, and let the computer sort them out:
- 32 significant features found by ASSESS, from the initial 375 used
- 5 emotions, acted
- 55% recognition

69

Some remarks:

- Some features are indicative, even though we probably don’t use them perceptually
  - e.g. pitch mean: usually this is raised with higher activation
  - But we don’t have to know the speaker’s neutral mean to perceive heightened activation
  - My guess: voice quality is what we perceive in such cases
- How “simple” can characterization of emotions become?
  - How many features do we listen for?
  - Can this be verified?
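The pitch-mean remark can be illustrated with a small sketch. Unlike a listener, a simple feature-based system needs the speaker's neutral mean as a reference before a raised pitch mean tells it anything. The sketch below expresses the shift in semitones, 12 · log2(f / f_ref). The F0 tracks are invented values, the convention that 0 marks an unvoiced frame is an assumption, and the function names are hypothetical.

```python
import math


def mean_f0(f0_track_hz):
    """Mean F0 over voiced frames; 0 is assumed to mark unvoiced frames."""
    voiced = [f for f in f0_track_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in track")
    return sum(voiced) / len(voiced)


def semitones_above(f0_hz, reference_hz):
    """Distance of f0_hz from reference_hz in semitones."""
    return 12.0 * math.log2(f0_hz / reference_hz)


# Invented F0 tracks in Hz for one speaker.
neutral_track = [0.0, 118.0, 120.0, 122.0, 0.0]
aroused_track = [150.0, 0.0, 160.0, 170.0]

shift = semitones_above(mean_f0(aroused_track), mean_f0(neutral_track))
print(f"pitch mean shift: {shift:.2f} semitones")
```

Without `neutral_track` the shift cannot be computed at all, which is exactly the asymmetry between machine and listener that the remark points at.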

70

Time intervals

This issue becomes more and more important as we go towards “natural” data

Emotion production: How long do emotions last?
- Full blown emotions are usually short (but not always! Look at Peguy in the LIMSI interview database)
- Moods, or pervasive emotions, are subtle but long lasting

Emotion analysis: Over what span of speech are they easiest to detect?

71

From the analysis viewpoint:

Current efforts seem to be focusing on methods that aim to use time spans that have some inherent meaning:
- Acoustically (ASSESS – Cowie et al)
- Linguistically (Batliner et al)

We mentioned that prosody carries
- emotional information (our “signal”)
- other information (“noise”): phrasing, various types of prominence

BUT …

72

Why I like intonation units

- Spontaneous speech is organized differently from written language
  - “Sentences” and “paragraphs” don’t really exist there
  - Phrasing is a loose phrase for … “Intonation units”
    - Theoretical linguists love to discuss what they are
    - An exact definition is as hard to find as it is to parse spontaneous speech
  - Prosodic markers help replace various written markers
- Maybe emotion is not an “orthogonal” bit of information on top of these (the signal+noise model)
  - If emotion modifies these, it would be very useful if we could identify the prosodic markers we use and the ways we modify them when we’re emotional
- Problem: Engineers don’t like ill defined concepts!
  - But emotion is one of them too, isn’t it?

73

Just to provoke some thought:

From a paper on animation (think of it – these guys have to integrate speech and image to make them fit naturally):

“… speech consists of a sequence of intonation phrases. Each intonation phrase is realized with fluid, continuous articulation and a single point of maximum emphasis. Boundaries between successive phrases are associated with perceived disjuncture and are marked in English with cues such as pitch movements … Gestures are performed in units that coincide with these intonation phrases, and points of prominence in gestures also coincide with the emphasis in the concurrent speech…” [Stone et al., SIGGRAPH 2004]

74

We haven’t even discussed WP3 issues -

What are the scales/categories?
- Possibility 1: emotional labeling
- Possibility 2: psychological scales (such as valence/activation – e.g. Feeltrace)

QUESTION: Which is more directly related to speech features?

Hopefully we’ll hammer out a tentative answer by Tuesday..

75

Part 4:

Current results

76

Evaluating results

- Results often demonstrate how elusive the solution is …
- Consider a similar problem: Speech Recognition
  - To evaluate results:
    - Make recordings
    - Submit them to an algorithm
    - Measure the recognition rate!
- Emotion recognition results are far more difficult to quantify
  - Heavily dependent on induction techniques and labeling methods

77

Several popular contexts:
- Acted prototypical emotions
- Call center data
  - Real
  - WoZ type
- Media (radio, TV) based data
- Narrative speech (event recollection)
- Synthesized speech (monterro, gobl)

Most of these methods can be placed on the spectrum between:
- Acted, full blown bursts of stereotypical emotions
- Fully natural, mixtures of mood, affect and bursts of difficult-to-label emotions recorded in noisy environments

78

Call centers

- A real life scenario! (with commercial interests…)!
- Sparse emotional content:
  - Controlled (usually)
  - Negative (usually)
- Lends itself easily to WOZ scenarios

79

Ang et al., 2002

- Standardized call-center data from 3 different sources
- Uninvolved users, true HMI interaction
- Detects neutral/annoyance/frustration
- Mostly automatic extraction, with some additional human labeling
- Defines human “accuracy” as 75%
  - But this is actually the percentage of human consensus
- Machine accuracy is comparable
- A possible measure: maybe “accuracy” is where users wanted human intervention
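The difference between "accuracy" and "consensus" can be sketched in a few lines. Raw percent agreement between two labelers says how often they chose the same label. Cohen's kappa, a standard statistic that is not mentioned in the talk and is added here only for illustration, corrects that figure for the agreement expected by chance. The label sequences below are invented and are not data from Ang et al.; the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two labelers chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each labeler's own label frequencies.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented labels from two hypothetical annotators.
labeler_1 = ["neutral", "neutral", "annoyed", "frustrated",
             "neutral", "annoyed", "neutral", "neutral"]
labeler_2 = ["neutral", "annoyed", "annoyed", "frustrated",
             "neutral", "neutral", "neutral", "neutral"]

print(percent_agreement(labeler_1, labeler_2))
print(cohens_kappa(labeler_1, labeler_2))
```

Neither number is an accuracy in the speech-recognition sense: there is no ground truth here, only two opinions. When one label such as "neutral" dominates, chance agreement is high and kappa falls well below the raw percentage.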

80

Batliner et al.

- Professional acting, amateur acting, WOZ scenario
  - the latter with uninvolved users, true HMI interaction
- Detects trouble in communication
  - Much thought was given to this definition!
- Combines prosodic features with others:
  - POS labels
  - Syntactic boundaries
- Overall – shows a typical result: The closer we get to “real” scenarios, the more difficult the problem becomes!
  - Up to 95% on acted speech
  - Up to 79% on read speech
  - Up to 73% on WOZ data

81

Devillers et al.

- Real call center data
  - Contains also fear (of losing money!)
  - Human – human interaction, involved users
  - Human accuracy of 75% is reported
    - Is this, as in Ang, the degree of human agreement?
- Use a small number of intonation features
- Treat pauses and filled pauses separately
- Some results:
  - Different behavior between clients and agents, males and females
  - Was classification attempted also?

82

Games and simulators

- These provide an extremely interesting setting
- Participants can often be found to experience real emotions
- The experimenter can sometimes control these to a certain extent
  - Such as driving conditions or additional tasks in a driving simulator

83

Fernandez & Picard (2000)

- Subjects did math problems while driving a simulator
  - This was supposed to induce stress
- Spectral features were used
  - No prosody at all!
- Advanced classifiers were applied
- Results were inconsistent across users, raising a familiar question:
  - Is it the classifier, or is it the data?

84

Kehrein (2002)

- 2 subjects in 2 separate rooms:
  - One had instructions
  - One had a set of Lego building blocks
  - The first had to explain to the other what to construct
- A wide range of “natural” emotions was reported
- His thesis is in German
- No classification was attempted

85

Acted speech

- Widely used
- An ever-recurring question:
  - Does it reflect the way emotions are expressed in spontaneous speech?

86

McGilloway et al.

- ASSESS used for feature extraction
- Speech read by non-professionals
- Emotion evoking texts
- Categories: sadness, happiness, fear, anger, neutral
- Up to 55% recognition

87

Recalled emotions

- Subjects are asked to recall emotional episodes and describe them
- Data is composed of long narratives
- It isn’t clear if subjects actually re-experience these emotions or just recount them as “observers”
- Can contain good instances of low-key pervasive emotions

88

Ron and Amir

- Ongoing work

89

Part 5:

Open issues

90

Robust raw feature extraction

- Pitch and VAD (voice activity detection)
- Intensity (normalization)
- Vocal quality
- Duration – is this still an open problem?

91

Determination of time intervals

This might have to be addressed on a theoretical vs. practical level –
- Phones?
- Words?
- Tunes?
- Intonation units?
- Fixed length intervals?
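Of the candidates listed, fixed length intervals are the easiest to implement, which is probably why they are the practical default. The following is a minimal sketch, assuming a frame-level F0 track in which 0 marks an unvoiced frame. It slides a fixed-size window over the track and summarizes each interval by its voiced-frame mean and range. The track values, window size and hop are arbitrary illustration choices, and the function name is hypothetical.

```python
def interval_stats(f0_frames, frames_per_interval, hop):
    """Summarize an F0 track over fixed-length, possibly overlapping intervals."""
    if frames_per_interval <= 0 or hop <= 0:
        raise ValueError("interval size and hop must be positive")
    stats = []
    last_start = len(f0_frames) - frames_per_interval
    for start in range(0, last_start + 1, hop):
        chunk = f0_frames[start:start + frames_per_interval]
        voiced = [f for f in chunk if f > 0]  # 0 = unvoiced (assumption)
        if voiced:
            summary = {"start": start,
                       "mean": sum(voiced) / len(voiced),
                       "range": max(voiced) - min(voiced)}
        else:
            # An interval can fall entirely inside a pause.
            summary = {"start": start, "mean": None, "range": None}
        stats.append(summary)
    return stats


# Invented F0 track in Hz, one value per analysis frame.
f0_track = [0, 100, 110, 120, 0, 0, 0, 0, 200, 210]
for row in interval_stats(f0_track, frames_per_interval=4, hop=2):
    print(row)
```

The sketch also exposes the weakness of this choice: the window boundaries ignore phrasing, so an interval may straddle a pause or cut through what a linguist would call a single intonation unit.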

92

Feature extraction

- Which features are most relevant to emotion?
- How do we separate noise (speaker mannerisms, culture, language, etc) from the signals of emotion?

93

Part 6:

HUMAINE Deliverables

94

Tangible results we are expected to deliver:

- Tools
- Exemplars

95

Tools:

Something along the lines of: solutions to parts of the problem that people can actually download and use right off

96

Exemplars:

These should cover a wide scope -
- Concepts
- Methodologies
- Knowledge pools – tutorials, reviews, etc.
- Complete solutions to “reduced” problems
- Test-bed systems
- Designs for future systems/applications

97

Tools - suggestions:

- Useful feature extractors:
  - Robust pitch detection and smoothing methods
  - Public domain segment/speech recognizers
- Synthesis engines or parts thereof
  - E.g. emotional prosody generators
- Classifying engines

98

Exemplars - suggestions:

- Knowledge bases -
  - A taxonomy of speech features
    - Papers (especially short ones) say what we used
    - What about why? And what we didn’t use?
    - What about what we wished we had?
- Test-bed systems -
  - A working modular SAL (credit to Marc Schroeder)
    - Embodies analysis, classification, synthesis, emotion induction/data collection … like a breeder nuclear reactor!
    - Parts of it already exist
    - Human parts can be replaced by automated ones as they develop

99

Exemplars – suggestions (cont):

- More focused systems –
  - Call center systems
    - Deal with sparse emotional content
    - Emotions vary over a relatively small range
  - Standardized (provocative?) data
    - Exemplifying difficulties on different levels: feature extraction, emotion classification
    - Maybe in conjunction with WP5
- Integration
  - Demonstrations of how different modalities can complement/enhance each other

100

How do we get useful info from WP3 and WP5?

- Categories
- Scales
- Models (pervasive, burst etc)

101

What is it realistic to expect?

Useful info from other workgroups
- WP3:
  - Models of emotional behavior in different contexts
  - Definite scales and categories for measuring it
- WP5:
  - Databases embodying the above
  - Data which exemplifies the scale from clearly identifiable … to … difficult to identify

102

What is it realistic to expect?

Exemplars that show
- Some of the problems that are easier to solve
- The many problems that are difficult to solve
- Directions for useful further research
- How not to repeat previous errors

103

Some personal thoughts

- Oversimplification is a common pitfall to be avoided
- Looking at real data, one finds that emotion is often
  - Difficult to describe in simple terms
  - Jumps between modalities (text might be considered a separate modality)
  - Extremely dependent on context, character, settings, personality
- A task so complex for humans cannot be easy for machines!

104

Summary

- Speech is a major channel for signaling emotional information
  - And lots of other information too
- HUMAINE will not solve all the issues involved
  - We should focus on those that can benefit most from the expertise and collaboration of its members
- Examining multiple modalities can prove extremely interesting