BRNO UNIVERSITY OF TECHNOLOGY
FACULTY OF ELECTRICAL ENGINEERING AND COMMUNICATION
DEPARTMENT OF RADIO ELECTRONICS

STRESS RECOGNITION FROM SPEECH SIGNAL
(Určování stresu z řečového signálu)

SHORT VERSION OF DOCTORAL THESIS

AUTHOR: MIROSLAV STANĚK
SUPERVISOR: Prof. MILAN SIGMUND

BRNO 2016



Keywords

Digital signal processing, speech signal processing, emotion recognition, psychological stress, formant, vowel polygons, glottal flow analysis, glottal pulse, Return-To-Opening phase ratio, Closing-To-Opening phase ratio, COG shift, classifiers, neural networks, Gaussian Mixture Models.

Klíčová slova

Zpracování digitálního signálu, zpracování řečového signálu, rozpoznání emocí, psychologický stres, formanty, samohláskové polygony, analýza hlasivkových pulsů, RTO poměr, CTO poměr, COG posun, klasifikátory, neuronové sítě, Gaussovské smíšené modely.

Storage place

Research department, FEEC BUT, Technická 3058/10, 616 00 Brno

Místo uložení práce

Vědecké oddělení, FEKT VUT v Brně, Technická 3058/10, 616 00 Brno

ACKNOWLEDGEMENT

The research described in this treatise was performed in laboratories of the SIX Research Center, the registration number CZ.1.05/2.1.00/03.0072, the operational program Research and Development for Innovation.

© Miroslav Staněk, 2016


Contents

1. Introduction
1.1. Emotions - The State of the art
1.2. Stress – The State of the art
2. Doctoral Thesis Objectives
3. Vowel polygons
3.1. Algorithms development
3.2. Research in speaker recognition
3.3. Psychological stress and vowel polygons
3.3.1. Mixed stress – experimental results
3.3.2. Efficiency of vowel polygons
3.3.3. Summarization
3.4. Closure of vowel polygons
4. Glottal pulses
4.1. Mining the glottal pulses
4.2. Automatic estimation of glottal pulses and further filtration
4.3. Psychological stress detection
4.3.1. Return-To-Opening phase ratio in time domain
4.3.2. Experimental results
4.3.3. Closure of using RTO in glottis for stress detection
4.3.4. Top-To-Bottom Closing-To-Opening phase ratio in amplitude domain
4.3.5. Experimental results
4.3.6. Discussion
4.3.7. Bottom-To-Top Closing-To-Opening phase ratio in amplitude domain
4.3.8. Discussion
5. COG shift
5.1. Maximum Likelihood Estimation
5.2. Statistical Testing
5.3. Discussion
6. Final Conclusion
References


1. Introduction

Language is a unique communication instrument of living beings and can be carried out in two basic ways: lingual and non-lingual. Speech can be defined as the lingual way of communication among humans, produced by the vocal tract; it is therefore called the vocalized form of human language.

Emotions can be defined as mentally or socially constructed processes containing subjective experiences of pleasure or displeasure [1], influenced by hormones such as dopamine, serotonin, oxytocin and cortisol. Emotions are often driven by motivation, positive or negative, and accompanied by physiological changes, e.g. changes in heartbeat rhythm, breathing and gesticulation. The importance of emotions in communication was already noted at the end of the 19th century [2], where Darwin argued that emotions evolved by natural selection and pointed out their occurrence in the animal world. The term emotion is taken from the French "émouvoir", which means "to stir up". Over hundreds of years, many publications have been devoted to emotions, mostly to their psychological, sociological and other behavioural roots. Basically, emotions are classified as reactions to internal or external events; physiological, behavioural, neural and verbal mechanisms are all included in these responses [3].

1.1. Emotions - The State of the art

A general review of emotion recognition in speech is presented in [4], covering the methods used, speech features, obtained results and a list of the observed databases. In emotion recognition, the crucial decision is the choice of the speech feature carrying the most distinctive information for each speaker state. Linear Prediction Coefficients (LPCs) and features derived from the LP residual are among the basic and most popular features for speech processing. A correlation between the LP residual and the Glottal Volume Velocity (GVV) signal has been observed, which means that valid information about the vocal tract producing the speech is carried in the LP residual signal. Higher-order correlations of the LP residual can be captured to some extent by auxiliary features, e.g. characteristics of the GVV waveform, the open and closed glottal phases, or the shape of the glottal pulse. Mel Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Prediction Coefficients (PLPCs) and formant features are the most widely used speech parameters for recognizing the actual state of a speaker.
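As a compact illustration of the LP analysis mentioned above, the following Python sketch (an illustrative implementation in NumPy, not the thesis's own code) estimates LPC coefficients by the autocorrelation method with the Levinson-Durbin recursion and obtains the LP residual by inverse filtering:

```python
import numpy as np

def lpc(frame, order):
    """LPC by the autocorrelation method (Levinson-Durbin recursion).
    Returns the prediction-error filter a (with a[0] = 1) and the error power."""
    n = len(frame)
    # autocorrelation lags r[0..order]
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lp_residual(frame, a):
    """Inverse-filter the frame with A(z) to obtain the LP residual."""
    return np.convolve(frame, a)[:len(frame)]
```

For a synthetic AR(2) signal s(n) = 0.9 s(n-1) - 0.5 s(n-2) + e(n), the recovered coefficients approach a = [1, -0.9, 0.5] and the residual approximates the white driving noise e(n), which is the sense in which the LP residual carries the excitation (glottal) information.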

Another observable speech parameter is the so-called prosodic feature, defined as a speech feature associated with larger units (syllables, words, phrases and sentences) and often considered suprasegmental information [4]. Acoustically, prosody is specified by duration patterns, intonation (fundamental frequency F0) and energy. Reference [4] also lists combinations of features used for emotion recognition, as well as the most common classifiers, divided into two categories: linear (e.g. the naive Bayes classifier, Fisher's linear discriminant analysis, the least-squares method, the linear support vector machine) and non-linear (e.g. Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), soft-margin support vector machines, neural networks and decision trees).
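Since F0 is the acoustic backbone of prosody, a minimal illustration of its estimation may help. The sketch below uses one common approach (autocorrelation peak picking, assumed here for illustration, not necessarily the thesis's method): the pitch period is taken as the strongest autocorrelation lag inside a plausible pitch range.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 as the autocorrelation peak within the lag range
    corresponding to [fmin, fmax] Hz. Returns F0 in Hz."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                    # remove DC offset
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag
```

For a 120 Hz sinusoid sampled at 16 kHz this returns roughly 120 Hz; real speech additionally needs voicing detection and contour smoothing on top of such a per-frame estimate.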

Another survey on speech recognition can be found in [5], containing a list of the features used, classification schemes, databases, etc. This paper also describes speech signal processing step by step, aiming at the best speech feature and the best results for the desired application. Generally, speech features can be divided into four main categories: continuous, qualitative, spectral, and TEO (Teager Energy Operator)-based features.
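The TEO-based category deserves a one-line illustration. The discrete Teager Energy Operator is ψ[x](n) = x²(n) − x(n−1)·x(n+1); the sketch below (illustrative NumPy code) implements it directly:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator:
    psi[x](n) = x(n)^2 - x(n-1)*x(n+1), defined for the inner samples."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For a pure discrete sinusoid A·sin(Ωn) the operator returns the constant A²·sin²(Ω), so it jointly tracks amplitude and frequency; this sensitivity to nonlinear airflow effects in the vocal tract is what makes TEO features attractive for stress detection.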

Further feature descriptions, applied processes and classifiers (e.g. SVM, HMM, GMM, decision trees) can be found in the literature, as can a survey of spectral features and possible normalisations; descriptions of some databases used as emotional speech corpora can be found in [6].

1.2. Stress – The State of the art

The word stress denotes tension, pressure and strain. Stress can be briefly defined as the state of an organism in which the subject is faced with extraordinary conditions, and it can be divided into two types:

Eustress – the subject's reaction to a positive load, stimulating the subject to better performance.

Distress – an overload which can cause disease or damage or, in the worst case, destroy the subject.

Generally, it can be said that negative emotions are amplified by the influence of stress. The relation between stress and emotions is thus evident; in particular, the expression of emotional states such as anger, guilt and joy is strongly affected by present stress. In the majority of published work, stress and emotions have been investigated separately, although their relation is obvious. Since the theory of physical stress is closely analogous to the theory of emotions, these two fields can be observed together.

A survey on stress recognition in the speech signal can be found in [7], where all previously used techniques are described.

The best results in classifying speech under stress were obtained by a nonlinear model of the phonation process fed by the spectral distribution of the glottal energy; Lech and He applied their recognition methods to a database containing 7 speakers under stress (3 female and 4 male). On the so-called SUSAS corpus, a recognition method using 39 MFCC-related speech parameters as the main feature, HMMs as the classifier and SVMs for training and adaptation is described in [8], where a recognition efficiency approaching 95% is also reported. Hidden Markov Models have also been used as classifiers elsewhere; other methods of stress classification, differences in suitability between speech features, and previously published results are described by Hansen. Another type of spectral analysis and a study of an empirical model, both used for stress recognition, are described in more detail in [9].

The suitability of the higher-frequency parts of spoken vowels for stress detection has also been described. Other observations have aimed at finding the most suitable indicator of the presence of stress in speech. Under the influence of stress, changes in fundamental frequency have been reported, as well as changes in pitch, vowel duration and formant positions [10]. A combination of Teager energy values and MFCCs has been used for stress classification, as has a combination of Teager energy and the slope of the glottal spectrum [11].

This subsection surveys previously used databases containing recorded speech under stress. The list of stress-oriented databases is inevitably shorter than that of emotional databases, because stress cannot be acted and all records must originate from real situations. The list of previously published speech-under-stress databases is shown in Tab. 1.


TABLE I DATABASES CONTAINING SPEECH UNDER STRESS [12]

Similar to emotion recognition systems, the market for software detecting stress in the speech signal is limited. Published algorithms are still in the testing phase. The usage of stress detection is similar to that of emotion recognition, since both phenomena have more or less the same basis and can influence each other.

Practical uses of stress detection include the following:

Lie detection – to observe and classify whether the subject is cheating or not telling the truth.

Mental analysis – to observe whether the subject may be dangerous (psychopath, criminal, etc.) and whether the subject is in control of the actual situation (fighter pilot, personal driver and so on).

Health care – stress, particularly in the form of post-traumatic stress disorder, can lead to heavy depression, psychological problems and suicidal thinking. For these reasons, stress is analysed not only in military health care.

A detailed description and understanding of emotions and stress present in speech can be found in publications such as the Handbook of Emotions [12].

| Title of database | Created in | Situation | Size | Language | Note |
|---|---|---|---|---|---|
| SUSC-0 (ground-to-air) | - | Military communication, fighter pilots in ascent | 11 males, record of length 15 minutes for each speaker | English | Non-native speakers |
| SUSC-0 (Aircraft crash) | - | Ejecting off the aircraft | 23 minutes | English | Poor quality |
| SUSC-0 (F-16 Engine out) | - | Successful engine-out landing | 15 minutes | English | - |
| SUSC-1 (Physical stress) | - | Fair physical load (running up and down three floors) | 10 males, 10 females, 2 sentences repeated 10 times on 10 different days | English | Phone quality |
| Tolkmitt and Scherer | 1986 | Answers to questions | 33 males and 27 females | German | Three vocal responses |
| SUSAS | 1998 | Various situations | 13 females, 19 males, 16,000 words | - | Real and simulated stress |
| Rahurkar and Hansen | 2002 | - | 6 soldiers | English | Five stress levels |
| Scherer et al. | 2002 | Normal and stress condition | 100 speakers, 2 tasks spoken by each speaker | English and German | Effects on the stress and load |
| McMahon et al. | 2003 | - | 29 speakers | English | - |
| Fernandez and Picard | 2003 | Responses on mathematical problems | 4 subjects | English | - |
| ATCOSIM | 2008 | Speech of air traffic control operator | 10 speakers, length 10 hours | English | Non-native |
| IEM-PSD | 2013 | Communication with pilots | > 700 utterances, > 7 hours | English and German | Various stress levels and situations |


2. Doctoral Thesis Objectives

The previous sections introduced existing concepts in the field of emotion recognition. Nowadays, not only in speech signal processing, known research lines are pursued further to achieve better accuracy and higher efficiency, or to find new, more effective ways to the same goal. This doctoral thesis outlines possible new methods for recognizing emotions/stress in speech.

Hence, the aims of the doctoral thesis were set as follows:

To create a suitable speech database.

The first idea of this doctoral thesis was to create an appropriate speech database of six different emotional states and alcohol intoxication for Czech female and male native speakers. Real conditions, however, are not suitable for creating such a large database, because it is very difficult and time-consuming to capture real emotions from a large number of speakers. For this reason, our experiments focus only on real stressed and normal states of speakers, recorded during appropriate situations at our department, because in general it is necessary to observe the impact of novel approaches on real, not acted, emotions.

To develop algorithms for obtaining desired speech features and analysing speech.

As mentioned above, the suitability of each speech feature differs with the purpose. For stress recognition, the speech features showing the highest differences depending on the spoken emotion have to be found within the created speaker database. Glottal pulses also have to be observed, because they carry much useful information about the actual state of the speaker; the main emphasis will therefore be placed on glottal pulse analysis. Simply put, the combination of the selected speech features and the behaviour of the glottal pulses is going to be observed.

To develop methods for stress recognition.

Emotions will be recognized by applying a suitable classifier to the previously obtained speech features. Both methods (analysis and recognition) should be robust and speaker-independent. The possibilities of speaker recognition therefore also have to be studied, and the developed speaker recognition algorithms will be further modified and applied to stress detection. The efficiency of the created speech processing system will be compared with other available products.

The presented research is mostly oriented towards the Czech language and vowels. Vowels are a special type of spoken phoneme characterized by a periodic signal form; they are generated by free air flow resonating in the relevant cavities. Although forty different spoken phonemes exist in the Czech language, vowels make up 41.377 % of Czech speech (see Tab. 2) and almost every word contains at least one vowel. For these reasons, research based on vowel properties can be performed.

TABLE II RELATIVE RATIO OF VOWELS OCCUPIED IN CZECH SPEECH [13]

| Phoneme | Occupation [%] | Phoneme | Occupation [%] |
|---|---|---|---|
| /e/ | 9.216 | /é/ | 1.182 |
| /o/ | 7.904 | /ú/ | 0.919 |
| /a/ | 6.189 | /ou/ | 0.659 |
| /i/ | 6.164 | /au/ | 0.030 |
| /í/ | 4.571 | /eu/ | 0.015 |
| /u/ | 2.369 | /ó/ | 0.011 |
| /á/ | 2.148 | | |
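As a quick consistency check (illustrative Python, values copied from Tab. 2), the per-phoneme shares do sum to the 41.377 % quoted in the text:

```python
# Vowel/diphthong shares of Czech speech from Tab. 2, in percent.
shares = {"e": 9.216, "o": 7.904, "a": 6.189, "i": 6.164, "í": 4.571,
          "u": 2.369, "á": 2.148, "é": 1.182, "ú": 0.919, "ou": 0.659,
          "au": 0.030, "eu": 0.015, "ó": 0.011}
print(round(sum(shares.values()), 3))  # 41.377
```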


3. Vowel polygons

The fundamental observations were aimed mostly at suitable speech feature extraction via the created algorithms and at speaker recognition, i.e. at finding the most uniform vowel polygons within the created speaker database representing the normal state of speakers. Basic observations were also made in emotion recognition. The performed research is described in the following subsections.

3.1. Algorithms development

The first step into the emotion recognition topic is the development of algorithms detecting vowels in fluent speech, because obtaining pure and true data is a very important prerequisite for further processing.

In the first experiments, vowel recognition was based only on the positions of the first two formants determining the spoken phoneme. Fig. 1 (left) shows example spectra of the vowel /u/ for 17 different speakers, illustrating that the spoken phoneme is determined by the positions of formants F1 and F2; the ordinal formant frequency intervals are marked by the grey shaded area. The software was created as a MATLAB-based console application with additional extension modules (mainly statistics modules).

Fig. 1 Examples of the /u/ vowel LPC spectrum (left) and /u/ vowel segments found in the word "osum" by the basic (red lines) and improved (green lines) algorithm (right).

Because of the relatively high ratio of parasitic (falsely detected) vowel segments, an improvement of the original algorithms was necessary. The peak error ratio was reached for the vowel /i/ and approached 40 % including falsely detected segments, because of the relatively large gap between formants F1 and F2 for /i/, which can collide with surrounding noise and parasitic environmental sounds. The original software therefore had to be modified. Statistical observation of the results of the first software version showed that almost all falsely detected vowel segments were located in the non-speech parts of the records and occurred in small groups of at most 2 segments. For this reason, a total-length protection of found vowel segments was implemented in the original software.

The mentioned improvement takes the form of a retroactive check of previously found vowel segments; it erases falsely detected segments and finds missed short vowel sections between two vowel parts. This retroactive checking decreased the error ratio by 38.8% on average.
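The retroactive check can be illustrated as a two-pass cleanup of a per-frame detection mask. The sketch below is a simplified Python rendition of the idea, not the thesis implementation, and the thresholds `min_len` and `max_gap` are hypothetical values for illustration: short gaps between two vowel parts are bridged first, then isolated short runs (which the statistics identified as parasitic) are erased.

```python
import numpy as np

def refine_segments(det, min_len=3, max_gap=2):
    """Retroactive check over a per-frame boolean detection mask:
    bridge short gaps between two detected vowel parts, then drop
    detected runs shorter than min_len frames (likely parasitic)."""
    det = np.asarray(det, dtype=bool).copy()

    def runs(mask, value):
        """Return (start, end) index pairs of consecutive runs of `value`."""
        out, start = [], None
        for i, v in enumerate(mask):
            if v == value and start is None:
                start = i
            elif v != value and start is not None:
                out.append((start, i))
                start = None
        if start is not None:
            out.append((start, len(mask)))
        return out

    # 1) fill short gaps that sit between two detected parts
    for s, e in runs(det, False):
        if 0 < s and e < len(det) and e - s <= max_gap:
            det[s:e] = True
    # 2) erase short isolated detected runs
    for s, e in runs(det, True):
        if e - s < min_len:
            det[s:e] = False
    return det
```

For example, a mask with a one-frame dropout inside a vowel and a lone one-frame false alarm comes back with the dropout filled and the false alarm removed.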

The impact of the improved algorithm (green boundaries) on the vowel segments found by the so-called basic algorithm (red boundaries) is illustrated in Fig. 1 (right), which also shows the signal form of the Czech word "osm" (spoken as "osum"). The obtained results and the cores of the algorithms were presented at the international conference on Telecommunications and Signal Processing [14].

Further, a new software system for vowel recognition in fluent speech was created for use in the speaker and emotion recognition field. This application contains a Graphical User Interface (GUI) and uses a two-level recognition tool for vowel detection. Thirteen MFCCs, 13 velocity (delta) and 13 acceleration (delta-delta) coefficients are used as classification speech features in both recognition levels. Reference values of the observed features were mined from the created speaker database containing 13 male and 12 female speakers. More details about the algorithms used in the developed software can be found in [15].

A statistics tool is implemented in the software system to show the parameters of the found vowel formants and save them into external data files for further processing. The main function of the developed software system is to present the obtained vowel data graphically in a progressive view, the so-called vowel polygons. A vowel polygon is defined by the apexes of the desired vowels, with coordinates set by the obtained average formant values. All possible vowel polygons (ten vowel triangles, four vowel tetragons and one pentagon) can be drawn in ten different formant planes, according to the total number of formants possibly existing in the spectra. A formant plane can be defined as the two-dimensional space generated by the first and second chosen formant values as the horizontal and vertical axes.
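The geometry behind vowel polygons is elementary. The following Python sketch (illustrative only; the formant values in the example are made up, not measured) computes a polygon's area with the shoelace formula and its centre of gravity as the vertex mean, the two quantities the later experiments work with:

```python
import numpy as np

def polygon_area(points):
    """Unsigned polygon area via the shoelace formula.
    points: (n, 2) array of vowel apexes, e.g. (F1, F2) in Hz."""
    p = np.asarray(points, dtype=float)
    x, y = p[:, 0], p[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def polygon_cog(points):
    """Centre of gravity of the apexes (simple vertex mean)."""
    return np.asarray(points, dtype=float).mean(axis=0)
```

For an illustrative /a/-/i/-/u/ triangle in the F1-F2 plane with apexes (730, 1090), (270, 2290) and (300, 870) Hz (example values only), the area is 308 600 Hz² and the COG is roughly (433.3, 1416.7) Hz.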

The main idea of creating vowel polygons is based on possible graphical expressions of different emotional speaker states, speaker identities and so on.

The developed software system was presented at the international conference on Telecommunications and Signal Processing 2014 and is described in more detail in the relevant proceedings [16].

3.2. Research in speaker recognition

The following subsection presents observations obtained in the field of speaker recognition. The first experiments were oriented towards finding the speech feature most suitable for speaker recognition. The set of observed speech features consisted of four formants, four formant bands, eleven LPCs, ten LSPs and thirteen MFCCs. For a created speaker database containing 12 Czech native speakers reading the same text, the set of observed speech features was mined by the previously described text-independent software. The suitability of each speech feature was assessed for each vowel separately, and its uniformity/exclusivity within the speaker database was classified by statistical methods including the F-ratio.

In the case of feature speaker variability, the received F-ratio values were examined by a set of tests, e.g. Bonferroni. The most suitable speech feature for speaker recognition turned out to be the sixth Linear Spectral Pair (LSP6), followed by LSP10. A detailed description of the performed tests, observations and achieved results can be found in [17]. Briefly, the observed features were mined from thousands of found and separated vowel segments and statistically ranked for each vowel separately; specifically, the relative deviations of the calculated F and t ratios were summed for each vowel and ranked from the highest to the lowest value.

Further, these partial results, specifically the calculated standard deviations, are summed over the occurring vowels, and then the mean value over all possible vowels, ∑avg, is calculated. These final results are ranked by the mean standard deviation value from the highest to the lowest, because the highest possible values express the lowest speaker uniformity of the given feature within the database, which is what is wanted for speaker recognition.
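The F-ratio at the heart of this ranking can be sketched as the variance of the per-speaker feature means divided by the mean within-speaker variance (a simplified Python illustration; the thesis's exact normalisation may differ):

```python
import numpy as np

def f_ratio(groups):
    """Fisher-style discriminability ratio for one speech feature:
    variance of per-speaker means over the average within-speaker
    variance. High values = the feature separates speakers well."""
    means = np.array([np.mean(g) for g in groups])
    within = np.mean([np.var(g) for g in groups])
    return np.var(means) / within
```

A feature whose values cluster tightly per speaker but far apart between speakers scores orders of magnitude higher than one whose speaker distributions overlap.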


Fig. 2 The illustration of normalised and real small vowel triangle in the F1-F2 plane (left) and achieved area differences for AIO vowel triangle (right).

Experiments in speaker recognition using properties of vowel polygons were also performed to find the most variable feature within the speaker database. The first steps of the research were oriented towards observing the normalisation vector generated between the reference and the real vowel triangle, specifically between their centres of gravity [18]. Figure 2 (left) illustrates the generation of the normalisation vector v between the reference (solid green line) and the real (dashed blue line) small vowel triangle in the F1-F2 plane.

In the performed experiments, the so-called big (AIU) and small (AEO) vowel triangles were used. The reference vowel triangles were generated from ordinal formant values of Czech vowels obtained by statistical measurements [13], and the real vowel triangles were generated for each speaker from his individual average formant values. For the created speaker database containing 12 male Czech speakers, the uniformity of the created normalisation vectors was observed in three different formant planes (F1-F2, F1-F3 and F2-F3). Speaker uniformity was observed in the length and angle criteria of the vector v, using the averaged absolute differences between each normalisation vector and the average value of the observed parameter over the speaker database. The biggest differences in the length (ΔdAVG) and angle (ΔαAVG) criteria were reached for both vowel triangles in the F2-F3 formant plane, but the best results were achieved by the small vowel triangle. Furthermore, the uniformity of the normalisation vector parameters was classified by the statistical method ANOVA. The results of the statistical testing are listed in Tab. 3. Generally, recognizing speakers by both parameters of the normalisation vector created for the small vowel triangle together is the best choice. All intermediate results of the performed experiments, a detailed description and the conclusions can be found in [18].
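The length and angle criteria of the normalisation vector v reduce to a few lines. Here is an illustrative Python sketch (the apex values in the example are made up, not the thesis's reference formants):

```python
import numpy as np

def normalisation_vector(ref_apexes, real_apexes):
    """Vector v from the reference triangle's centre of gravity to the
    real (speaker-specific) one; returns (length, angle in degrees)."""
    ref = np.asarray(ref_apexes, dtype=float).mean(axis=0)
    real = np.asarray(real_apexes, dtype=float).mean(axis=0)
    v = real - ref
    return (float(np.hypot(v[0], v[1])),
            float(np.degrees(np.arctan2(v[1], v[0]))))
```

A real triangle whose COG sits 3 units right and 4 units up from the reference COG yields length 5 and an angle of about 53.13 degrees.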

TABLE III RESULTS OF PARAMETER TESTING USING ANOVA

| Parameter | Big triangle F-ratio [-] | Big triangle p-value [-] | Small triangle F-ratio [-] | Small triangle p-value [-] |
|---|---|---|---|---|
| d | 6.232 | 9e-5 | 2.821 | 0.016 |
| α | 2.031 | 0.071 | 0.521 | 0.870 |
| d, α (Two-way ANOVA) | 3.089 | 6e-4 | 8.371 | 7e-10 |

The developed software system [16] was used to achieve new results in the field of speaker recognition, continuing the preliminary results [18]. The new observations were based on area differences between the real and reference vowel triangles [19]. For the created speaker database containing 13 male and 12 female Czech native speakers, reference values of the higher formants F4 and F5 were obtained. Average Czech vowel formant values were used as the apexes of the reference vowel triangles. First, the measurement focused on estimating the amount of vowel signal needed for calculating reliable formant parameters. Experimental results show that a data set of approximately 3000 values (i.e., 3000 vowel frames) satisfies statistical reliability.

Further, the research method based on relative differences between the reference and real vowel triangle areas was applied to another created speaker database consisting of 12 different male Czech native speakers. A total of 10 vowel triangles can be created, but only nine were used in the experiments, due to the line character (zero area) of the AEO vowel triangle in all planes containing the third formant, which leads to infinite area differences. Since five formants can occur in the spectrum, the experiments were performed for ten different formant planes. The consistency of the achieved area differences was classified by the statistical indicator called the coefficient of variation. Since it includes the standard deviation of the actual selection, the consistency of the area differences dS is established by the R ratio over all speakers. High suitability of a vowel triangle, specifically of its area difference from the reference pattern, is represented by the lowest possible value of the R ratio. Vowel triangles are ordered by the value of the coefficient of variation in Tab. 4, where each triangle is titled by the vowels creating its apexes and two numbers signifying the formant plane.

TABLE IV VOWEL POLYGONS RANKED BY ACHIEVED COEFFICIENT OF VARIATION

| Order | Triangle | R [-] | Order | Triangle | R [-] |
|---|---|---|---|---|---|
| 1. | AIO15 | 0.09 | … | … | … |
| 2. | AEU15 | 0.11 | 86. | AOU24 | 1.37 |
| 3. | AIU15 | 0.13 | 87. | AIU45 | 1.38 |
| 4. | EOU15 | 0.14 | 88. | AIO23 | 1.40 |
| 5. | EIO25 | 0.17 | 89. | AIU35 | 1.49 |
| … | … | … | 90. | AOU45 | 2.03 |
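The R ratio in Tab. 4 is the ordinary coefficient of variation of the per-speaker area differences; a minimal Python sketch (illustrative data in the test, not thesis measurements):

```python
import numpy as np

def r_ratio(area_diffs):
    """Coefficient of variation of relative area differences across
    speakers: standard deviation divided by mean. Low R means the
    triangle/plane combination is consistent across speakers."""
    d = np.asarray(area_diffs, dtype=float)
    return float(np.std(d) / np.mean(d))
```

Consistent area differences across speakers give a small R, which is why the combinations at the top of Tab. 4 are the best candidates for speaker recognition.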

The behaviour of the curves for the AIO vowel triangle over all formant planes is illustrated in Fig. 2 (right), where the current maximal area difference is drawn as a solid black line, the minimal area difference for the current formant plane as a black dotted line, and the average area difference dSavg as a grey dashed line. Even though the area differences are not high in comparison with other formant planes, the most suitable formant plane for speaker recognition using the AIO vowel triangle is F1-F5, because of the lowest R value (see Fig. 2 (right)).

The obtained results and the research method were presented at an international conference and are described in more detail in the relevant proceedings [19].

Further research oriented on speaker recognition is based on all possible vowel polygons (10 triangles, 4 tetragons and 1 pentagon) in ten different formant planes [20]. Specifically, the dispersion vector d is generated for each possible couple of real Centres Of Gravity (COGs) belonging to two different speakers, and its significant length values are observed. Figure 3 shows an example containing the COGs of EIU12 to express the idea of the method. For better readability, not all possible vectors d are shown.
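The Δdmin criterion described above and below can be sketched as follows. The COG coordinates are invented for illustration, and the helper names are ours, not from the thesis; this is our simplified reading of the Pdiff construction, not a definitive implementation.

```python
import math
from itertools import combinations

def cog(apexes):
    """Centre Of Gravity (centroid) of a vowel polygon's apexes."""
    xs, ys = zip(*apexes)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def delta_d_min(cogs):
    """Minimal difference between the lengths of all dispersion vectors d
    formed by couples of speakers' COGs (the upper triangle of a symmetric
    difference matrix); larger values mean better speaker separability."""
    lengths = [math.dist(a, b) for a, b in combinations(cogs, 2)]
    return min(abs(l1 - l2) for l1, l2 in combinations(lengths, 2))

# Hypothetical EIU12 apexes (F1, F2 in Hz) for four speakers
cogs = [cog(apexes) for apexes in [
    [(430, 1900), (300, 2200), (350, 900)],
    [(450, 1850), (320, 2250), (340, 950)],
    [(470, 1980), (290, 2150), (360, 880)],
    [(440, 1920), (310, 2300), (330, 1000)],
]]
print(round(delta_d_min(cogs), 2))
```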


Fig. 3 The illustration of possible vectors d created by centroids of the EIU12 vowel triangle.

The length of each vector d is calculated, and then the length differences between all possible vector couples are observed. For the total number of speakers N, the minimal vector length difference is obtained from the symmetric matrix Pdiff, in which all elements of the main diagonal are null. The desired parameter Δdmin, representing the suitability of the relevant vowel polygon for speaker recognition, is defined as the minimal value of the upper triangular part of the matrix Pdiff. For satisfactory speaker recognition, the value of Δdmin has to be maximized.

The presented method was applied to a created speaker database consisting of fluent text spoken by 13 male and 12 female speakers, and of separately spoken vowels by another 12 male speakers. The best minimal vector length difference is achieved by the EIO vowel triangle in the formant plane F2-F5, where the value of Δdmin approaches 19 Hz. The null Δdmin for the EIO13 triangle is caused by identical COG coordinates of two female speakers, see Fig. 4 (left). Evidently, the minimal values of Δdmin are achieved in formant planes containing the first formant F1, which together with F2 determines the spoken vowel. The distribution of COGs is illustrated in Fig. 4 (left) for the EIO13 vowel triangle; the identical centroids causing the null value of Δdmin are marked by a purple circle. Figure 4 (right) demonstrates the distribution of real COGs for the EIO25 vowel triangle, which reached the best value of Δdmin. In the following figures, the real positions of COGs for female speakers are circles filled with green colour, male fluent speech is represented by red filled circles, and separately spoken vowels by the other male speakers are labelled in blue.

The suitability of vowel polygons for speaker recognition can also be sorted and observed by another criterion called the dispersion coefficient δ, which represents the relative variability ratio only slightly influenced by extreme values. The best vowel polygon suitability for speaker recognition is indicated by the minimal δ value, reached by the EIO34 vowel triangle. Generally, the least speaker uniformity under the δ criterion is reached in the F3-F4 formant plane. The absolutely highest speaker uniformity has been reached in the F1-F2 formant plane, which confirms the statement that the vowel is determined by the position of the first two formants. The comparison of COGs distribution for the worst and the best vowel polygons is illustrated in Fig. 5.

From the presented results, the final statements can be laid down. The most suitable vowel polygons for speaker recognition are the EIOU25 tetragon and the IOU34 vowel triangle, which reached some of the lowest δ and some of the highest Δdmin values. Generally, the most suitable shape for speaker recognition is the vowel triangle, and the least speaker uniformity has been reached in the higher formant planes, while the absolutely unsuitable formant plane for speaker recognition is F1-F2, which determines especially the spoken vowel. All presented results of COGs distribution were published in [20].



Fig. 4 The distribution of COGs positions for EIO13 (left) and EIO25 (right) vowel triangles.

Fig. 5 The distribution of COGs positions for AEO12 (top) and EIO34 (bottom) vowel triangles.

3.3. Psychological stress and vowel polygons

All experiments oriented on observing the differences between normal speech and speech under the influence of psychological stress were carried out on a specially created database, whose first version was presented in [21]. Basically, more or less the same text is spoken by each speaker in a normal mood and in a situation inducing stress. To introduce observations of psychological stress influence in speech, a total of 18 Czech male speakers were recorded during the final exam and the defence of their master's thesis, where stress influence is assumed.

The primary observations were also based on the properties of vowel polygons, namely on whether the properties of vowel polygons generated from stressed speech tend to the same behaviour. In the primary experiments, this prediction was confirmed. Firstly, the differences in vowel polygon COG positions between the normal mood and stress influence were observed. This subsection is divided into three parts oriented on three different stress types: middle, high and mixed stress. As uncovered in the previous section, different stress types have to be analysed and observed separately to give significant proofs, because a higher stress level


leads to higher speaker uniformity in the created vectors. The core of the provided experiments is the cross-correlation of chosen couples of vowel polygon parameters, in order to reveal obvious relations between them. Nowadays, cross-correlation is commonly used in speech processing in the fields of emotion recognition and speaker and speech identification. The following results are colour-coded to distinguish the different couples of cross-correlated parameters; for simplicity, these colours also represent the experimental methods used in the following text. Green denotes the cross-correlation of the area difference value and the vector length (method 1), light blue the signum of the area difference and the vector length (method 2), dark blue the area difference value and the vector angle (method 3), and orange the signum of the area difference and the vector angle (method 4). The last two methods can be called mean methods, because they use the results previously reached by the two related methods: method 5 (purple) is defined as the mean of the green and dark blue progressions, and light brown represents method 6, the mean of the light blue and orange values.
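A minimal sketch of the cross-correlation used by methods 1-4 (the normalised Pearson coefficient); the per-speaker parameter values below are invented for illustration only.

```python
import math

def cross_correlation(x, y):
    """Normalised (Pearson) cross-correlation coefficient of two equally
    long parameter sequences, e.g. per-speaker area-difference values
    versus COG-shift vector lengths (method 1's parameter couple)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical per-speaker values for one vowel polygon
area_diff = [0.12, 0.30, 0.22, 0.41, 0.18]   # normal-vs-stress area difference
vec_length = [15.0, 33.0, 26.0, 44.0, 20.0]  # COG shift vector length (Hz)
print(round(cross_correlation(area_diff, vec_length), 3))
```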

3.3.1. Mixed stress – experimental results

The previously uncovered results clearly lead to higher cross-correlation values for the middle stress level. The cross-correlation results received for the middle stress level are also more consistent, or uniform, than for higher stress influence, which slightly disproves the original idea that cross-correlation values increase with higher stress level. The results presented in this subsection are obtained by mixing the middle and high stress levels. The boxplot of the experimentally achieved cross-correlation coefficient values over all possible shapes is illustrated in Fig. 6. Evidently, the worst uniformity of cross-correlation values is received for the mixed stress states.

The highest uniformity of the received results under the plane criterion is obtained for both mean methods (purple and light brown), leading to Fig. 7 (left), where a huge difference between the mean and the other methods is evident; moreover, R reaches higher and less suitable values for the other methods (green, light blue, dark blue, orange), similarly to R over shapes, which indicates a high possible failure of methods 1-4 if they are chosen for stress detection. Given these statements and the previously introduced findings, a clear separation of the mean methods from the others is expected in the plane expression of R under both the plane and shape criteria.

Fig. 6 Cross-correlation values over all vowel shapes for all used methods applied on mixed (middle + high) stress influence.

Page 15: VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ - vutbr.cz

- 12 -

As the final presentation of the partial results (see Fig. 7 (left)), Fig. 7 (right) was created. It shows the mentioned huge uniformity difference between the mean and the other methods for the plane as well as the shape criterion. Generally, for mixed stress, R in both criteria reaches higher values than in the previous cases, but the mean methods (purple and light brown) are more uniform, which leads to their absolute separation from the other methods (see Fig. 7 (right)).

Fig. 7 Right: Reached R in the formant planes criterion applied on mixed stressed speech (logarithmic y-axis). Left: Plane figuring out the reached R for mixed stress influence. Both mean methods (purple and light brown) are absolutely separated from the others due to high uniformity of the results.

The recently introduced results can be partially summarized by the following observations. The direction and length of the vector generated by the normal and stressed COGs become more equal across the speaker database with rising stress level. However, the movement of the observed parameter couples is more or less individual for each speaker at higher stress levels, whereas with decreasing stress influence the cross-correlation of the area difference and the vector length or direction is higher across speakers. Generally, the most consistent and significant differences between normal speech and stress influence are reached by both mean methods (purple and light brown) under the shape as well as the plane criterion. The advantage of these two mean methods increases significantly with higher stress level.

3.3.2. Efficiency of vowel polygons

In the following sections, the suitability for stress detection will be observed for each possible vowel polygon separately, because only insignificant results were achieved when the shape or plane criterion was considered alone. The suitability, i.e. the most significant and consistent differences, is classified by the current efficiency, which is based on the results presented in the previous section. Generally, the efficiency of an observed parameter x is defined by the equation

E = x̄² / σx² ,  (1)

which can be modified for our case of usage into the following equation for the efficiency coefficient Ec

Ec = CCV² / (Rshape · Rplane) ,  (2)

where CCV is the previously calculated cross-correlation value of the selected couple of observed parameters for the current vowel polygon, Rplane is the variation coefficient of the relevant formant plane, and Rshape that of the shape. Briefly, the value of the efficiency coefficient indicates the strength of the observed couple of parameters for the actual vowel polygon, referred to the statistical values over all relevant planes and shapes. The strength of the observed vowel polygon is directly proportional to the Ec value: with increasing Ec, the impact of the current vowel polygon over other similar and relevant ones rises.
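The efficiency coefficient is straightforward to evaluate; the sketch below uses invented CCV and R values purely for illustration.

```python
def efficiency_coefficient(ccv, r_plane, r_shape):
    """Ec per equation (2): the squared cross-correlation value of the
    selected parameter couple, referred to the variation coefficients of
    the polygon's formant plane and shape."""
    return ccv ** 2 / (r_plane * r_shape)

# Hypothetical values for one vowel polygon and one method
print(round(efficiency_coefficient(ccv=0.82, r_plane=0.35, r_shape=0.48), 2))
```

Note that a strong couple (high CCV) in a plane and shape with low variation coefficients yields Ec above 1, which matches the later statement that only methods whose Ec exceeds 1 remain usable.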



3.3.3. Summarization

This part presents the experimentally achieved values of the efficiency coefficient Ec for each vowel polygon, for 3 different stress groups and 6 different observation methods.

As presumed, a significant general decrease of all Ec values is evident, leading to the statement that the only method applicable to mixed stress detection is method 5, which uses the mean of the area difference value - vector length and area difference value - vector angle cross-correlations. Despite the adverse decrease of Ec, the fifth method reaches values higher than some of the others achieve for partial stress classification.

Similarly to the previous cases, the best results are achieved by methods 1, 5 and 6, with the exception that only method 5 remains usable, because some of its Ec values exceed 1. The generally best shape choices for mixed stress classification are the AIO and EIO vowel triangles and the AEIO vowel tetragon, as well as the formant planes F1F5, F2F5 and F2F3. The absolutely best result is achieved by the AIO15 vowel triangle. Based on the results listed in this section, the usage of vowel triangles and of formant planes containing the formant F5 can finally be evaluated as the best choice for stress detection.

3.4. Closure of vowel polygons

This section presented differences within speakers as well as differences within vowel polygon parameters and their mutual correlation between normal speech and stress influence. The ExamStress database [21] was used in the introduced experiments and was further divided into two groups: middle and high stress level. These two groups were finally merged to create the mixed stress group. The relationships between the observed parameter couples were examined through the cross-correlation coefficient and the statistical parameter called the variation coefficient (R), to investigate the suitability of the reached results over formant planes and vowel shapes. These observations established that the mean methods (methods 5 and 6) do not reach the highest cross-correlation values, but are the most suitable over all vowel shapes and formant planes. This holds for all stress groups, and compared to the other methods their impact is biggest for mixed stress.

It was proved that some vowel polygons are not suitable for stress detection due to their low cross-correlation values and low uniformity under the shape and formant plane criteria. It was also proved that the lower formant planes carry primarily information about the spoken phoneme, while the information about the speaker's state and identity carried by the higher formants is attenuated. The best vowel shapes for stress detection prove to be the AIO, AIU and AEU vowel triangles and the AEIU, AEIO and EIOU vowel tetragons. The best formant planes for stress detection are clearly F1F5, F2F5 and F3F5. In conclusion, stress can be uncovered by using the mentioned vowel shapes and formant planes (leading to a number of vowel polygons) with the fifth experimental method. The obtained results can further be applied in practice in call centres, customer services, hospitals, security facilities, etc.

It was also found that not all vowel polygons are suitable for speaker recognition, but the vowel polygons most uniform across the speaker database represent the normal state of a speaker best.



4. Glottal pulses

The second part of this doctoral thesis describes novel methods based on the analysis of the glottal flow, more precisely of glottal pulses, for emotion recognition. Given the number of known emotions, the research on glottal pulse analysis is oriented on psychological stress detection. As already stated, few analyses of psychological stress in speech have been published, so this doctoral thesis aims to be innovative in the methods used as well as in the use of a database containing real psychological stress.

The usage of glottal pulse analysis can be found e.g. in biomedical applications. Recently, the detection of Parkinson's disease from dysphonia measurements has been described as a promising step towards a non-invasive diagnostic method. Glottal pulses can also be utilized for the analysis of vocal disorders and alcohol intoxication, as well as for the detection of Alzheimer's disease. A general survey of glottal source processing and its applications has been written by Drugman et al. [22].

4.1. Mining the glottal pulses

Although the extraction of the real glottal course from the speech signal has been researched for years, the best results to date are based only on the estimation of the glottal flow. The glottal flow can be characterized by a set of glottal pulses repeated with the fundamental period T. An example of a glottal flow is illustrated in Fig. 8. Briefly, the whole glottal pulse is composed of two phases: the primary opening phase To and the return (closing) phase Tr. The space between particular pulses is called the closed phase Tco, during which the glottis is closed and no air flows through the gap. A detailed description of each part of the glottal flow, including the physical changes and processes, can be found in [23].

Fig. 8 Illustration of glottal pulses series and its description.

The most frequently used methods for the estimation of the glottal flow are Direct Inverse Filtering (DIF) and Iterative Adaptive Inverse Filtering (IAIF). The shape and the correctness of the extracted glottal pulses do not depend on the duration of the spoken vowel or, in general, of the sonorous phoneme. However, the most significant differences between the extracted pulses depend on the analysis parameters set in the Aparat software.

All speech records are sampled at 8 kHz; if not, they are resampled before analysis. For this reason, the total number of formants that can occur in the spectrum is four, with possibly one extra formant. This is confirmed by the extracted glottal pulses, whose raw shapes are the most uniform when four, optionally five, formants are assumed. A total of five formants will be used in the further experiments. The differences of the extracted glottal pulses depending on the length of the spoken phoneme and on correctly set estimation algorithms have been presented, but the differences within the chosen part of the analysed phoneme have not been described. For better observation of the differences between



glottal pulses at the central part and at the beginning of the analysed vowel, the obtained pulses are normalised in the time and amplitude domain to the value 1.

4.2. Automatic estimation of glottal pulses and further filtration

The automatic glottal pulse estimation is based (as in the previous cases) on the modified Aparat software [23], where the estimation parameters are set to optimal values: the number of formants is set to 5 (according to the sampling frequency of the records), the lip radiation is set to 0.99 (the middle value), and all cut-off filters are disabled to obtain the rawest possible sound for analysis, because some theories allege that infrasonic frequencies are related to the glottal flow. For better illustration, the flowchart of the described process is shown in Fig. 9.

Fig. 9 The flow chart of glottal pulses automatic estimation and processing algorithm.

The input speech signal is sampled at 8 kHz and quantized to 16 bits; if not, it is resampled and re-quantized to the desired values. The input signal format is mono; if the record is captured in stereo, the right and left channels are mixed into mono. Practical observations showed no difference between glottal pulses estimated for each channel separately and for the mixed channels. If the format of the input speech signal is correct, the signal is segmented into parts of 300 ms duration with 50% segment overlap. This segment duration was found experimentally: it seems to provide the best results in glottal pulse estimation (correct pitch calculation on the selected segment), a high number of obtained pulses, and a short time for their estimation and signal analysis.
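The segmentation step can be sketched as follows. This is a minimal version under the stated assumptions; only the 8 kHz sampling rate, the 300 ms segment length and the 50% overlap come from the text, and the function name is ours.

```python
def segment(signal, fs=8000, seg_ms=300, overlap=0.5):
    """Split a speech signal into 300 ms segments with 50% overlap,
    as done before glottal pulse estimation."""
    seg_len = int(fs * seg_ms / 1000)      # 2400 samples at 8 kHz
    hop = int(seg_len * (1 - overlap))     # 1200-sample hop (50% overlap)
    return [signal[i:i + seg_len]
            for i in range(0, len(signal) - seg_len + 1, hop)]

# One second of (dummy) 8 kHz speech yields 5 full overlapping segments
dummy = [0.0] * 8000
print(len(segment(dummy)))
```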

Then, glottal pulses are estimated for each sonorous part of the speech signal (independent of the spoken phoneme) on each individual segment by the chosen method (DIF/IAIF), and after processing all segments, each glottal pulse is normalised to the maximum value 1 in the amplitude domain as well as in the time domain, in order to obtain the most uniform dataset. An example of all normalised extracted glottal pulses before the whole input speech signal is processed is illustrated in Fig. 10 (left). The obtained dataset obviously contains significant parasitic pulses and estimation mistakes, which have to be removed before further processing. For this reason, a filtration algorithm was designed. Each pulse GP is interpolated by N points, analysed, and removed from the dataset if at least one of the following conditions is fulfilled:

Page 19: VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ - vutbr.cz

- 16 -

filter GP if:  GP(1) > 0.1 , or  GP(N) > 0.1 , or  the number of local maxima of GP > 1 .  (3)

These conditions mean that a glottal pulse is filtered out if it has more than one local peak, or a considerable offset (more than 10%) at its beginning or end.
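A possible implementation of this filtration rule, assuming pulses already normalised to peak amplitude 1 and interpolated to N points; the example pulses below are invented for illustration.

```python
def local_maxima_count(gp):
    """Number of strict local peaks in an interpolated pulse GP."""
    return sum(1 for i in range(1, len(gp) - 1)
               if gp[i - 1] < gp[i] > gp[i + 1])

def keep_pulse(gp):
    """Condition (3): a normalised pulse GP (N points, peak amplitude 1)
    is discarded when it has more than one local peak, or an offset
    above 0.1 at its first or last sample."""
    return (gp[0] <= 0.1 and gp[-1] <= 0.1
            and local_maxima_count(gp) <= 1)

good = [0.0, 0.3, 0.7, 1.0, 0.6, 0.2, 0.05]   # single peak, near-zero ends
bad = [0.0, 0.8, 0.4, 1.0, 0.5, 0.0, 0.0]     # two local peaks -> parasitic
print(keep_pulse(good), keep_pulse(bad))
```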

Fig. 10 Normalised glottal pulses before (left) and after (right) filtration estimated by the IAIF algorithm.

The glottal pulses after filtration are shown in Fig. 10 (right). Evidently, the set conditions, and the whole filtration process in summary, are very successful at removing parasitic or badly estimated glottal pulses, which are generated mostly by bad pitch detection or by noise of sonorous character. For this reason, it is necessary to choose an effective pitch detection method for further glottal pulse estimation. Without any doubt, the normalisation and filtration lead to the most uniform set of estimated glottal pulses, perfectly suitable for further processing, analysis and observation.

4.3. Psychological stress detection

4.3.1. Return-To-Opening phase ratio in time domain

The provided experiments were based on the so-called Return-To-Opening phase ratio, more precisely on a vector of three parameters of the normalised glottal pulses. All results were published previously in a research journal [24]. This subsection describes the method used for extracting the chosen parameters, or rather their ratios. Basically, the method exploits only glottal pulses, composed of the primary opening phase To and the return phase Tr (see Fig. 8). Each extracted glottal pulse is normalised to the value 1 in the time and amplitude domain, which yields dimensionally uniform glottal pulses keeping the original shape. In these two-dimensionally normalised pulses, the primary opening and return phases are processed separately. Both phases are transferred into a relative time scale reaching the zero level at the position of the current pulse's peak and the maximum (100%) at the end, or respectively the start, of the current phase.

The extraction method is based on observing both phases only for a selected relative division n. Figure 11 shows the main idea of the n-percentage glottal pulse processing of the particular primary opening phase To(n) and return phase Tr(n), leading to the following equation


Page 20: VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ - vutbr.cz

- 17 -

RTO[n] = Tr(n) / To(n) ,  (4)

where RTO is the Return-To-Opening phase ratio of the current n-percentage interval, which is always symmetric for both phases in the relative scale.

The area, skewness and kurtosis (the third and fourth standardized Pearson's moments) are further calculated for both n-percentage intervals. Finally, for each extracted parameter value, the Return-To-Opening phase ratio is calculated to indicate how much one n-percentage interval dominates in the current parameter.
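The feature extraction could be sketched as below. The phase samples are invented, and taking the first n percent of each phase starting from the peak is our simplified reading of Fig. 11, not the thesis' exact implementation.

```python
import statistics as st

def std_moment(x, k):
    """k-th standardized Pearson's moment (k=3: skewness, k=4: kurtosis)."""
    mu, sigma = st.mean(x), st.pstdev(x)
    return st.mean([((v - mu) / sigma) ** k for v in x])

def rto_features(opening, ret, n):
    """Return-To-Opening ratios of area, skewness and kurtosis computed on
    the first n percent of each phase, both phases starting at the pulse
    peak in the relative scale (cf. Fig. 11)."""
    to = opening[:max(3, int(len(opening) * n / 100))]
    tr = ret[:max(3, int(len(ret) * n / 100))]
    return (sum(tr) / sum(to),                      # area ratio
            std_moment(tr, 3) / std_moment(to, 3),  # skewness ratio
            std_moment(tr, 4) / std_moment(to, 4))  # kurtosis ratio

# Hypothetical normalised phases (amplitude 1 at the peak, index 0)
opening = [1.0, 0.9, 0.7, 0.4, 0.2, 0.1, 0.05, 0.02, 0.01, 0.0]
ret = [1.0, 0.6, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01, 0.0, 0.0]
print(tuple(round(v, 2) for v in rto_features(opening, ret, 50)))
```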

Fig. 11 Division of two-dimensionally normalised glottal pulse into n-percentage particular intervals of opening and return phase.

Obviously, each part of the pulse curve (thick line in Fig. 11) corresponding to the n-percentage division is characterized by three different Return-To-Opening phase ratios (kurtosis, skewness and area). These feature values are further used for processing and observing differences between normal and stressed speech. The research presented in the following parts of this thesis was carried out on a created database containing speech under real psychological stress as well as normal speech. The first part of the database is formed by 18 different Czech-speaking male speakers from the ExamStress database [21], previously used for observing vowel polygon differences depending on the speaker's state [25]. The second part of the database is formed by another 6 Czech male speakers recorded with a PCB 378B02 microphone suitable for infrasonic applications and a National Instruments USB-9234 sound interface. All Czech speakers in both parts of the database were recorded during the thesis defence at the final exam to capture real psychological stress influence, and a few days later each subject repeated the same text in more comfortable conditions to record the speaker's normal mood.

For normal and stressed speech, a group of six patterns was set as the reference values of the observed features for each state of the speaker. Each group contains the mean Return-To-Opening phase feature ratios of five Czech vowels (/a/, /e/, /i/, /o/, /u/) for approx. four suitably grouped Czech speakers. The presented method can thus be evaluated as speaker-independent.



4.3.2. Experimental results

This part of the thesis describes the results achieved in the realized experiments. In the fluent speech of the second part of the database (six speakers), the Czech vowels were automatically detected and separated for further processing by a previously developed software tool [14]. The separated vowels were then manually divided into beginning and centre vowel parts, from which glottal pulses were estimated by the DIF and IAIF methods. Approximately 1200 glottal pulses obtained at the vowel beginning and approx. 1600 glottal pulses estimated at the vowel centre were used as the testing sequence for each mood; thus approximately 2400 and 3200 glottal pulses, respectively, were used for testing in total.

For naturally dynamic speech, the efficiency of emotional state (stress and normal mood) recognition was evaluated for two glottal flow estimation methods (DIF and IAIF), in the beginning and centre vowel parts, for 20 different n-percentage intervals (step 5%). The same efficiency tests were also applied to speech normalised every 10 ms, allowing observation of the impact of dynamic range limitation on glottal pulse uniformity. The efficiency, more precisely the uniformity of glottal pulses under normal and stress conditions, is tested by six different classifiers embedded in a standard MATLAB installation, appropriately trained, validated and applied.

In the following figures, the recognition results for the various glottal pulses are illustrated as:
black solid line - DIF method, vowels' beginning
black dashed line - DIF method, vowels' beginning, normalised sound
green solid line - IAIF method, vowels' beginning
green dashed line - IAIF method, vowels' beginning, normalised sound
red solid line - DIF method, vowels' centre part
red dashed line - DIF method, vowels' centre part, normalised sound
blue solid line - IAIF method, vowels' centre part
blue dashed line - IAIF method, vowels' centre part, normalised sound

The k-Nearest Neighbour method was chosen as the first classifier. The best results are reached for the 5% observed interval of the glottal pulses, where the most significant differences between normal and stressed speech occur in the kurtosis, skewness and area ratios. An efficiency of almost 95% is reached by the DIF method using vowels' beginning with normalised sound. Further, an accuracy over 90% is reached by the DIF estimation method using vowels' beginning, and by the IAIF method on normalised sound vowels' beginning for the 5% and 10% selected intervals. The efficiency of stress detection for a chosen classifier and the actual n-percentage interval and method is calculated as follows

Efficiency(n) = [(Ncdn + Ncds) / (Nn + Ns)] · 100 ,  (5)

where Nn is the total number of used normal-state glottal pulses, Ns the total number of glottal pulses under psychological stress, Ncdn the number of correctly detected normal-mood glottal pulses, and Ncds the number of correctly classified stressed glottal pulses.
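Equation (5) in code, with invented pulse counts purely for illustration.

```python
def efficiency(n_cdn, n_cds, n_n, n_s):
    """Equation (5): percentage of correctly classified normal (Ncdn) and
    stressed (Ncds) glottal pulses out of all tested pulses (Nn + Ns)."""
    return (n_cdn + n_cds) / (n_n + n_s) * 100

# e.g. 1500 of 1600 normal and 1450 of 1600 stressed pulses classified correctly
print(round(efficiency(1500, 1450, 1600, 1600), 1))
```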

As in the previous case, the best results are also achieved for the 5% observed interval of the glottal pulses, where the most significant differences of the pursued features exist between normal speech and speech under psychological stress. The highest accuracy of stress detection is reached by the DIF estimation method on normalised sound vowels' centre and approaches 95%. The IAIF method on normalised sound vowels' beginning also reaches one of the best accuracies, approx. 93%, for the 5% and 10% observed glottal pulse intervals. In general,


achieved efficiency values of stress detection reach satisfactory results, but significantly lower than in the case of using a Support Vector Machine as a classifier. Efficiency reached by the SVM classifier can be regarded as generally more satisfactory, with more n-percentage glottal pulse intervals usable for correct psychological stress detection. The best results, approaching 95% accuracy, are received by the IAIF method used on normalised vowels’ beginning for the 5% and 75-95% selected intervals. The most significant differences between the observed feature ratios of normal and stressed speech can be found in the 5% (DIF used on normalised sound vowels’ centre and IAIF applied on vowels’ centre) and 65% (IAIF applied on vowels’ beginning) selected intervals, where the accuracy also approaches 95%.

As the fourth classifier, GMM has been used. The results reached by this classifier are shown in Fig. 12. High efficiency values of psychological stress detection are apparent over all possible n-percentage intervals of glottal pulses. On the other hand, the generally lowest accuracy values are also achieved by GMM, specifically for the 10% (DIF estimation method using vowels’ centre) and 35% (DIF method applied on vowels’ beginning) selected intervals, where the accuracy approaches only 10% in stress detection. Each of the used methods and their types reaches an efficiency of almost 95%, which indicates the highest uniformity of the observed features varying with the actual state of the speaker, and points to GMM as a suitable classifier for stress detection.

The Probabilistic Neural Network (PNN) has been used as the fifth classifier. The highest uniformity of the observed features varying with the speaker’s state can be found for the 5% selected interval (DIF and IAIF applied on normalised sound vowels’ beginning and centre) and the higher intervals 75-100% used for IAIF estimation on normalised sound. The absolutely highest accuracy (almost 94%) in stress detection is achieved by the DIF estimation method used on normalised sound vowels’ beginning.

Fig. 12 Accuracy of stress detection depending on the selected n-percentage interval, using Gaussian Mixture Models as a classifier.

The Feed-Forward Neural Network has been used as the last classifier for psychological stress detection in speech. As in the previous cases, the observed features (Return-To-Opening phase ratios) depending on the emotional state can be precisely distinguished at the 5% selected interval for the DIF estimation method using normally recorded and normalised sound vowels’ beginning, as well as for the IAIF estimation method using normalised sound vowels’ beginning and centre. All of these method types reach an accuracy of approx. 95% at the 5% used interval. The IAIF estimation method applied on vowels’ beginning approaches 93% accuracy at the 35% used interval. The 90% accuracy boundary has also been exceeded by the IAIF estimation method using normalised sound vowels’ beginning at the 35% and 65% used intervals. By comparison of the received accuracy graphs, it can be said that the suitability of the applied classifiers is not as



important as the choice of the used n-percentage interval in the case of stress recognition by the introduced Return-To-Opening phase ratios. The final ranking of the used types of stress detection parameters is listed in Tab. 5; due to the total number of 960 used types, only the first and last five positions are listed. All types are written in abbreviated form, e.g. GMM_5_D_C_N represents the GMM classifier applied on the 5% selected interval, the DIF estimation method, the vowels’ centre and normalised sound.

TABLE V FINAL SORTING OF USED TYPES

#    Type            Eff. [%]
1    SVM_5_I_B_N     95.7
2    GMM_5_D_C_N     95.5
3    GMM_5_I_B_N     95.5
4    GMM_5_I_C       95.4
5    GMM_65_I_B_N    95.4
…
956  KNN_5_I_C       11.8
957  GMM_35_D_B       8.5
958  PNN_70_I_C_N     8.2
959  PNN_80_I_C_N     8.2
960  KNN_5_D_C_N      8.2

4.3.3. Conclusions on using RTOs for stress detection

In this part of the thesis, differences between normal speech and speech under psychological stress were observed in the introduced features. By dividing the two-dimensionally normalised glottal pulses into the primary opening and the return phase, an individual feature vector is created for each chosen n-percentage interval. Each feature vector contains the Return-To-Opening phase ratios of kurtosis, skewness and area for the current n-percentage glottal pulse interval.
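A minimal sketch of the RTO feature extraction described above, assuming an already two-dimensionally normalised pulse and using `scipy.stats` for kurtosis and skewness; the n-percentage interval selection step is omitted and all names are illustrative:

```python
import numpy as np
from scipy.stats import kurtosis, skew
from scipy.integrate import trapezoid

def rto_features(pulse):
    """Return-To-Opening ratios of kurtosis, skewness and area for one
    normalised glottal pulse, split at its peak into the primary
    opening phase and the return phase."""
    peak = int(np.argmax(pulse))
    opening = pulse[:peak + 1]   # primary opening phase (start to peak)
    ret = pulse[peak:]           # return phase (peak to end)
    return np.array([
        kurtosis(ret) / kurtosis(opening),    # RTO of kurtosis
        skew(ret) / skew(opening),            # RTO of skewness
        trapezoid(ret) / trapezoid(opening),  # RTO of area
    ])
```

Collecting this three-element vector for each chosen n-percentage interval yields the feature vectors passed to the classifiers.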

Six different classifiers have been appropriately trained to reach the best accuracy in stress detection on the created training and experimental speech database. From the achieved results it can be said that the most suitable classifier for psychological stress detection in speech is GMM, whose accuracy exceeds 90% most frequently; however, GMM also reaches the poorest detection efficiency in some cases and, in general, the GMM efficiency fluctuates the most over all possible used glottal pulse intervals. Other appropriate classifiers are SVM and FFNN, which also reach high accuracy values, approx. 95% in some cases.

The main observations of this work can be found in the results received in stress detection. The highest differences between normal and stressed speech can be found in the 5% used interval, which yields the highest uniformity of the observed features varying with the speaker’s state. Due to a peak in all presented graphs, the 65% used interval can also be used for effective stress detection in speech. Generally, the presented approach corresponds with the similar method detecting stress by means of glottal pulse distribution; however, the presented experiments show a higher accuracy (95%) than the accuracy of 88% published in [26].

Finally, the achieved results can be summarized in a few statements as follows:

- normalised sound leads to better stress detection,
- stress influence is better detectable at vowels’ beginning,
- IAIF estimates more significant differences varying with emotional state,
- accuracy of 90% and higher can be approached by using a suitable classifier.

Obviously, the combination of automatic vowel detection, e.g. [14], and the findings presented here can lead to the development of new systems recognizing psychological stress in speech, which can negatively influence human behaviour. These systems can be practically applied in



many fields, e.g. machine control, medical applications, etc. Further, it is necessary to expand the real psychological stress database to verify the presented results.

4.3.4. Top-To-Bottom Closing-To-Opening phase ratio in amplitude domain

This section presents and describes all introductory steps and options necessary for the observation of differences between stressed and normal speech. The applied methods and glottal pulse estimation are based on the software Aparat [23].

The glottal pulses were estimated from the speech signal by two common algorithms, the Direct Inverse Filtration (DIF) and the Iterative and Adaptive Inverse Filtering (IAIF), both applied on originally captured and normalised records at the vowels’ beginning and centre parts. All methods are described in the subsection related to RTOs.

Two different databases were used in the presented experiments. Firstly, 12 male Czech native speakers were randomly selected from the previously created database ExamStress [21], where the same speech is recorded during the final exams (stress influence) and a few days later (normal state) for each speaker. For this reason, the differences between the normal state and real psychological stress can be observed. Secondly, the SUSAS [27] database was used for validating the psychological stress detection efficiency on the English language and on badly captured records containing a high noise level, voice distortion and signal clipping. Specifically, the part containing real psychological stress performed by two Apache pilots almost out of fuel was used in the presented experiments.

Used glottal pulse features

Differences between stressed and normal speech were observed and further classified by a vector of three glottal pulse features. In the first step, each mined glottal pulse is amplitude- and length-normalised to the maximum value 1 in order to bring the global pulse sizes into accord. Then, each normalised glottal pulse is divided into a series of pulse segments from the peak to the n-percentage amplitude level, which is shifted step by step along the amplitude axis. The 0% level is at the top of the glottal pulse, and the 100% value lies at its bottom (see Fig. 13). Due to this fact, the used glottis division is called Top-To-Bottom (TTB).

Fig. 13 An example of glottal pulse n-percentage division, where the selected closing phase is illustrated in dark grey and the opening phase in light grey. The thick line marks the chosen curve part of the glottal pulse.

The selected n-percentage pulse segment is further divided in the time domain into the opening and closing phase. The glottis parts thus obtained are further analyzed; specifically, for each part its kurtosis α, skewness β and area γ are calculated. To capture the dominance of each obtained parameter, the Closing-To-Opening phase ratio (CTO) is calculated, which means the parameter value of the glottal closing phase is divided by its opening phase equivalent, which can be written as



CTOp(n) = Tc_p(n) / To_p(n) , (6)

where n is the actual n-percentage level, p is one of the analyzed parameters (i.e. skewness, kurtosis or area), Tc_p is the current closing phase value and To_p is the current opening phase value.
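The TTB segmentation and the CTO ratios of Eq. (6) can be sketched as follows; this is a simplified reading in which the segment keeps all samples above the n-percent amplitude level, split at the pulse peak, and all names are illustrative:

```python
import numpy as np
from scipy.stats import kurtosis, skew
from scipy.integrate import trapezoid

def ttb_cto(pulse, n_percent):
    """CTO ratios (Eq. 6) for the Top-To-Bottom segment reaching from
    the peak down to the n-percent amplitude level (0 % = top,
    100 % = bottom). `pulse` is assumed amplitude-normalised to 1."""
    peak = int(np.argmax(pulse))
    level = 1.0 - n_percent / 100.0
    above = pulse >= level
    opening = pulse[:peak + 1][above[:peak + 1]]  # rising side of segment
    closing = pulse[peak:][above[peak:]]          # falling side of segment
    return np.array([
        kurtosis(closing) / kurtosis(opening),
        skew(closing) / skew(opening),
        trapezoid(closing) / trapezoid(opening),
    ])
```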

4.3.5. Experimental results

This section describes the realized experiments and achieved results. As an introduction, some achieved results for the TTB criterion were published in a research journal [28]. In the training process, 5 Czech vowels were found [14] in the recorded speech of 6 speakers in the ExamStress database to obtain reference CTO values at the vowels’ beginning and centre part by averaging over the whole vowel. This means that for each speaker and vowel, 3 reference CTO values are stored, leading to 90 reference values in total for each speaker state and used method. These reference values are further used for training the GMM classifier. The second group of 6 speakers was used for testing the designed classifier. Due to the high quality and length of the records, approximately 1500 glottal pulses were analyzed and classified for each speaker.
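The two-model training setup described above can be sketched with scikit-learn’s `GaussianMixture`; the number of mixture components and the synthetic reference vectors are illustrative assumptions, not the thesis configuration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for the 90 reference CTO vectors per speaker state
# (kurtosis, skewness and area ratios).
refs_normal = rng.normal(1.0, 0.1, size=(90, 3))
refs_stress = rng.normal(1.4, 0.1, size=(90, 3))

# One GMM per speaker state, trained on the reference vectors.
gmm_normal = GaussianMixture(n_components=3, random_state=0).fit(refs_normal)
gmm_stress = GaussianMixture(n_components=3, random_state=0).fit(refs_stress)

def classify(cto_vector):
    """Assign a pulse to the state whose model scores it higher."""
    v = np.asarray(cto_vector, dtype=float).reshape(1, -1)
    return "stress" if gmm_stress.score(v) > gmm_normal.score(v) else "normal"

print(classify([1.0, 1.0, 1.0]))  # → normal
```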

For investigating the language independency of the presented methods, the SUSAS database was used. Compared to the ExamStress database, the low quality (e.g. voice distortion, clipping, loud background noise, etc.) of these records rapidly decreases the total number of estimated glottal pulses. It has been observed experimentally that only the processing of short parts (50 ms) of SUSAS records leads to satisfactory glottal pulse mining. Other lengths of the analyzed speech signal lead to the estimation of glottal pulses which do not match the Liljencrants-Fant model [29]. All mined glottal pulses were checked manually, because incorrectly estimated glottal pulses occur even for the short length of the analyzed signal. For each speaker in SUSAS, approximately 130 glottal pulses were correctly received and further used for psychological stress detection, irrespective of sound normalisation and vowels’ parts.

Comparing the results reached on the ExamStress database, the negative influence of sound normalisation can be seen as a significant decrease of efficiency over all observed n-percentage intervals. This effect is not so evident for the SUSAS database, where almost every reached efficiency is rapidly lower than its ExamStress equivalent, except the 80% and 100% observed intervals for Method 3, respectively the 15% and 55% chosen intervals, where the stress detection efficiency reaches 95%.

For the ExamStress database, the efficiency is more or less similar and approaches values over 90%. But for n-percentage intervals higher than 50%, the efficiency slightly decreases to the value 77%, respectively to 74%. By application of Method 5 and Method 6 on the SUSAS database, a similar efficiency is reached as for the ExamStress database, achieving high values over almost all n-percentage intervals. The only exceptions can be found in the 45% selected interval, respectively in the 65% and 80% chosen intervals, where both methods obtained poor and unsatisfactory efficiency. The DIF glottal pulse estimation algorithm can thus also be found appropriate for psychological stress detection. The effect of sound normalisation on stress recognition can be classified as minimal, as well as the effect of low-quality records and spoken language on the analyzed records.


Fig. 14 Reached efficiency for TTB CTOs, ExamStress (left) and SUSAS (right) database, GMM classifier.

As in the previous case for the ExamStress database, the IAIF estimation algorithm reaches generally lower recognition efficiency than the DIF algorithm, but it still gives satisfactory results on almost all selected n-percentage intervals. By the application of Method 7 and Method 8 on the SUSAS database, the recognition efficiency generally decreases sharply, by 20% on average, except on 4 n-percentage intervals where it reaches much higher values than for the ExamStress database. All received efficiency values for the GMM-based classifier are illustrated in Fig. 14 (left) for the ExamStress database and in Fig. 14 (right) for the SUSAS database, where method 1 is marked by a black solid line, method 2 by a black dashed line, method 3 by a green solid line, method 4 by a green dashed line, method 5 by a red solid line, method 6 by a red dashed line, method 7 by a blue solid line and method 8 by a blue dashed line. The colour coding of the used methods is the same as in the previous cases and will be used further in the following text. As in the previous case, the IAIF algorithm is sensitive to the quality of the analyzed records, and possibly to the spoken language. However, methods based on the IAIF algorithm applied on vowels’ beginning are more suitable for psychological stress detection than Method 7 and Method 8.

To conclude the results listed in the previous section, it is necessary to make a final evaluation of the used methods and n-percentage intervals. Firstly, the evaluation of all investigated n-percentage intervals is appropriate to find the most consecutive glottis parts where the highest differences between normal and stressed speech occur. Table 6 lists the average efficiency ε values reached for each selected n-percentage interval over all used methods and databases.

TABLE VI AVERAGE EFFICIENCY ε VALUES OVER ALL 8 METHODS AND BOTH DATABASES

n [%]    5    10    15    20    25    30    35
ε [%]  83.2  81.8  84.4  77.5  78.4  81.9  82.6

n [%]   40    45    50    55    60    65    70
ε [%]  77.6  74.2  80.9  80.8  83.2  69.1  82.5

n [%]   75    80    85    90    95   100
ε [%]  85.1  78.4  79.1  73.5  74.4  80.9

Obviously, the best n-percentage TTB amplitude intervals lie between 5 and 40%, where the average efficiency ε consistently reaches values of 77.5% and higher. Results averaged over the estimation method criterion show that no significant difference in reached ε exists on the observed n-percentage intervals between two used methods varying only in the application of



sound normalisation. Obviously, similar ε results are achieved by similar glottis estimation methods trained only on a varying vowel part. Finally, the highest average efficiency on the observed n-percentage intervals is reached by the DIF estimation method (88.5%), which achieved an ε higher by a significant 15.2% compared to the IAIF estimation algorithm (73.3%).
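The interval evaluation can be reproduced directly from the ε values in Table VI:

```python
import numpy as np

# Average efficiency values from Table VI (n = 5, 10, ..., 100 %).
n = np.arange(5, 101, 5)
eps = np.array([83.2, 81.8, 84.4, 77.5, 78.4, 81.9, 82.6,
                77.6, 74.2, 80.9, 80.8, 83.2, 69.1, 82.5,
                85.1, 78.4, 79.1, 73.5, 74.4, 80.9])

# Every interval in the 5-40 % range stays at or above 77.5 %...
assert np.all(eps[n <= 40] >= 77.5)
# ...and that range also beats the remaining intervals on average.
print(round(eps[n <= 40].mean(), 1), round(eps[n > 40].mean(), 1))  # → 80.9 78.5
```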

Discussion

According to the results reached for the GMM classifier, the most significant differences between normal and stressed speech occur in the 5% to 40% n-percentage intervals, where the average efficiency is consistently higher than 77.5%. On these observed intervals, the average efficiency ε was calculated, and the highest average value is reached by the DIF glottal pulse estimation algorithm, which approaches 88.5%, contrary to the IAIF algorithm reaching a significantly lower ε (73.3%). According to all achieved results, it can be said that recognition based on sound normalisation is sensitive to the sound quality, respectively to the spoken language, as is the IAIF estimation algorithm. The vowel part used for classifier training does not have a significant effect on the recognition efficiency.

In conclusion, the use of the presented system and the CTOs of glottal pulses estimated by the DIF estimation algorithm, applied specifically on TTB n-percentage intervals from 5% to 40%, can lead to highly efficient psychological stress recognition in speech. Obviously, the presented method could be classified as text- and language-independent, but it is necessary to verify the achieved observations on other languages and an expanded speaker database in the future.

Top-To-Bottom Closing-To-Opening phase ratio and other classifiers

As for the RTO feature, the efficiency and appropriateness of TTB CTOs for psychological stress recognition in speech was observed with a set of five other classifiers. The chosen classifiers are the same as for RTO. That means the Probabilistic Neural Network (PNN), Feed-Forward Neural Network (FFNN), k-Nearest Neighbour (kNN), Support Vector Machine (SVM) and Decision Tree (DT) are used in further experiments applied on the ExamStress and SUSAS databases. All classifiers were trained and tested as the previously described GMM classifier.

4.3.6. Discussion

This subsection presented the usage of TTB CTOs as the main set of features for psychological stress detection in speech. From the presented results, a few facts can be stated. The best results are reached by methods using the DIF estimation algorithm. These methods achieved very high recognition efficiency with the lowest variability. The methods using the vowels’ beginning for the initialization process achieved better results, but in general, an arbitrary part of the analysed vowel/phoneme will bring very satisfactory efficiency with, as stated, DIF-based methods. Another observation of the provided experiments is the decrease of efficiency caused by short-time sound normalisation. The TTB CTOs were tested with the set of designed classifiers and two speech databases (Czech and English) to prove how successful these features are in stress detection in speech. The most recommended combination is to use TTB CTOs with the FFNN, SVM or GMM classifiers, which reached the best recognition results (approaching 95%) for both databases. This fact points to the possible language and text independency of the presented methods. To prove this statement, it is necessary to test the TTB CTO features on an expanded set of language-varying databases.



The presented results also show that the most significant differences between speaker states can be found in the first half of the analysed n-percentage intervals. Specifically, the best results are achieved in the n-percentage analysed interval range from 5 to approx. 45%. By comparison of the obtained results, it can be said that TTB CTOs are more appropriate for stress detection in speech than the previously presented RTO features.

4.3.7. Bottom-To-Top Closing-To-Opening phase ratio in amplitude domain

The next way of analysing a glottal pulse as a distribution is similar to the previously introduced TTB CTOs. This type of analysis is also driven by a division into n-percentage intervals in the amplitude domain, with the difference that the 100% level is at the maximum of the analysed pulse. This way of glottal pulse analysis is called Bottom-To-Top Closing-To-Opening phase ratio, abbreviated as BTT CTO. As in the previous case, for the selected n-percentage interval of the analysed glottal pulse, the set of three ratio features (i.e. kurtosis, skewness and area) is calculated and further used for the classification of the actual emotional state of the speaker. Obviously, this type of analysis can be defined as the complement to TTB CTOs. An example of an analysed glottal pulse divided into an n-percentage interval and two phases according to the BTT definition is illustrated in Fig. 15, where the analysed part of the glottal pulse is bounded by a thick black line and the relevant phase parts are filled with light (opening) and dark (closing) grey colour. The calculation of CTO values is the same as for the TTB criterion (see the previous section).
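One possible reading of the BTT selection, as the complement of the TTB selection, keeps the samples at or below the n-percent amplitude level and splits them at the peak; names and the exact boundary handling are assumptions for illustration:

```python
import numpy as np

def btt_segment(pulse, n_percent):
    """Bottom-To-Top selection: 0 % at the pulse bottom, 100 % at the
    maximum. Keeps the samples at or below the n-percent amplitude
    level and splits them at the peak into opening/closing parts."""
    peak = int(np.argmax(pulse))
    level = n_percent / 100.0
    below = pulse <= level
    opening = pulse[:peak + 1][below[:peak + 1]]
    closing = pulse[peak:][below[peak:]]
    return opening, closing

# A symmetric triangular test pulse, amplitude-normalised to 1.
pulse = np.concatenate([np.linspace(0, 1, 11), np.linspace(1, 0, 11)[1:]])
o, c = btt_segment(pulse, 50)
print(len(o), len(c))  # → 6 6
```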

Fig. 15 An example of n-percentage division for the BTT criterion of the CTO features.

As in the previous case, the BTT CTO features were calculated for the first part of the Czech ExamStress database. These values were used for training the designed classifier with further data validation. Then, the second part of the Czech ExamStress [21] database was used for psychological stress detection in speech by the designed classifier, as well as the English SUSAS database [27]. The used methods (methods 1-8) are also the same as in the previous experiments, with the same colour coding. This process description is very brief, because it is the same as for the TTB criterion, where the whole recognition process is described in more detail. The set of used classifiers is also the same as for the TTB criterion, with the difference that each classifier’s inner structure was modified for the best recognition results.



4.3.8. Discussion

The use of the CTO features with the division in the BTT criterion was described in this subsection, as well as the results achieved by the 8 observed estimation methods for the 6 designed classifiers. The best results were achieved for the GMM classifier and all methods, further for the PNN classifier and all methods (SUSAS database), and for the combination of the SVM classifier and DIF-based methods. In the performed experiments, the received results are significantly worse than for the TTB criterion, and the most significant paradox is the higher recognition efficiency for the English language. In the case of n-percentage intervals, the most stable and best results were achieved for higher values, i.e. observed intervals higher than 60%.

4.4. Conclusions on using CTOs and RTOs

In the provided experiments, the comparison of the two division criteria for the CTO features and the RTO features was made. Obviously, the analysis of mined and normalised glottal pulses is a good way to recognize stress in the speech signal, and it can possibly be successful for other emotions (pending further experimental proof). The achieved results show that psychological stress detection in speech based on the analysis of glottal pulses can be language- and phoneme-independent. This statement was proved by the analysis of glottal pulses mined over the whole captured signal (all voiced parts of the signal) for the Czech and English languages. This statement should be further proved by more experiments based on other languages.

In general, the most suitable and successful are the CTOs divided by the TTB criterion. The RTOs achieved very satisfactory results as well, but due to time requirements, their efficiency was not observed on the SUSAS database. Obviously, both sets of features reached the best recognition results for lower values of the analysed intervals, which leads to the conclusion that the biggest differences in the speaker’s state are in the surroundings of the glottal pulse’s maximum. According to the achieved results, the GMM, FFNN and SVM can be marked as the most successful classifiers, and the DIF-based methods (specifically method 1 and method 3) can be marked as the best way for stress recognition in speech using CTO or RTO features. The previous text implies that the division of the glottal pulse by the BTT criterion is not so suitable a way to detect psychological stress from the speech signal. Anyway, the presented observations should be further verified on expanded databases.

5. COG shift

The last investigated method of stress recognition in speech was based on the shift of the Centre Of Gravity (COG) in glottal pulses. Specifically, the differences between normal and stressed speech were observed for a chosen set of speakers in the created ExamStress database. Thus, the differences in COG shift were obtained for Czech native male speakers, specifically for 27 speakers in the normal state and for 37 speakers under the stress influence. The higher number of speakers under stress is due to the fact that it is much easier to capture their presentation within the final exam defense than a few days after passing it. In general, the COG shift can be defined as a difference towards the peak of the normalised glottal pulse depending on the speaker’s state. That means each individual glottal pulse is normalised in the amplitude and time domain and further divided by the peak’s position into the opening and closing phase, as in the previous cases. Then, the COG is calculated for each individual pulse, and its position, specifically the horizontal coordinate (relative length), is used for further calculation. The COG shift w is defined by the following equation
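Since Eq. (7) itself is missing from this excerpt, the following is only an assumed reading of the COG shift: the signed distance of the pulse’s centre of gravity from its peak position on the normalised time axis (all names are illustrative):

```python
import numpy as np

def cog_position(pulse):
    """Relative-length coordinate of the Centre Of Gravity of an
    amplitude- and time-normalised glottal pulse."""
    t = np.linspace(0.0, 1.0, len(pulse))
    return float(np.sum(t * pulse) / np.sum(pulse))

def cog_shift(pulse):
    """Assumed COG shift w: COG position minus peak position."""
    peak_pos = np.argmax(pulse) / (len(pulse) - 1)
    return cog_position(pulse) - peak_pos

# A symmetric pulse has its COG at the peak, so w is (numerically) zero.
pulse = np.sin(np.linspace(0.0, np.pi, 101))
print(abs(cog_shift(pulse)) < 1e-9)  # → True
```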

, (7)


The results achieved by statistical testing proved some observations evident from the previously presented results. For the methods which do not use short-time sound normalisation, the null hypothesis was rejected in all cases. Thus, method A and method B (including their absolute-value modifications) are appropriate for stress recognition in the speech signal, because the experimentally achieved w values have different means, variances, distributions etc. for each speaker state. This variation supports the further successful usage of the COG shift feature with an appropriately designed classifier for stress recognition in the speech signal.

In some cases of χ2 goodness-of-fit testing, the alternative hypothesis could be accepted for the method using sound normalisation, but only if the dataset for the normal state is defined as either the theoretical or the empirical set, not both. Because of this uncertainty in rejecting the null hypothesis, the better solution is to accept the null hypothesis stating that both distributions are the same. This acceptance can be confirmed by an expert's inspection of the graphical representation of the mined data.
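The hypothesis-testing step above can be sketched generically as a minimal two-sample comparison using Welch's t-statistic with a normal-approximation threshold. This is not the exact test suite used in the thesis (which also applied χ2 goodness-of-fit tests), and the sample COG-shift values are invented for illustration.

```python
import math

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def reject_null(a, b, z_crit=1.96):
    """Reject H0 (equal means) at roughly the 5 % level (normal approx.)."""
    return abs(welch_t(a, b)) > z_crit

# Invented COG-shift samples: 'normal' centred near zero, 'stress' shifted
normal = [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, 0.00, -0.03]
stress = [0.08, 0.11, 0.09, 0.12, 0.07, 0.10, 0.13, 0.09]
```

With clearly separated means relative to the within-group variance, the null hypothesis of equal means is rejected, which is the kind of outcome reported above for the non-normalised methods.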

5.3. Discussion

The appropriateness of the COG shift as a feature for stress recognition was verified by applying a set of chosen statistical tests. These tests examined whether the statistical values over the whole population differ significantly from those of the experimental datasets, and whether the COG shift varies with the speaker's state. The best results were achieved by method B (IAIF estimation) and its absolute-value modification. Somewhat worse, but still very good, results were achieved by method B (DIF estimation) and its absolute-value modification. For these methods, all statistical tests confirmed that the distribution and its parameters differ significantly between the normal and the stressed speaker state. Thus, the COG shift criterion obtained by these methods can be further used as the main or an accompanying feature for stress recognition in speech. Its appropriateness for recognising other emotions should be tested experimentally on relevant speech databases.

6. Final Conclusion

The original aim of the presented doctoral thesis was to develop algorithms for emotion recognition in the speech signal. Owing to the difficulty of capturing real emotions for as large a group of speakers as possible, the research focused on the recognition of psychological stress in speech. All experiments were carried out on a purpose-built database containing speech of Czech native speakers recorded in the normal state and under psychological stress. The real stress records were captured during the final exams of our students; the normal state was captured, for almost all speakers, a few days later in a more comfortable environment. The presented results can therefore uncover real differences in the speech signal, which could be questionable for databases containing acted emotional states.

The thesis aims to be innovative by introducing methods of speech analysis in both the frequency and the time domain. First, the research focused on vowel analysis. An algorithm for vowel segmentation was developed, and the vowels found in fluent speech were further analysed in the frequency domain. Specifically, formant analysis was used to generate two-dimensional objects from the obtained formant values, called vowel polygons. The initial analysis of vowel polygons, namely vowel triangles, examined which triangles provide the highest uniformity within the speaker database, in order to find the optimal vowel polygon representing the normal speaker state, with suitability for speaker identification as a side product.

In the time domain, the analysis focused on processing the glottal flow signal estimated by two commonly used estimation methods. Each estimated glottal pulse is separated from the glottal flow signal


and then it is normalised in amplitude and time and further divided into two parts: the opening and the closing phase. The basic idea is to analyse glottal pulses as two-dimensional objects, e.g. as probability distributions, by calculating relevant parameters (skewness, kurtosis, area) of both phases. Each phase, or the whole pulse, is first divided into analysed intervals according to a chosen criterion in the time or amplitude domain. In general, TTB CTOs can be very successful for stress recognition if an appropriate classifier and method (mostly DIF-based methods) are chosen. The best recognition results are reached for the majority of the analysed n-percentage intervals, which suggests stable recognition performance. Compared to the RTOs, the TTB CTOs are more successful in stress recognition, and the method may be language- and speech-independent, because it reached very satisfactory stress recognition results (higher than 90 %) for English as well as for Czech. Further experiments have to address this claim to prove whether TTB CTOs are language-independent and whether they can be used for other emotions.
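The phase parametrisation described above can be sketched as follows: a pulse is split at its peak and each phase is treated as an (unnormalised) distribution whose area, skewness and kurtosis are computed. The pulse samples and the helper name are illustrative assumptions, not the thesis code.

```python
def phase_params(phase, x0=0.0, dx=1.0):
    """Area, skewness and kurtosis of one pulse phase treated as a distribution.

    `phase` holds amplitude samples; sample positions start at x0 with step dx.
    """
    xs = [x0 + i * dx for i in range(len(phase))]
    area = sum(phase) * dx                         # rectangle-rule area
    mass = sum(phase)
    mean = sum(x * a for x, a in zip(xs, phase)) / mass
    var = sum(((x - mean) ** 2) * a for x, a in zip(xs, phase)) / mass
    std = var ** 0.5
    skew = sum(((x - mean) ** 3) * a for x, a in zip(xs, phase)) / (mass * std ** 3)
    kurt = sum(((x - mean) ** 4) * a for x, a in zip(xs, phase)) / (mass * std ** 4)
    return area, skew, kurt

# Split an illustrative normalised pulse at its peak, as described in the text
pulse = [0.0, 0.3, 0.6, 0.9, 1.0, 0.6, 0.2]
peak = pulse.index(max(pulse))
opening, closing = pulse[:peak + 1], pulse[peak:]
```

The resulting (area, skewness, kurtosis) triplets of the two phases are the kind of per-interval descriptors that a classifier would then consume.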

The last method of psychological stress recognition in speech was based on a statistical analysis of the COG shift, which can be defined as the distance, with the appropriate direction, between the parallel lines passing through the Centre Of Gravity and through the pulse peak. In this case, glottal pulses are again analysed as two-dimensional objects. The datasets of COG shifts were obtained by four different methods (and four modified methods) for the Czech speakers of the ExamStress database. According to the set of performed statistical tests, COG shifts obtained from speech not normalised by the short-time criterion can be successfully used for stress recognition. The best results were achieved by the IAIF-based estimation methods, but DIF reached very good results as well; both can serve as a new feature for stress recognition in speech, because the means, variances, distributions, etc. differ significantly between the normal and the stressed state. This does not hold for the methods using short-time sound normalisation, which lead to high uniformity of the mined glottal pulses across the individual speaker states. Thus, the COG shift can be used as the main or an accompanying feature for stress recognition, and it should be further examined on other languages and other speaker states to uncover facts similar to those found for the CTOs.

To conclude, this doctoral thesis highlights novel approaches to glottal flow analysis which achieved very good recognition efficiency and seem promising for further research into language and phoneme independence. The algorithms presented in this thesis are unique in their processing of speech features, which are mostly treated as two-dimensional graphic objects.
In speech processing, specifically in emotion (stress) recognition, the use of graphic vowel polygons and of glottal pulses analysed as distributions has not been studied before and, judging from the achieved results, these methods could open new directions in the relevant fields of speech processing.

Future work should focus on further development of the presented methods. For example, vowel polygons could be expanded by another dimension (another formant) into three-dimensional objects, or the longest side of a vowel polygon could be used as a rotation axis to create another three-dimensional object for analysis. Within the glottal pulse criterion, future work should concentrate more on the similarity of the mined glottal pulses to a distribution. Here, confidence intervals, median, mode and mean values should be examined, similarly to process engineering in manufacturing. The glottal flow could also be inverted and the closed phase analysed in the same ways presented in this thesis. These algorithms, especially those based on the processing of glottal pulses, are suitable for implementation in microprocessors or digital signal processors for near-real-time stress recognition in speech due to their relative simplicity, and can therefore be used in various applications (e.g. monitoring of professional pilots, health monitoring, lie detection, etc.).
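As a two-dimensional baseline for the proposed three-dimensional extension, the area of a vowel polygon in the (F1, F2) formant plane can be computed with the shoelace formula; the formant coordinates below are invented for illustration, not measurements from the thesis.

```python
def polygon_area(vertices):
    """Shoelace formula: area of a polygon given (F1, F2) vertices in order."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# Illustrative (F1, F2) positions in Hz for a vowel triangle /a/-/i/-/u/
triangle = [(800, 1300), (300, 2300), (350, 800)]
area = polygon_area(triangle)
```

Adding a third formant as a z-coordinate, as suggested above, would replace this planar area with the surface area or volume of a polyhedron built from (F1, F2, F3) vertices.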


References

[1] S.J.C. GAULIN AND D.H. MCBURNEY, Evolutionary Psychology. Boston: Prentice Hall, 2003, ISBN 978–0–13–111529–3.

[2] C. DARWIN, The Expression of the Emotions in Man and Animals. London: John Murray, 1872.

[3] E. FOX, Emotion Science: An Integration of Cognitive and Neuroscientific Approaches to Understanding Human Emotions. Palgrave MacMillan, 2008, ISBN 978–0–230–00517–4.

[4] S.G. KOOLAGUDI AND S.R. KROTHAPALLI, “Emotion Recognition from Speech: A Review,” International Journal of Speech Technology, vol. 15, no. 2, pp. 99–117, 2012.

[5] M.E. AYADI, M.S. KAMEL, AND F. KARRAY, “Survey on Speech Emotion Recognition: Features, Classification Schemes and Databases,” Pattern Recognition, vol. 44, pp. 572–587, 2011.

[6] Z. ZHANG, F. WENINGER, M. WÖLLMER, AND B. SCHULLER, “Unsupervised Learning in Cross-Corpus Acoustic Emotion Recognition,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Waikoloa (Hawaii, USA), 2011, pp. 523–528.

[7] N. SHARMA AND T. GEDEON, “Objective Measures, Sensors and Computational Techniques for Stress Recognition and Classification: A Survey,” Computer Methods and Programs in Biomedicine, vol. 108, no. 3, pp. 1287–1301, 2012.

[8] B. VLASENKO, B. SCHULLER ET AL., “Balancing Spoken Content Adaptation and Unit Length in the Recognition of Emotion and Interest,” in Proc. Conference of the International Speech Communication Association INTERSPEECH 2008. Brisbane (Australia), 2008, pp. 805–808.

[9] L. HE, M. LECH, N.C. MADDAGE, AND N.B. ALLEN, “Study of Empirical Mode Decomposition and Spectral Analysis for Stress and Emotion Classification in Natural Speech,” Biomedical Signal Processing and Control, vol. 6, no. 2, pp. 139–146, 2011.

[10] S. MOON AND B. LINDBLOM, “Interaction between Duration, Context, and Speaking Style in English Stressed Vowels,” Journal of the Acoustical Society of America, vol. 96, no. 40, 1994.

[11] A.I. ILIEV, M.S. SCORDILIS, J.P. PAPA, AND A.X. FALCO, “Spoken Emotion Recognition through Optimum-Path Forest Classification Using Glottal Features,” Computer Speech & Language, vol. 24, no. 3, pp. 445–460, 2010.

[12] M. LEWIS, J.M. HAVILAND-JONES, AND L.F. BARRETT, Handbook of Emotions. New York: The Guilford Press, 2008, ISBN 978–1–59385–650–2.

[13] J. PSUTKA, L. MÜLLER, J. MATOUŠEK, AND V. RADOVÁ, Mluvíme s počítačem česky. Prague: Academia, 2006, ISBN 80-200-1309-1, in Czech.

[14] M. STANĚK AND L. POLÁK, “Algorithms for Vowel Recognition in Fluent Speech Based on Formant Positions,” in Proc. 36th International Conference on Telecommunication and Signal Processing (TSP) 2013. Rome (Italy), 2013, pp. 521–525.

[15] R.C. SNELL AND F. MILINAZZO, “Formant Location from LPC Analysis Data,” IEEE Transactions on Speech and Audio Processing, vol. 1, pp. 129–134, 1993.

[16] M. STANĚK, “Software for Generation and Analysis of Vowel Polygons,” in Proc. 37th International Conference on Telecommunication and Signal Processing (TSP) 2014. Berlin (Germany), 2014, pp. 424–427.

[17] M. STANĚK AND M. SIGMUND, “Porovnání efektivity řečových spektrálních parametrů pro identifikaci mluvčích,” Elektrorevue- Internetový časopis, vol. 2013, no. 8, pp. 1–8, 2013, in Czech.

[18] M. STANĚK AND M. SIGMUND, “Speaker Dependent Changes in Formants Based on Normalization of Vowel Triangle,” in Proc. 23rd International Conference RADIOELEKTRONIKA 2013. Pardubice (Czech Republic), 2013, pp. 337–341.

[19] M. STANĚK AND M. SIGMUND, “Comparison of Speaker Individuality in Triangle Areas of Plane Formant Spaces,” in Proc. 24th International Conference RADIOELEKTRONIKA 2014. Bratislava (Slovakia), 2014, pp. 1–4.


[20] M. STANĚK AND M. SIGMUND, “Speaker Distinction Using Vowel Polygons: Experimental Study,” in Proc. 25th International Conference RADIOELEKTRONIKA 2015. Pardubice (Czech Republic), 2015, pp. 125–128.

[21] M. SIGMUND, “Introducing the Database ExamStress for Speech Under Stress,” in Proc. 7th Nordic Signal Processing Symposium NORSIG 2006. Reykjavik (Iceland), 2006, pp. 290–293.

[22] T. DRUGMAN, P. ALKU, A. ALWAN, AND B. YEGNANARAYANA, “Glottal source processing: From analysis to applications,” Computer Speech & Language, vol. 28, no. 5, pp. 1117–1138, 2014.

[23] M. AIRAS, “TKK Aparat: an environment for voice inverse filtering and parameterization,” Logopedics, phoniatrics, vocology, vol. 33, no. 1, pp. 49–68, 2008.

[24] M. STANĚK AND M. SIGMUND, “Psychological Stress Detection in Speech Using Return-To-Opening Phase Ratios in Glottis,” Elektronika ir Elektrotechnika, vol. 21, no. 5, pp. 59–63, 2015.

[25] M. STANĚK AND M. SIGMUND, “Finding the Most Uniform Vowel Polygon Behavior Caused by Psychological Stress Influence,” Radioengineering, vol. 24, no. 2, pp. 604–609, 2015.

[26] M. SIGMUND, A. PROKEŠ, AND Z. BRABEC, “Statistical analysis of glottal pulses in speech under psychological stress,” in Proc. EUSIPCO 2008. Lausanne (Switzerland), 2008, pp. 1–5.

[27] J.H.L. HANSEN AND S. BOU-GHAZALE, “Getting Started with SUSAS: A Speech Under Simulated and Actual Stress Database,” in Proc. EUROSPEECH 1997. Rhodes (Greece), 1997, pp. 1743–1746.

[28] M. STANĚK AND M. SIGMUND, “Analysis of Closing-To-Opening Phase Ratio in Top-To-Bottom Glottal Pulse Segmentation for Psychological Stress Detection,” Elektronika ir Elektrotechnika. Accepted for publication, 2016.

[29] G. FANT, J. LILJENCRANTS, AND Q. LIN, “A four-parameter model of glottal flow,” STL-QPSR, vol. 26, no. 4, pp. 1–13, 1985.


Curriculum Vitae

Name: Miroslav STANĚK

Born: September 13th 1987 in Litoměřice

Contact: [email protected]

Education

2012 – 2016 Brno University of Technology / Department of Radio Electronics Ph.D. study of Electronics and Communication Dissertation on Stress Recognition from Speech Signal

2010 – 2012 Brno University of Technology / Department of Radio Electronics Master’s study of Electronics and Communication Diploma thesis on Multimedia Signal Processing

2011 – 2013 Brno University of Technology / Department of Forensic Engineering Master’s study of Real Estate Engineering

2007 – 2010 Brno University of Technology / Department of Radio Electronics Bachelor’s study of Electronics and Communication Bachelor’s thesis on Voltage Source Controlled and Fed by USB Bus

1999 – 2007 Gymnázium Josefa Jungmanna, Litoměřice

Experience

8/08 – 12/15 ALS Czech Republic Logistics employee

4/2016 – Honeywell Technology Solutions Software Design Engineer at Honeywell Aerospace

Languages

English, German


Abstract

The presented doctoral thesis focuses on the development of algorithms for psychological stress detection in the speech signal. The novelty of this thesis lies in two different analyses of the speech signal: the analysis of vowel polygons and the analysis of glottal pulses. The performed experiments show that both fundamental analyses can be used for psychological stress detection in speech. The analysis of glottal pulses in the amplitude domain according to the Top-To-Bottom criterion, combined with a properly chosen classifier, appears to be the most effective and can be regarded as a language- and phoneme-independent approach to stress recognition. All experiments were performed on a newly developed Czech database of real stress, and some observations were also made on the English database SUSAS. The variety of potentially effective approaches to stress recognition in speech suggests that their combination could reach very high recognition accuracy, or that they could be used for detecting other speaker states; this has to be further tested and verified on appropriate databases.

Abstrakt

Předložená disertační práce se zabývá vývojem algoritmů pro detekci stresu z řečového signálu. Inovativnost této práce se vyznačuje dvěma typy analýzy řečového signálu, a to za použití samohláskových polygonů a analýzy hlasivkových pulsů. Obě tyto základní analýzy mohou sloužit k detekci stresu v řečovém signálu, což bylo dokázáno sérií provedených experimentů. Nejlepších výsledků bylo dosaženo pomocí tzv. Closing-To-Opening phase ratio příznaku v Top-To-Bottom kritériu v kombinaci s vhodným klasifikátorem. Detekce stresu založená na této analýze může být definována jako jazykově i fonémově nezávislá, což bylo rovněž dokázáno získanými výsledky, které dosahují v některých případech až 95% úspěšnosti. Všechny experimenty byly provedeny na vytvořené české databázi obsahující reálný stres, a některé experimenty byly také provedeny pro anglickou stresovou databázi SUSAS.
