Caller Identification by Voice

Marcin Witkowski, Magdalena Igras, Joanna Grzybowska, Pawel Jaciow, Jakub Galka and Mariusz Ziolko

AGH University of Science and Technology, Department of Electronics, Kraków PL-30059

{witkow | migras | gjoanna}@agh.edu.pl, [email protected], {jgalka | ziolko}@agh.edu.pl

Index Terms - Speaker Verification, Speaker Recognition, Speaker Identification, Emotion Detection, Age Detection, Acoustic Background Detection

Abstract - The aim of our work is to develop software for caller identification, or for creating a caller's characteristic profile by analysis of his or her voice. Based on collected speech samples, our system aims to identify emergency callers both on-line and off-line. This homeland security project covers speaker recognition (when the speaker's speech sample is known), detection of the speaker's gender and age, and recognition of emotions. The proposed system is not limited to biometrics. The goal of this application is to provide an innovative supporting tool for rapid and accurate threat detection and threat neutralization. This complex system will include speech signal analysis, automatic development of a speech pattern database, and appropriate classification methods.

I. INTRODUCTION

Dialling an emergency number is usually the first and easiest way to get help in any critical situation. Even though these numbers differ around the world, a huge effort is being made to improve the quality of this essential service, which has to be as fast and as reliable as the current technology level allows. The basic idea of these systems is to connect the caller with a responder who classifies the particular notification and assigns it to the related emergency service. Even with the best communication technology, the experience and abilities of operators are crucial during the notification process. Every second of an emergency call may be considered a delay between asking for help and dispatching the necessary services; therefore, the performance of responders has a direct influence on public security.

This mentally exhausting type of work leads to loss of concentration. Emergency responders have to answer and process a huge amount of information during a single call. Besides determining the caller's identity, they have to create a note which contains a detailed description of the notification. Preparing such a note may take a long time, for example when a notifier is intoxicated, emotionally aroused or has problems with verbalization. Therefore, any system that effectively supports this process reduces call time and increases the effectiveness of responders. Consequently, they may concentrate on conducting a proper conversation and on efficiently providing immediate and appropriate help in emergency situations.

The aim of our work is to develop a support tool that will be deployed in Polish Emergency Call Centers (ECC). Our software system will operate on acoustic signals obtained from the telecommunication channel. This requirement has a direct influence on the quality of the analyzed signal, i.e. an acoustic band limited to 300-3400 Hz, a minimal bitrate equal to 12 kbps, and a reduced amount of information as a result of compression. Secondly, the solution has to be robust, since it presumes operation on a speech signal that is degraded by the phone line and often distorted by noisy environmental background sounds. The graphical interface is also an important issue in such applications. Since it is meant to accelerate the work of an emergency responder, it should be very clear and should quickly provide information that can be read intuitively with minimal effort.

In this paper we present advances in the development of a complex system that identifies a caller by voice characteristics. The system will integrate multiple pattern recognition techniques that will help in gathering information that might not otherwise be noticed during an emergency call. In Sections II and III we present, respectively, the prototype of the interface of the end-user application and the collected corpora, which include real recordings of emergency calls. Section IV contains a description of the researched methods of acoustic signal analysis in the context of the project.

II. GENERAL SYSTEM DESCRIPTION

The software that will result from this research will be able to be integrated with existing ECC software. It is assumed that the caller's signal is the only input of the system. The output may be adjusted to the ECC software or work as a separate tool. The automatically recognized profile of a caller will be output directly to a responder via the designed interface and stored in a dedicated database.

The main goal of the caller identification software is to present, clearly and intuitively, a caller profile containing specific features that may be extracted by acoustic analysis of a speech sample. The diversified quality and noise type of the acquired signal have a strong impact on the detection efficiency of each feature. Therefore, it is necessary to include confidence or likelihood indicators for each recognized feature.

The interface is designed to work in real time, during an emergency call. The value of each parameter will be shown immediately after its detection or recognition. Simultaneous archiving in text form should be an optional feature set by the system administrator.

There is a lot of information that may be obtained from the analysis of a speaker's voice signal. Based on literature studies and the experience of the Digital Signal Processing Group at AGH University of Science and Technology, the following caller information was chosen for extraction:

• identity;
• gender and age;
• emotional state;
• speech rate;
• physical features;
• substance intoxication;
• language;
• acoustic background.

These features were chosen in line with the signal limitations mentioned above.

Fig. 1 presents the prototype of the interface. In the design phase of this tool, strong emphasis was put on clarity, to meet the requirement that the interface should be usable with minimal attention. Therefore, the number of graphical elements such as icons, symbols and colourful images exceeds the amount of text and numeric values within the tool.

Fig. 1. The interface prototype: panels show the acoustic background, caller identity candidates, gender and age, physical features, voice characteristics (pathological voice, filled pauses, stutter, repeated phrases), substance intoxication, emotional state on a valence-arousal diagram, speaking speed (slow to fast), and the language detected in the background.

III. CORPORA

Thanks to the Malopolska Emergency Call Center (MECC), project researchers were permitted to collect recordings from the real database stored in the ECC. Due to personal data protection, it was necessary to remove the responders' channel as well as any kind of critical data: names, addresses, spoken phone numbers or license plate numbers. The occurrence of such data in an emergency call is random, so the entire process had to be done manually, listening carefully to each sample. To reduce the amount of work on database creation, it was decided to perform annotation at the same time, and a special tool for anonymization and tagging was created for that purpose. Each file was tagged with a unique string of characters which clearly described the content according to previously determined categories (one possible tag layout is sketched after the list below). Those categories included the following characteristics:

• gender;
• age class (child, teenager, adult, elder);
• speech rate (slow, normal, fast);
• emotion (neutral, negative, positive and its intensity);
• acoustic background (train, plane, animals, people, street);
• substance intoxication (alcohol, drugs);
• health (pathology in the respiratory system or in the vocal tract);
• characteristic vocabulary (filled pauses);
• conversation style (normal, chaotic);
• origin (characteristic).

The collected corpus consists of 45 h 07 min of recordings, which includes 20 h 10 min of speech signal. It is stored as a set of 3307 audio .wav files, each at least 30 seconds long. Each file contains a mono signal sampled at 8000 Hz with 16-bit-per-sample resolution.


The value of this database is crucial to our research, as it makes it possible to verify the developed algorithms on real, not artificially generated, data.


IV. SPEECH SIGNAL PROCESSING ALGORITHMS

This project aims at practical and applicable solutions. Therefore, we focused on combining existing methods for the contributing tasks rather than developing new ones; however, some modifications of those methods had to be introduced. This section contains the description of the researched methods of speech signal processing devoted to supporting emergency responders.

A. Speaker Recognition

Speaker recognition is the process of designating a person based on his or her voice characteristics. The main tasks of a speaker recognition system may be divided into verification and identification. The aim of identification is to choose one of many speakers based on a speech signal, whereas verification is the process of determining whether the assigned speaker was chosen correctly. In line with the particular usage specification, those systems may be divided into text-dependent and text-independent. A text-dependent system assumes that the recognition process is based on a specific fixed phrase, i.e. each analyzed recording contains the same sentence. In a text-independent scenario, speakers may be identified or verified by any utterance [1]. The second type of system is more challenging, since it is much more complicated due to phonetically mismatched voice samples in the training and recognition phases. Due to the character of the analyzed signal, a text-independent system is the one under development in this project.

Automatic speaker recognition systems consist of two main functionalities: enrollment and identification. The aim of the enrollment process is to create a compact set of parameters that discriminates one speaker from another and may be used for further identification or verification. These sets, called models or voiceprints, are created in three major steps: pre-processing, parameterization, and stochastic modelling [2]. A diagram of the enrollment process is shown in Fig. 2.

Fig. 2. Speaker enrollment process: user voice samples undergo preprocessing and parametrization, followed by modeling; the resulting user model is stored in the database.

The aim of the pre-processing phase is to designate which parts of an analyzed signal include the speaker's voice and to prepare those parts for further processing. Voice detection is performed using the zero crossing rate (ZCR), 4 Hz frequency energy modulation, and the variance of energy computed for each time frame in different acoustic bands. The decision about the presence of a voice signal is driven by those parameters, calculated for a single frame or for multiple frames [3]. One frame contains 20 ms of signal, over which the speech waveform may be considered a stationary process.
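To make the frame-based decision concrete, the following minimal sketch flags speech frames using two of the cues named above, short-time energy and ZCR. The thresholds are illustrative assumptions, and the 4 Hz modulation-energy cue used in the actual detector is omitted here.

import numpy as np

def frame_signal(x, sr, frame_ms=20):
    """Split a signal into non-overlapping frames of frame_ms milliseconds."""
    n = int(sr * frame_ms / 1000)
    n_frames = len(x) // n
    return x[:n_frames * n].reshape(n_frames, n)

def simple_vad(x, sr, energy_factor=0.5, zcr_max=0.25):
    """Mark a frame as voice when its energy is high relative to the mean
    and its zero crossing rate is moderate (pure noise tends to be higher)."""
    frames = frame_signal(x, sr)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return (energy > energy_factor * energy.mean()) & (zcr < zcr_max)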

Fig. 3. Illustration of psychoacoustic cepstral speech parameterization; A: speech sample, B: Mel-frequency Cepstrum (MFC), C: Mel-frequency Cepstral Coefficients (MFCC)

Speaker-dependent features are extracted in the parameterization step. There are many discriminative parameters that may be used to distinguish a speaker, and they vary from low-level to high-level. Low-level features, usually calculated per frame, contain information regarding voice generation and vocal tract physiognomy. High-level features are associated with the prosody and vocabulary used by each speaker. The first group may be calculated more easily than the second one, since it is mostly based on frequency analysis. High-level parameters are estimated on the basis of longer periods of time, which often includes reanalysis of the data obtained in low-level feature extraction. In our system we use the following low-level features: the Teager Energy Operator (TEO) [4]; the most common in this kind of application, Mel-frequency cepstral coefficients (MFCC) [1], presented in Fig. 3; Chirp Group Delay Zero Poles (CGDZP) [5]; and, originally developed at AGH UST in the Digital Signal Processing Group, the Psychoacoustic Wavelet Fourier Transform (PWFT) [6]. High-level features include the variance of the fundamental frequency (F0) contour, the frequencies of the F1 to F4 formants, and the occurrence of characteristic pauses. Contours are approximated with a polynomial of sixth degree.
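The sketch below extracts per-frame MFCCs and fits the sixth-degree polynomial to an F0 contour, as described above. It assumes the librosa library; the YIN estimator used here for F0 is one common choice, not necessarily the paper's.

import numpy as np
import librosa

def low_and_high_level_features(wav_path, n_mfcc=13, poly_degree=6):
    """Per-frame MFCCs (low-level) plus a 6th-degree polynomial
    approximation of the F0 contour (high-level)."""
    y, sr = librosa.load(wav_path, sr=8000)        # telephone-band audio
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)  # one F0 value per frame
    t = np.arange(len(f0))
    f0_poly = np.polyfit(t, f0, deg=poly_degree)   # contour approximation
    return mfcc, f0_poly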

The extracted voice features are aggregated into one matrix whose dimensions depend on the analyzed file.

Stochastic modelling is the process of speaker voiceprint creation based on the extracted features. In our system we use Gaussian Mixture Models (GMM) with 512 Gaussian components and left-to-right Hidden Markov Models (HMM) whose number of states depends on the voice class in question, with 3 to 10 states. The emission model in each state is represented by a GMM with 8 to 256 components. The number of components is related to the amount of training data and is established automatically during the training process. Thirty seconds is the minimum length of speech signal needed to sufficiently train an HMM; this value has been established empirically. Training of these statistical voiceprints is based on standard algorithms for GMM and HMM model estimation, i.e. the Expectation Maximization (EM) method, in this case realized with Baum-Welch estimation, and the adaptive Maximum A Posteriori (MAP) algorithm [7]. These techniques allow voiceprints to be adapted in a working system, which results in efficiency improvement of the system over time. Models are stored in the database as sets of matrices with model parameters.
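The following minimal sketch shows EM-based GMM training and relevance-MAP adaptation of the component means in the spirit of Reynolds et al. [7]; scikit-learn, diagonal covariances and the relevance factor of 16 are our assumptions, not the paper's settings.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=512):
    """Fit a GMM by EM on pooled frame-level feature vectors (rows)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(features)
    return gmm

def map_adapt_means(background_gmm, speaker_features, relevance=16.0):
    """Relevance-MAP adaptation: shift each component mean towards the
    speaker's data in proportion to the soft count it receives [7]."""
    post = background_gmm.predict_proba(speaker_features)   # (frames, comps)
    n_k = post.sum(axis=0)                                  # soft counts
    e_k = post.T @ speaker_features / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]              # adaptation weight
    return alpha * e_k + (1.0 - alpha) * background_gmm.means_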

The identification process, presented in Fig. 4, is devoted to comparing the voice sample of an unknown user with multiple voiceprints stored beforehand in the database, and to measuring the similarity between them. Based on the calculated similarity scores, the system decides which user from the database is most likely to have generated the given unknown voice sample. The process of identification may be divided into three steps: pre-processing and parameterization, multiple verification, and score classification. The first step is similar to the one in the enrollment. After that, the extracted features are used to calculate a probability which determines how similar the feature vector is to the voiceprints in the database. The comparison of features with a single model is a verification. To prevent underflow problems, probabilities are calculated in the log domain and called log-likelihoods. Assuming that N voiceprints from the database were analyzed, N likelihoods are calculated in this step. Those likelihoods are then classified by a Support Vector Machine (SVM) classifier or an Artificial Neural Network (ANN), which leads to a final score for each analyzed voiceprint. Empirical tests showed that even a 5-second-long speech sample includes enough information to sufficiently extract low-level features and compute an adequate score.

Fig. 4. Speaker identification process: an unknown voice sample undergoes preprocessing and parametrization, followed by multiple verifications against the stored voiceprints.
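A minimal sketch of the multiple-verification step follows: each enrolled model yields one log-likelihood, and for illustration the best match is chosen by a simple argmax instead of the SVM/ANN score classifier described above.

def identify(features, speaker_models):
    """Score an unknown sample against every enrolled voiceprint.
    speaker_models maps a speaker id to a fitted GaussianMixture;
    score() returns the mean log-likelihood, computed in the log
    domain to avoid underflow."""
    scores = {spk: gmm.score(features) for spk, gmm in speaker_models.items()}
    best_speaker = max(scores, key=scores.get)
    return best_speaker, scores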

B. Age Detection

Determining the caller's age, both when the identity of the person calling the emergency number is known and when it is unknown, will complement the overall profile of a caller. However, there are many aspects that have to be taken into account while developing an automatic age recognizer based on the caller's voice.

Changes in the human body with age also apply to the organs involved in speech production. Vocal folds lengthen and the larynx lowers, laryngeal cartilages ossify and calcify, and mucous glands reduce their secretions. As to the respiratory system: lung capacity decreases, the thorax stiffens and the respiratory muscles weaken. Changes in the nervous system affect speech rate and the coordination of articulators [8]. It is beyond the scope of this paper to describe all the important anatomical and physiological changes; moreover, not all of them are fully explored yet. It is, however, important to remark that there are substantial differences between the ageing of female and male speech organs. These differences apply to both the timing and the extent of this process [9]. This implies that a system for caller identification should first of all detect the caller's gender. Depending on this knowledge, the system will adjust the way it proceeds in assigning the speaker's voice to an age class. The age classes have to be defined separately for the two genders, with different class boundaries and ranges depending on the knowledge of the ageing process.
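A minimal sketch of this two-stage scheme is given below; the classifier objects and label values are hypothetical placeholders, since the paper does not fix a concrete classifier for either stage.

def assign_age_class(features, gender_clf, age_clf_female, age_clf_male):
    """Hypothetical gender-conditioned age classification: detect gender
    first, then apply the age classifier trained with gender-specific
    class boundaries."""
    gender = gender_clf.predict([features])[0]     # e.g. 'F' or 'M'
    age_clf = age_clf_female if gender == 'F' else age_clf_male
    age_class = age_clf.predict([features])[0]     # e.g. 'child', 'adult'
    return gender, age_class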

All of the anatomical and physiological changes affect the perception of the speaker's voice. However, these are not the only factors that influence a listener's perception of a speaker's age. There are several non-phonetic factors, related to the speaker, the listener, the speech sample or the task, that have an impact on age perception [8].

The collected MECC speech corpus consists of recordings that come from real emergency situations. The recordings thus contain a lot of information on the state of a caller, but this can make automatic age detection a harder task, as it can be difficult to decide whether the value of a determined acoustic parameter is correlated more with the speaker's emotions or with age.

There are several studies investigating the human ability to estimate an individual's age from different speech samples. In [10] the authors report that listeners are able to assign a general chronological age category (in this case, 7 categories with a 6-year range each) to a telephonic voice with a mean accuracy of 84%.

During the MECC corpus collection, perceptual age estimation was performed. All of the recordings were classified using one of four labels: child, teenager, adult, elderly. There were 2 listeners, one male and one female, both in their mid-twenties. Fig. 5 presents the age distribution in the MECC corpus. Most of the recordings (80%) were labeled as adults.

There are several reports in the literature concerning acoustic correlates of a speaker's voice and chronological or perceptual age. The parameters that have been studied are, inter alia: general feature variation, speech rate, intensity, fundamental frequency, jitter, shimmer, spectral energy distribution, spectral noise and formant frequencies. The conclusions on which parameters are the most important in differentiating age classes are not always consistent. More tests need to be done to learn more about the acoustic correlates of each age class. For this purpose, the authors are planning to use the English TIMIT corpus [11], as its recordings are labeled with the exact chronological age of the speaker.



Fig. 5. Age distribution in the MECC corpus (classes: child, teenager, adult, elderly; about 80% adult)


C. Emotion Detection

The challenge of automatic emotion detection in speech has earned a lot of attention in recent years, especially in the face of the growing popularity of voice interfaces [12]. In the case of a telephone conversation, vocal and lexical cues are the only modality that carries information about the speaker. The typical situation of reporting an emergency is usually accompanied by intense affect, which often impedes the information flow between caller and responder and potentially slows down the reaction to the emergency. Although the process of emotion recognition by human perception is spontaneous and immediate, its efficiency differs between individuals. The functionality of automatically detecting and displaying, in real time, changes in the caller's emotional state is meant to support the operator with an automatic evaluation of the speaker's mood changes. It might be particularly helpful in situations where a proficient exchange of information depends on a proper attitude of the responder with respect to the caller's emotional state. In the field of decoding a speaker's emotional state using vocal cues, we have greatly benefited from studies by Scherer et al. (e.g. [13], [14], [15]).

In the stage of MECC corpus preparation, the emotions perceived in the recordings were labeled in three layers of metadata:

• valence of emotion (negative, neutral or positive);
• type of emotional state (sadness, weariness, anxiety, surprise, stress, anger, frustration, calm, relief, compassion, contentment, amusement, joy);
• intensity of perceived emotion (low, typical, alternating, high).

Typically, more than one emotion appeared in a recording; therefore, tags for all emotions subjectively noticed in the recording were archived (without taking their order into account). Additional information on the speaker's emotional state is also conveyed by other tags: speech rate (slow, normal, fast), characteristic vocabulary (presence of filled pauses) and conversation style (normal or chaotic).

Fig. 6. Distribution of valence categories of emotions in the MECC corpus (negative, neutral, negative and positive, positive)

Fig. 7. Distribution of negative emotions in the MECC corpus (anxiety, sadness, stress, weariness, anger, frustration, surprise)

Examples of the distribution of emotion types in the MECC database are presented in Fig. 6 and 7. As expected, most of the emotions were negative and most often included anxiety, sadness and stress.

Presentation of the speaker's emotion on the user interface will be accomplished with a VAD (Valence-Arousal Diagram) representation (at the bottom center of Fig. 1). The currently detected emotional state will be highlighted; to reflect the previous states, a suitable trajectory will be displayed.

For analytical studies and training purposes, other corpora will also be used. We have already collected a corpus of Polish acted emotional speech [16]. The database consists of good-quality audio recordings of 12 speakers (6 male, 6 female): actors, drama students or volunteers. For each speaker, the same text content (a set of words, dialogue utterances and continuous text) was recorded, each time with one of the following emotions: joy, sadness, fear, surprise, anger, irony, and the neutral state as a reference. In total, it consists of over 3.5 hours of recordings. To measure the performance of the designed algorithms, we will also use international corpora (e.g. the Berlin EmoDB [17]) for validation.

In order to build emotion models, two types of features will be used: prosodic features, describing (in the time domain) changes of the energy and fundamental frequency (F0) of speech, accents and speech rate, as well as spectral features extracted in the frequency domain. In our previous works we studied both of them in terms of automatic emotion recognition. Prosodic features were examined in [18]. The latest experiments showed the applicability of parameters obtained from the discrete wavelet transform with a perceptual scale (PWT, [19]) and of a fusion of the wavelet and Fourier transforms (PWFT, [20]) for emotion modeling. In the final approach, a combination of these features will be applied. Modeling (with GMMs) and classification (several classifiers are to be tested) will be merged with parallel paths of speech processing dedicated to detecting other properties of a speaker.
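As a minimal sketch of the GMM-based modeling mentioned above, one model can be fitted per emotion class and an utterance assigned to the class with the highest mean log-likelihood; the scikit-learn API and the component count are our assumptions.

from sklearn.mixture import GaussianMixture

def train_emotion_models(features_by_emotion, n_components=32):
    """Fit one GMM per emotion class on its training feature frames."""
    models = {}
    for emotion, feats in features_by_emotion.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
        gmm.fit(feats)
        models[emotion] = gmm
    return models

def classify_emotion(features, models):
    """Return the emotion whose model best explains the feature frames."""
    return max(models, key=lambda emotion: models[emotion].score(features))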


D. Acoustic Background Extraction

During the conversation, the caller and the responder usually speak separately. Thus, when the emergency responder speaks, the signal coming from the person calling contains the acoustic background only. For the person speaking, it is not easy to focus on listening to this background, so an algorithm that does it automatically would increase the amount of information that a responder may gather during an emergency call.

To properly classify and identify the background in a phone call, it is necessary to separate speech from other sounds in the input signal at the beginning of the signal processing. In telephone conversations, the acoustic background level is low, and the energy of speech segments is higher than that of acoustic background segments. Speech segments also tend to contain higher frequencies than segments of acoustic background do, which is why the spectral centroid (the center of gravity of a spectrum) has greater values for speech signals. Based on the above, the choice of these two features seems reasonable; besides that, both are simple to implement. Based on [21] and [22], we developed an algorithm for the separation mentioned above.

The algorithm, presented in Fig. 8, extracts acoustic background segments from pauses in a speech recording and saves the speech and acoustic background segments to separate .wav files. At the beginning, the signal is pre-processed: normalization and pre-emphasis (high-pass filtering, noise reduction) are applied. After that, for every frame of the signal two features are calculated: the spectral centroid and the short-time energy. The vectors of feature values are smoothed with median filtering to suppress random outliers (values significantly different from their neighbours). Next, thresholds are computed (values less than the threshold imply acoustic background). To do so, histograms of the feature values are calculated and their local maxima are determined; the threshold is set as a weighted average of these maxima. The threshold is adaptive, which means it adjusts to the recording. The recording is separated into segments based on both threshold values. Speech segments are widened to avoid signal discontinuities, and an amplitude gain is applied to the acoustic background segments. Finally, the speech and acoustic background parts are saved into separate .wav files, and the speech-free segments are then identified.
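The sketch below follows the same outline: per-frame short-time energy and spectral centroid, median smoothing, and an adaptive threshold derived from the histogram maxima. The weighting of the two largest peaks and the bin count are illustrative assumptions; pre-emphasis, segment widening and file output are omitted.

import numpy as np
from scipy.signal import medfilt

def speech_background_mask(x, sr, frame_ms=20):
    """Return a boolean array, True for frames classified as speech."""
    n = int(sr * frame_ms / 1000)
    frames = x[:len(x) // n * n].reshape(-1, n)
    energy = (frames ** 2).mean(axis=1)                    # short-time energy
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    centroid = (spec * freqs).sum(axis=1) / np.maximum(spec.sum(axis=1), 1e-10)

    def adaptive_threshold(values, weight=5.0):
        """Threshold = weighted average of the two dominant histogram peaks,
        biased towards the lower (background) peak."""
        smoothed = medfilt(values, kernel_size=5)
        hist, edges = np.histogram(smoothed, bins=50)
        centers = (edges[:-1] + edges[1:]) / 2
        m1, m2 = sorted(centers[np.argsort(hist)[-2:]])
        return (weight * m1 + m2) / (weight + 1.0), smoothed

    e_thr, e_smooth = adaptive_threshold(energy)
    c_thr, c_smooth = adaptive_threshold(centroid)
    return (e_smooth > e_thr) & (c_smooth > c_thr)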

Fig. 8. Acoustic background extraction algorithm: the input voice sample is split into speech segments and acoustic background segments.

Fig. 9. Acoustic background distribution in the MECC corpus (categories: people, street, TV, dog bark, baby, music, crowd, other)

Fig. 9 presents the statistics of background occurrence in the MECC corpus. The most common backgrounds are the voices of other people and sounds coming from the city street.

V. CONCLUSION

Based on preliminary data analysis using the proposed system, we concluded that a 30-second-long speech sample stored in the database is sufficient to identify a calling person within 5 seconds. In other cases (i.e. in the absence of a speech sample in the database), a caller profile is defined by advanced signal processing techniques enabling efficient pattern matching. Automatic conclusions are based on speech signal comparisons. As a result, a caller is identified or his or her profile is described. The proposed system will be able to register and analyze acoustic effects that people may not notice in the recorded voice and in the acoustic background. The adaptive character of the system aims at improving its efficiency over time.

We introduced the support system for emergency call responders that is being developed by researchers of the Digital Signal Processing Group at AGH University of Science and Technology. Our solution will help responders identify emergency callers by the voice obtained from the phone channel. The methods that allow automatic gathering of information such as identity, emotional state, age, gender and acoustic background were presented. The system is also intended to recognize physical characteristics, the language of the speaker and the intoxication level; however, methods for the extraction of these features are still under study. The prototype of the graphical, intuitive end-user interface, introduced in this paper, was created to follow the requirements of fast performance and usage. The corpora of real audio samples collected in the Malopolska Emergency Call Centre will be used for development, training and evaluation of the described system.


ACKNOWLEDGEMENTS

The project was supported by the National Research and Development Center, grant decision no. 072/R/ID1/2013/03.

REFERENCES

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, 2010.

[2] J. P. Campbell, "Speaker recognition: A tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, 1997.

[3] J. Shen, J. Hung, and L. Lee, "Robust entropy-based endpoint detection for speech recognition in noisy environments," in ICSLP, vol. 98, 1998, pp. 232-235.

[4] H. A. Patil and K. K. Parhi, "Development of TEO phase for speaker recognition," in Signal Processing and Communications (SPCOM), 2010 International Conference on. IEEE, 2010, pp. 1-5.

[5] B. Bozkurt, T. Dutoit, and L. Couvreur, "Spectral analysis of speech signals using chirp group delay," in Progress in Nonlinear Speech Processing. Springer, 2007, pp. 41-57.

[6] J. Galka and M. Ziolko, "Wavelet parametrization for speech recognition," in Proceedings of an ISCA Tutorial and Research Workshop on Non-Linear Speech Processing NOLISP 2009, Vic, 2009.

[7] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1, pp. 19-41, 2000.

[8] S. Schotz, Perception, Analysis and Synthesis of Speaker Age. Lund University, 2006, vol. 47.

[9] P. Torre III and J. A. Barlow, "Age-related changes in acoustic characteristics of adult speech," Journal of Communication Disorders, vol. 42, no. 5, pp. 324-333, 2009.

[10] L. Cerrato, M. Falcone, and A. Paoloni, "Subjective age estimation of telephonic voices," Speech Communication, vol. 31, no. 2, pp. 107-112, 2000.

[11] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, p. 27403, 1993.

[12] S. Koolagudi and K. Rao, "Emotion recognition from speech: a review," International Journal of Speech Technology, vol. 15, no. 2, pp. 99-117, 2012.

[13] R. Banse and K. R. Scherer, "Acoustic profiles in vocal emotion expression," Journal of Personality and Social Psychology, vol. 70, no. 3, pp. 614-636, 1996.

[14] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Communication, vol. 40, no. 1-2, pp. 227-256, 2003.

[15] K. R. Scherer, R. Banse, H. Wallbott, and T. Goldbeck, "Vocal cues in emotion encoding and decoding," Motivation and Emotion, vol. 15, no. 2, pp. 123-148, 1991.

[16] M. Igras and B. Ziolko, "Baza danych nagran mowy emocjonalnej (Eng. Database of emotional speech recordings)," vol. 34, no. 2B, 2013.

[17] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," in Proceedings of Interspeech, Lisbon, 2005, pp. 1517-1520.

[18] M. Igras and W. Wszolek, "Pomiary parametrow akustycznych mowy emocjonalnej - krok ku modelowaniu wokalnej ekspresji emocji (Eng. Measurements of emotional speech acoustic parameters - a step towards vocal emotion expression modelling)," vol. 54, no. 4, 2012.

[19] M. Igras, M. Ziolko, and J. Galka, "Wavelet evaluation of speaker emotion," in Proceedings of the Eighteenth National Conference on Applications of Mathematics in Biology and Medicine, Krynica Morska, 2012, pp. 54-59.

[20] M. Ziolko, P. Jaciow, and M. Igras, "Combination of Fourier and wavelet transformations for detection of speech emotions," in 7th International Conference on Human System Interaction, accepted, 2014.

[21] M. M. Rahman and M. A.-A. Bhuiyan, "On segmentation and extraction of features from continuous Bangla speech including windowing," International Journal of Applied Research on Information Technology and Computing, vol. 2, no. 2, pp. 31-40, 2011.

[22] T. Giannakopoulos, "A method for silence removal and segmentation of speech signals, implemented in Matlab," University of Athens, Athens, 2009.