10th International Society for Music Information Retrieval Conference (ISMIR 2009)

AUTOMATIC IDENTIFICATION FOR SINGING STYLE BASED ON SUNG MELODIC CONTOUR CHARACTERIZED IN PHASE PLANE

Tatsuya Kako†, Yasunori Ohishi‡, Hirokazu Kameoka‡, Kunio Kashino‡, Kazuya Takeda†

†Graduate School of Information Science, Nagoya University
‡NTT Communication Science Laboratories, NTT Corporation

[email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT

A stochastic representation of singing styles is proposed. The dynamic property of the melodic contour, i.e., the fundamental frequency (F0) sequence, is assumed to be the main cue for singing styles because it can characterize such typical ornamentations as vibrato. F0 signal trajectories in the phase plane are used as the basic representation. By fitting Gaussian mixture models to the observed F0 trajectories in the phase plane, a parametric representation is obtained as a set of GMM parameters. The effectiveness of our proposed method is confirmed through an experimental evaluation in which 94.1% accuracy for singer-class discrimination was obtained.

1. INTRODUCTION

Although no firm definition has yet been established for "singing style" in music information processing research, several studies have reported the relationship between singing styles and such signal features as the singing formant [1, 2] and singing ornamentations. Various research efforts have been made to characterize ornamentations by the acoustic properties of the sung melody, i.e., vibrato [3–11], overshoot [12], and fine fluctuation [13]. The importance of such melodic features for perceiving singer individuality was also reported in [14], based on psycho-acoustic experiments, which concluded that the average spectrum and the dynamic properties of the F0 sequence affect the perception of individuality. These studies suggest that singing style is related to the local dynamics of a sung melody, which do not carry any musical information. Therefore, in this study, we focus on the local dynamics of the F0 sequence, i.e., the melodic contour, as a cue for singing style and propose a parametric representation as a model of singing styles.

On the other hand, very few application systems have been reported that use the local dynamics of a sung melody. A singer recognition experiment using vibrato was reported in [15], and a method for evaluating singing skill through spectrum analysis of the F0 contour was reported in [16]. Although these studies use the local dynamics of the melodic contour as a cue for ornamentation, no systematic method has been proposed for characterizing singing styles. A lag-system model for typical ornamentations was reported in [14, 17–19]; however, variation among singing styles was not discussed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2009 International Society for Music Information Retrieval.

In this paper, we propose a stochastic phase plane as a graphical representation of singing styles and show its effectiveness for singing-style discrimination. One merit of this representation for characterizing singing style is that, since it requires neither an explicit detection function for ornamentations such as vibrato nor estimation of the target note, it is robust across sung melodies.

In a previous paper [20], we applied this graphical representation of the F0 contour in the phase plane to a query-by-humming system and neutralized the local dynamics of the F0 sequence so that only musical information was used for the query. In contrast, in this study, we use the local dynamics of the F0 sequence for modeling singing styles and disregard the musical information, because musical information and singing style are in a dual relation.

In this paper, we also evaluate the proposed representation through a singer-class discrimination experiment, in which we show that our proposed model can extract the dynamic properties of sung melodies shared by a group of singers.

In the next section, we propose the stochastic phase plane (SPP) as a stochastic representation of the melodic contour and show how singing ornamentations are modeled by the proposed SPP. In Section 3, we experimentally show the effectiveness of our proposed method through singer-class discrimination experiments. Section 4 discusses the obtained results and concludes this paper.

2. STOCHASTIC REPRESENTATION OF THE DYNAMICAL PROPERTY OF MELODIC CONTOUR

2.1 F0 Signal in the Phase Plane

Such ornamental expressions in singing as vibrato are characterized by the dynamical properties of their F0 signal. Since the F0 signal is a controlled output of the human speech production system, its basic dynamical characteristics can be related to a differential equation. Therefore, we can use the phase plane, which is the joint plot of a variable and its time derivative, i.e., (x, ẋ), to depict its dynamical properties.
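As an illustration, the following minimal sketch (assuming NumPy and Matplotlib are available; the 6 Hz vibrato rate, 100-cent peak-to-peak depth, and 10 ms frame shift are illustrative values, not taken from the paper) plots a synthetic vibrato contour in the phase plane, where it traces a closed orbit around the target note:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic melodic contour: a held note at 5000 cent with 6 Hz vibrato
# of 100 cent peak-to-peak depth (illustrative values only).
frame_rate = 100.0                      # frames per second (10 ms shift)
t = np.arange(0.0, 3.0, 1.0 / frame_rate)
f0 = 5000.0 + 50.0 * np.sin(2.0 * np.pi * 6.0 * t)

# Approximate the time derivative with central differences.
df0 = np.gradient(f0, 1.0 / frame_rate)

# Phase plane (x, x_dot): vibrato appears as a closed orbit
# centered on the target note.
plt.plot(f0, df0)
plt.xlabel("F0 [cent]")
plt.ylabel("dF0/dt [cent/s]")
plt.show()
```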


[Figure 1. Melodic contour (top) and corresponding phase planes for F0–ΔF0 (middle) and F0–ΔΔF0 (bottom), for a classical (female) singer; F0 in cents vs. time in seconds.]

Although the signal sequence is given not as an explicit function of time, F0(t), but as a sequence of numbers, {F0(n)}, n = 1, ..., N, we can estimate the time derivative using the delta coefficient given by

$$\Delta F_0(n) = \frac{\displaystyle\sum_{k=-K}^{K} k \cdot F_0(n+k)}{\displaystyle\sum_{k=-K}^{K} k^2}, \qquad (1)$$

where 2K is the window length for calculating the dynamics. Changing the window length extracts different aspects of the signal's properties.
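As a concrete sketch of Eq. (1), the delta coefficients can be computed as follows (a hypothetical NumPy helper; edge frames are handled by edge padding, a choice the paper does not specify):

```python
import numpy as np

def delta(f0: np.ndarray, K: int = 2) -> np.ndarray:
    """Delta coefficients of Eq. (1): the regression slope of F0 over a
    window of +/-K frames around each frame n (edges are edge-padded)."""
    denom = sum(k * k for k in range(-K, K + 1))    # sum_k k^2
    padded = np.pad(f0, K, mode="edge")
    out = np.zeros(len(f0))
    for k in range(-K, K + 1):
        out += k * padded[K + k : K + k + len(f0)]  # sum_k k * F0(n + k)
    return out / denom

# Second-order dynamics are obtained by applying the operator twice:
# ddf0 = delta(delta(f0, K=2), K=2)
```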

An example of such a plot for a given melodic contour is shown in Fig. 1. Here, the F0 signal (top), the phase plane (middle), and the second-order phase plane, which is given by the joint plot of F0 and ΔΔF0 (bottom), are plotted. Singing ornamentations appear as the local behavior of the trajectory around centroids that typically represent target musical notes. Vibrato in singing, for example, appears as circular trajectories centered at target notes. In the second-order plane, the trajectories appear as lines with a slope of −45 degrees, which shows that the relationship between F0 and ΔΔF0 is given as

$$\Delta\Delta F_0 = -F_0. \qquad (2)$$

Hence, a sinusoidal component is superimposed on the given signal: for a sinusoidal fluctuation F0(t) = A sin(ωt), the second derivative is −ω²F0(t), i.e., proportional to −F0, so the trajectory collapses onto a straight line through the target note. Overshoots and undershoots at the target note are represented as spiral patterns around the note.

2.2 Stochastic Representation of the Phase Plane

[Figure 2. Gaussian mixture model fitted to an F0 contour in the phase plane (pop, female); F0 [cent] vs. ΔF0.]

Once a singing style is represented as a phase-plane trajectory, parameterizing the representation becomes an issue for further engineering applications. Since the F0 signal is not deterministic, i.e., it varies across singing behaviors, a stochastic model must be defined for the parameterization. By fitting a parametric probability density function to the trajectories in the phase plane, we can build a stochastic phase plane (SPP) and use it for characterizing the melodic contour. A common feature of the trajectories in the phase plane is that most of their segments are distributed around the target notes; the distribution is therefore multimodal, but each mode can be represented by a simple symmetric two- or three-dimensional pdf. Therefore, a Gaussian mixture model (GMM),

$$\sum_{m=1}^{M} \lambda_m \, \mathcal{N}(\mathbf{f}_0(n);\ \boldsymbol{\mu}_m, \Sigma_m), \qquad (3)$$

where

$$\mathbf{f}_0(n) = [F_0(n),\ \Delta F_0(n),\ \Delta\Delta F_0(n)]^T, \qquad (4)$$

is adopted for the modeling. Here $\mathcal{N}(\cdot)$ denotes a Gaussian distribution, and

$$\Theta = \{\lambda_m, \boldsymbol{\mu}_m, \Sigma_m\}_{m=1,\cdots,M} \qquad (5)$$

are the parameters of the model, representing the mixture weight (relative frequency), the mean vector, and the covariance matrix of each Gaussian, respectively.

A GMM trained on F0 contours in the phase plane is depicted in Fig. 2. A smooth surface is obtained through model fitting. The horizontal deviations of each Gaussian represent the stability of the melodic contour around the target note, while the vertical deviations represent the vibrato depth. In this manner, singing styles can be modeled by the set of parameters Θ of the stochastic phase plane.
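A minimal sketch of this model fitting, using scikit-learn's GaussianMixture as a stand-in for the paper's unspecified EM implementation, and a synthetic contour in place of a real recording:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic contour standing in for an estimated melodic contour [cent].
t = np.arange(0.0, 10.0, 0.01)
f0 = 5000.0 + 50.0 * np.sin(2.0 * np.pi * 6.0 * t)
df0 = np.gradient(f0, 0.01)
ddf0 = np.gradient(df0, 0.01)

# Feature vectors f0(n) = [F0(n), dF0(n), ddF0(n)]^T of Eq. (4).
X = np.column_stack([f0, df0, ddf0])

# Fit the M-component mixture of Eq. (3); Theta of Eq. (5) is then
# (weights_, means_, covariances_) = (lambda_m, mu_m, Sigma_m).
spp = GaussianMixture(n_components=8, covariance_type="full",
                      random_state=0).fit(X)
```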

2.3 Examples of Stochastic Phase Plane

In Fig. 3, the F0 signals of three female singers are plotted: a professional classical singer, a professional pop singer, and an amateur. A deep vibrato is observed as large vertical deviations of the Gaussians in the professional classical singer's plot. On the other hand, the amateur's plot is characterized by large horizontal deviations. Although deep vibrato is not observed in the plot for the professional pop singer, its small horizontal deviations show that she sang the melody accurately.

[Figure 3. Stochastic phase plane models for professional classical (top), professional pop (middle), and amateur (bottom) female singers; F0 [cent] vs. ΔF0.]

Table 1. Signal analysis conditions for F0 estimation. Harmonic PSD pattern matching [21] is used with these parameters.

Signal sampling freq.         16 kHz
F0 estimation window length   64 ms
Window function               Hanning window
Window shift                  10 ms
F0 contour smoothing          50 ms MA filter
Δ coefficient calculation     K = 2

The stochastic representations of the second-order phase plane are also shown in Fig. 4. Strong negative correlations between F0 and ΔΔF0 can be found only in the plot for the professional classical singer, which also indicates deep vibrato in the singing style.

3. EXPERIMENTAL EVALUATION

The effectiveness of using the SPP to discriminate different singing styles is evaluated experimentally.

3.1 Experimental Setup

Singing signals from six singers were used: one singer of each gender in each of the categories professional classical, professional pop, and amateur. Each subject sang songs with Japanese lyrics and also hummed them, both with and without musical accompaniment. The songs were "Twinkle, Twinkle, Little Star", "Ode to Joy", and five etudes. In total, 102 song signals were recorded.

The F0 contour was estimated using the method of [21]. The signal processing conditions for calculating the F0, ΔF0, and ΔΔF0 contours are listed in Table 1.

[Figure 4. Second-order stochastic phase plane models (F0–ΔΔF0) for professional classical (top), professional pop (middle), and amateur (bottom) female singers.]

Since the absolute pitch of the song signals differs across singers, we normalized them so that only the singing style of each singer is used in the experiment. Normalization was done by the following procedure. First, the F0 frequency in [Hz] is converted to [cent] by

$$1200 \times \log_2 \frac{F_0}{440 \times 2^{3/12-5}}\ \ \text{[cent]}. \qquad (6)$$

Then the local deviations from the equal-tempered scale are calculated by the modulo operation mod(·):

$$\mathrm{mod}(F_0 + 50,\ 100). \qquad (7)$$

Obviously, after this conversion, the F0 value is limited to (0, 100) in [cent].
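A sketch of the normalization of Eqs. (6) and (7), assuming NumPy; normalize_f0 is a hypothetical helper name:

```python
import numpy as np

def normalize_f0(f0_hz: np.ndarray) -> np.ndarray:
    """Eq. (6): convert Hz to cent relative to 440 * 2^(3/12 - 5) Hz,
    then Eq. (7): fold onto a single semitone so that only the local
    deviation from the tempered scale remains, in [0, 100) cent."""
    ref = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)
    cent = 1200.0 * np.log2(f0_hz / ref)
    return np.mod(cent + 50.0, 100.0)

# A4 = 440 Hz sits exactly on a tempered note, so it maps to the
# center of the folded range:
print(normalize_f0(np.array([440.0])))  # -> [50.]
```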

3.2 Discrimination Experiment

The discrimination of three singer classes, i.e., professional classical, professional pop, and amateur, was performed based on the maximum a posteriori (MAP) decision:

$$\hat{s} = \arg\max_s\ p(s \mid \{F_0, \Delta F_0, \Delta\Delta F_0\}) = \arg\max_s \left[ \frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{f}_0(n) \mid \Theta_s) + \log p(s) \right] \qquad (8)$$

where s is the singer-class ID, Θs denotes the model parameters of the s-th singer class, and N is the length of the test signal in samples. We used "Twinkle, Twinkle, Little Star" and the five etudes sung by singers from each singer class for training, and "Ode to Joy" sung by the same singers for testing. The results are therefore independent of the sung melodies but closed with respect to singers. Since we assumed an equal a priori probability for the singer-class distribution p(s), the above MAP decision is equivalent to a maximum likelihood decision.
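A sketch of the decision rule of Eq. (8) under the equal-prior assumption, where it reduces to picking the class whose SPP-GMM gives the highest average log-likelihood; classify and the class labels are hypothetical names, and scikit-learn's GaussianMixture.score (the mean per-sample log-likelihood) supplies the inner sum:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classify(X: np.ndarray,
             class_models: dict[str, GaussianMixture]) -> str:
    """Eq. (8) with equal priors p(s): pick the singer class whose
    model maximizes (1/N) sum_n log p(f0(n) | Theta_s).  X is an
    (N, 3) array of [F0, dF0, ddF0] feature vectors; score() returns
    exactly this mean per-sample log-likelihood."""
    scores = {s: gmm.score(X) for s, gmm in class_models.items()}
    return max(scores, key=scores.get)

# Usage with hypothetical per-class SPP-GMMs trained as sketched above:
# label = classify(X_test, {"classical": gmm_c, "pop": gmm_p,
#                           "amateur": gmm_a})
```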

3.3 Results

Fig. 5 shows the accuracy of the singer-class discrimination. The best accuracy is attained for a 13-second input signal: the accuracy increases with the length of the test signal, and 94.1% is attained with an 8-mixture GMM for the singer-class models when a 13-second signal is available as the test input. No significant improvement in accuracy was found for longer test inputs because more song-dependent information contaminated the test signal.

[Figure 5. Accuracy in discriminating three singer classes vs. test signal length (5–15 sec), for GMMs with M = 8, 16, and 32 mixtures.]

Fig. 6 compares the accuracy of singer-class discrimination using three sets of features: F0 only, (F0, ΔF0), and (F0, ΔF0, ΔΔF0). As shown in the figure, by combining F0 and ΔF0, the discrimination error rate becomes half of the error obtained when using F0 only. Combining the second-order derivative ΔΔF0 further reduces the error, though not by as much as adding ΔF0. These results show that the proposed stochastic representation of the phase plane effectively characterizes the singing styles of the three singer classes.

[Figure 6. Comparison of discrimination accuracy for the three feature sets: F0; (F0, ΔF0); and (F0, ΔF0, ΔΔF0).]

4. DISCUSSION

Our proposed method for representing and parameterizing the F0 contour effectively discriminates the three typical singer classes, i.e., professional classical, professional pop, and amateur. To confirm that the method models singing styles (and not singer individuality), we compared our proposed representation with MFCC under two conditions. As a closed condition, we trained three MFCC-GMMs using "Twinkle, Twinkle, Little Star" and the five etudes sung by the six singers (male and female professional classical, professional pop, and amateur) and used "Ode to Joy" sung by the same singers for testing. As an open condition, on the other hand, we evaluated the MFCC-GMMs in a singer-independent manner, where the singer-class models (GMMs) were trained on the female singers' data and tested on the male singers' data. As shown in Fig. 7, the performances of the MFCC-GMM and the proposed method are almost identical (95.0%) in the closed condition. However, in the new (unseen) singer experiment, the accuracy of the MFCC-GMM system degraded significantly to 33.3%, whereas the proposed method attained 87.9%. These results suggest that the MFCC-GMM system does not model singing style but rather discriminates singer individuality. Since the SPP-GMM can correctly classify even an unseen singer's data, our proposed representation models the F0 dynamic characteristics common within a singer class rather than singer individuality.

[Figure 7. Comparison of the proposed representation (F0, ΔF0, ΔΔF0) with MFCC and (MFCC, ΔMFCC) under the closed and open conditions.]

5. SUMMARY

In this paper, we proposed a model of singing styles based on a stochastic graphical representation of the local dynamical properties of the F0 sequence. Since various singing ornamentations are related to signal production systems described by differential equations, the phase plane is a reasonable space for depicting singing styles. Furthermore, the Gaussian mixture model effectively parameterizes this graphical representation; as a result, more than 90% accuracy was achieved in discriminating the three classes of singers.

Since the scale of the experiments was small, increasing the number of singers and singer classes is important future work. Evaluating the robustness of the proposed method to noisy F0 sequences estimated under such realistic singing conditions as "karaoke" is also a necessary step toward building real-world application systems.

6. REFERENCES

[1] J. Sundberg, The Science of the Singing Voice. Northern Illinois University Press, 1987.

[2] J. Sundberg, "Singing and timbre," Music Room Acoustics, vol. 17, pp. 57–81, 1977.

[3] C. E. Seashore, "A musical ornament, the vibrato," in Psychology of Music. McGraw-Hill Book Company, 1938, pp. 33–52.

[4] J. Large and S. Iwata, "Aerodynamic study of vibrato and voluntary 'straight tone' pairs in singing," J. Acoust. Soc. Am., vol. 49, no. 1A, p. 137, 1971.

[5] H. B. Rothman and A. A. Arroyo, "Acoustic variability in vibrato and its perceptual significance," J. Voice, vol. 1, no. 2, pp. 123–141, 1987.

[6] D. Myers and J. Michel, "Vibrato and pitch transitions," J. Voice, vol. 1, no. 2, pp. 157–161, 1987.

[7] J. Hakes, T. Shipp, and E. T. Doherty, "Acoustic characteristics of vocal oscillations: Vibrato, exaggerated vibrato, trill, and trillo," J. Voice, vol. 1, no. 4, pp. 326–331, 1988.

[8] C. d'Alessandro and M. Castellengo, "The pitch of short-duration vibrato tones," J. Acoust. Soc. Am., vol. 95, no. 3, pp. 1617–1630, 1994.

[9] D. Gerhard, "Pitch track target deviation in natural singing," in Proc. ISMIR, 2005, pp. 514–519.

[10] K. Kojima, M. Yanagida, and I. Nakayama, "Variability of vibrato: a comparative study between Japanese traditional singing and bel canto," in Proc. Speech Prosody, 2004, pp. 151–154.

[11] I. Nakayama, "Comparative studies on vocal expressions in Japanese traditional and Western classical-style singing, using a common verse," in Proc. ICA, 2004, pp. 1295–1296.

[12] G. de Krom and G. Bloothooft, "Timing and accuracy of fundamental frequency changes in singing," in Proc. ICPhS, 1995, pp. 206–209.

[13] M. Akagi and H. Kitakaze, "Perception of synthesized singing voices with fine fluctuations in their fundamental frequency contours," in Proc. ICSLP, 2000, pp. 458–461.

[14] T. Saitou, M. Goto, M. Unoki, and M. Akagi, "Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices," in Proc. WASPAA, 2007, pp. 215–218.

[15] T. L. Nwe and H. Li, "Exploring vibrato-motivated acoustic features for singer identification," IEEE Transactions on Audio, Speech, and Language Processing, pp. 519–530, 2007.

[16] T. Nakano, M. Goto, and Y. Hiraga, "An automatic singing skill evaluation method for unknown melodies using pitch interval accuracy and vibrato features," in Proc. Interspeech, 2006, pp. 1706–1709.

[17] H. Mori, W. Odagiri, and H. Kasuya, "F0 dynamics in singing: Evidence from the data of a baritone singer," IEICE Trans. Inf. and Syst., vol. E87-D, no. 5, pp. 1086–1092, 2004.

[18] N. Minematsu, B. Matsuoka, and K. Hirose, "Prosodic modeling of Nagauta singing and its evaluation," in Proc. Speech Prosody, 2004, pp. 487–490.

[19] L. Regnier and G. Peeters, "Singing voice detection in music tracks using direct voice vibrato," in Proc. ICASSP, 2009, pp. 1685–1688.

[20] Y. Ohishi, M. Goto, K. Itou, and K. Takeda, "A stochastic representation of the dynamics of sung melody," in Proc. ISMIR, 2007, pp. 371–372.

[21] M. Goto, K. Itou, and S. Hayamizu, "A real-time filled pause detection system for spontaneous speech recognition," in Proc. Eurospeech, 1999, pp. 227–230.
