
Laminae: A stochastic modeling-based autonomous performance rendering system that elucidates performer characteristics

Kenta Okumura, Nagoya Institute of Technology, [email protected]

Shinji Sako, Nagoya Institute of Technology, [email protected]

Tadashi Kitamura, Nagoya Institute of Technology, [email protected]

ABSTRACT

This paper proposes a system for performance rendering on keyboard instruments. The goal is a fully autonomous rendition that is musically smooth yet loses none of the characteristics of the actual performer. The system is based on a method that systematizes combinations of constraints and thereby elucidates the performer's rendering process, by defining stochastic models that associate the artistic deviations observed in a performance with the contextual information notated in its musical score. The proposed system searches for a sequence of optimum cases among the combinations of all cases observed in existing performances in order to render an unseen performance efficiently. Evaluations indicate that the musical features expected from the existing performances are transcribed appropriately into the performances rendered by the system, and that the system renders performances with stably natural expression, even for compositions in unconventional styles. Consequently, performances rendered by the proposed system have won first prize in the autonomous section of a performance rendering contest for computer systems.

1. INTRODUCTION

In recent years, several autonomous systems for automatic performance rendering have been proposed [1, 2]. Their main motivation is the elucidation of existing performances and the realization of a virtual performer [3, 4]. Such systems generally control the rules that determine performance expression without requiring interaction with the user during the rendering process. Our focus is on rendering performances without losing any of the characteristics of the human performer, that is, on replicating those characteristics. One of the most rational ideas for achieving this is to relate the expression contained in segmented cases of performances by human virtuosi to the information that describes the conditions under which they were performed.

The typical method of handling the expression contained in each case is to transcribe the statistical trend over sections of the accumulated cases [5–7]. The advantage of this method is that unnatural expressions are less likely to occur in the rendered performance. However, it does not necessarily reproduce the performer's characteristics faithfully, since the features originally provided in the cases are smoothed out by the statistics.

Copyright: © 2014 Kenta Okumura et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conversely, there is a method that directly transcribes the expression of one particular case among those that have been accumulated [8–10]. This method is better suited to faithful reproduction of the performer's characteristics because it reliably retains the features of the cases. Its problem, however, is that the rendered performances may lose naturalness, since they are produced by connecting cases that are not contiguous in the original performance. In the existing methods, the rules used to select cases are not optimized for the composition whose performance the system is to render, because they are constructed from the compositions originally performed by the performer. To solve this problem, we propose a method that searches for the optimum case from which to transcribe the expression among a set of alternatives augmented by relaxing a strict selection rule. This is done on the assumption that a case whose expression can render a more natural performance may exist among the cases that are never selected because they do not strictly accord with the selection rule.

The information that describes the conditions under which a case was originally performed must be elucidated with generality in order to select the optimum case for every direction when rendering a performance. Most existing autonomous systems require, as input, information related to the performer's interpretation of the composition. However, it is difficult to acquire rules that accurately describe this relationship, even when it is analyzed by experts. Moreover, explaining the relationship with generality is difficult even for the performer, because the interpretation itself fluctuates [11]. We therefore consider an approximate description of the relationship using a combination of simple information obtained uniquely from the score, rather than the performer's higher-order interpretation. We previously proposed a method that systematically associates these relationships without using such unstable information, under the assumption that the performer's behavior exhibits tendencies that depend on the context of the performance directions locally derivable from the score [12]. That method eliminates any dependency on information other than the performance itself, because it uses no information containing the fluctuations mentioned above. The essence of the problem that the method resolves, classifying the cases of existing performances based on information from the score, is congruent with our proposal.

2. METHODS CONSTITUTING THE SYSTEM

Performers interpret the directions $S = (s_1, \dots, s_M)$ notated in the given score and render the performance sequence $\hat{R} = (\hat{r}_1, \dots, \hat{r}_N)$ by applying their intended expression. Assuming a sequence of strict directions $\hat{S} = (\hat{s}_1, \dots, \hat{s}_N)$ that represents the contents of the performance, the applied expressions are observed as sequences $D = (d_1, \dots, d_N)$ for the factors $F = (\mathrm{AT}, \mathrm{GR}, \mathrm{DR}, \mathrm{BR})$ between $\hat{S}$ and $\hat{R}$, as follows:



[Figure 1 here: for a given score with voice parts (bass, mid-bass, mid-treble, treble), strict directions $\hat{s}_{n-1}, \hat{s}_n, \hat{s}_{n+1}$ and the performance $\hat{r}_{n-1}, \hat{r}_n, \hat{r}_{n+1}$ yield local contextual information for each note (e.g. treble_Part, e5, 16th_Note, forte, 4/18_bars, 3/6_beat) feeding context-dependent models for note (feature quantity of Attack Timing, $d_n^{\mathrm{AT}}$), and local contextual information for each beat (e.g. onset_Pattern=1110/1010, 7_notes, 4/18_bars, 3/6_beat) feeding context-dependent models for beat (feature quantity of Local BPM Ratio, $d_n^{\mathrm{BR}}$).]

Figure 1: Formation of context-dependent models.

$D^{\mathrm{AT}}$ (Attack Timing): Timing of the key strike, in beats per quarter note.

$D^{\mathrm{GR}}$ (Gatetime Ratio): Ratio of the time a key is depressed in the performance to the note's length in the score. If the performed length is shorter than the score's instruction, the value is less than one.

$D^{\mathrm{DR}}$ (Dynamics Ratio): Keying dynamics as a ratio of the notated dynamics. The value is acquired in the same manner as $D^{\mathrm{GR}}$.

$D^{\mathrm{BR}}$ (Local BPM Ratio): Ratio of the beat's BPM to the average BPM of the performance.

These are the main ingredients of performance expression, as exercised in operating the instrument under the performer's artistic intention and physical constraints [13]. We also observe the differences between consecutive feature quantities, since the rendering of the various quantities is believed to depend on the tendency of the preceding direction. For a performance $\hat{r}_n$ and its direction $\hat{s}_n$, the feature quantities and their differences for the factors $F$ are extracted by the following equations:

$$d_n^F = \begin{cases} \hat{r}_n^F - \hat{s}_n^F, & F = \mathrm{AT} \\ \hat{r}_n^F / \hat{s}_n^F, & F \in \{\mathrm{GR}, \mathrm{DR}, \mathrm{BR}\}, \end{cases} \qquad (1)$$

$$d_n^{\Delta F} = d_n^F - d_{n-1}^F, \qquad F \in \{\mathrm{AT}, \mathrm{GR}, \mathrm{DR}, \mathrm{BR}\}. \qquad (2)$$
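For illustration, a minimal Python sketch of Equations (1) and (2); the dictionary fields onset, gate, dyn, and bpm are hypothetical stand-ins for the score and performance attributes, and the note-level factors (AT, GR, DR) and the beat-level factor (BR) are lumped together here for brevity:

```python
# Sketch of Eqs. (1) and (2); s_hat is one strict direction, r_hat its
# performed counterpart. Field names are illustrative, not from the paper.

def extract_features(s_hat, r_hat, prev_d=None):
    """Return d_n (Eq. 1) and its differential against the previous case (Eq. 2)."""
    d = {
        "AT": r_hat["onset"] - s_hat["onset"],  # attack timing: a difference
        "GR": r_hat["gate"] / s_hat["gate"],    # gatetime ratio
        "DR": r_hat["dyn"] / s_hat["dyn"],      # dynamics ratio
        "BR": r_hat["bpm"] / s_hat["bpm"],      # beat BPM over average BPM
    }
    # Assumption: the first case has no predecessor, so its differential is zero.
    d_delta = ({f: d[f] - prev_d[f] for f in d}
               if prev_d is not None else dict.fromkeys(d, 0.0))
    return d, d_delta
```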

Even in a performance based on the score, another series of cases arises if a trigger note whose direction calls for the insertion of notes, such as a trill, exists in the vicinity. The following sequences of information $X = (x_1, \dots, x_N)$ are recorded to account for the general possibility that the number of cases differs from the number of directions, i.e. $M \neq N$:

$X^{\mathrm{PS}}$ (Pitch Shift): Integer distance from the pitch directed by the score. The value is usually zero.

$X^{\mathrm{KS}}$ (Key Strokes): Number of notes performed for the corresponding note in the score. The value is usually one.

This information makes it possible to associate plural cases with a performance direction $s_m$. If the optimum series of cases $\hat{V} = (\hat{v}_1, \dots, \hat{v}_M)$ for performing the score sequence $S$ is determined by searching for qualifying candidate cases with the method discussed later, the system can render a performance sequence $V = (v_1, \dots, v_N)$ that accommodates the possibility of such a mismatch by referring to the information in $x_m$ corresponding to each $\hat{v}_m$.

[Figure 2 here: a decision tree whose parent (root) node and child (middle) nodes apply yes/no questions about context to all context-dependent models, e.g. "Is part of current note melody?", "Is notation of current note slur?", "Is octave of current note lower than 4th?", "Is syllable of current note higher than V?", "Is local pos. of preceding note later than 50%?", "Is type of succeeding note dotted quarter?", "Is beam of succeeding note continue?", leading to leaves $c_1, c_2, \dots, c_L$.]

Figure 2: Systematization of context-dependent models.

2.1 Modeling and systematization of the cases

In this proposal, cases from existing performances are made selectable using only the performance direction information available from the score. Context-dependent models are defined for each case to describe the relationship between the feature quantities and the strict direction (Figure 1). The tendency of the $G$ factors of feature quantities and differences in the case of $\hat{r}_n$, based on $\hat{s}_n$, is regarded in this model as a multivariate normal distribution with the probability density function shown in the following equation:

$$P(\boldsymbol{d}_n \mid \boldsymbol{\mu}_n, \boldsymbol{\sigma}_n) = \prod_{f \in F} P\!\left(d_n^f \mid \mu_n^f, \sigma_n^f\right) = \frac{\exp\left(-\sum_{f \in F} \frac{(d_n^f - \mu_n^f)^2}{2\sigma_n^f}\right)}{\sqrt{(2\pi)^G \prod_{f \in F} \sigma_n^f}} \qquad (3)$$

$$\begin{cases} F = (\mathrm{AT}, \mathrm{GR}, \mathrm{DR}, \Delta\mathrm{AT}, \Delta\mathrm{GR}, \Delta\mathrm{DR}), & G = 6, \text{ for note} \\ F = (\mathrm{BR}, \Delta\mathrm{BR}), & G = 2, \text{ for beat.} \end{cases}$$
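A minimal sketch of this density, assuming dictionaries keyed by factor name and, per Equation (3), sigma holding one variance per factor so the joint density factorizes:

```python
import math

# Sketch of the diagonal multivariate normal of Eq. (3) over the G factors
# of one case; factors are treated as independent.

def case_likelihood(d, mu, sigma):
    """d, mu, sigma: dicts keyed by factor (e.g. AT, GR, DR, dAT, dGR, dDR)."""
    G = len(d)
    exponent = -sum((d[f] - mu[f]) ** 2 / (2.0 * sigma[f]) for f in d)
    norm = math.sqrt((2.0 * math.pi) ** G * math.prod(sigma[f] for f in d))
    return math.exp(exponent) / norm
```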

The free parameters for each variable of the feature factors are reduced by regarding the factors as independent. They are presumably interdependent within the performer's individuality; however, determining the shape this dependence takes is difficult, and problems of interpretation also arise.

The combination of contextual information derived from $\hat{s}_{n-1}$, $\hat{s}_n$, $\hat{s}_{n+1}$ is associated with the model, based on the assumption that the local context around a direction contributes to the rendering of the feature quantities. For directions about notes, various types of information derivable from the score have already been validated as contextual factors [12]. They chiefly concern the harmony, which can be regarded as a series consisting of multiple voices and accompaniment, and the main melody and sub-melodies. According to the orientation of the note stems and the positional relationships within chords, each voice part can be determined automatically and uniquely. Therefore, $d_{n-1}$ and $d_{n+1}$ for $d_n$ are defined with consideration of the structure of the voices and chords. In the case of the beat, on the other hand, the quantity of information written within the one-beat range that forms the observation unit of $d_n$ varies constantly across the score. With models of each beat, directions about rhythm are therefore associated as quantized keying patterns for each voice, together with their density, in addition to global information about the composition.
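As a concrete illustration, the contextual label attached to a case might resemble the following sketch; the field names mirror the examples shown in Figure 1 and are not a published schema:

```python
# Hypothetical contextual label for one note-level case, modeled on the
# examples in Figure 1 (field names are illustrative).
note_context = {
    "part": "treble_Part",        # voice part from stem orientation / chord position
    "pitch": "e5",
    "note_value": "16th_Note",
    "dynamics": "forte",
    "bar_position": "4/18_bars",  # position of the bar within the composition
    "beat_position": "3/6_beat",  # position within the bar
}

# Beat-level models instead carry quantized rhythm directions per voice.
beat_context = {
    "onset_pattern": "1110/1010",  # quantized keying pattern within the beat
    "notes_in_beat": 7,
    "bar_position": "4/18_bars",
    "beat_position": "3/6_beat",
}
```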

Refinement of the models with a variety of contextual information is required to obtain context-dependent models that can uniquely describe the rendering process of any case. However, existing performances, and the cases derivable from them, are limited. This means that acquiring models able to correlate all possible contextual information is effectively impossible. A solution that systematizes sharing rules is therefore desirable, so that one of the finite context-dependent models can serve as an alternative even for unseen contextual information.


[Figure 3 here: a decision tree of systematized context-dependent models for a given score $S$. The root node contains statistics of all training data; a middle node contains statistics of the training data of the branches below it and is the confluence of backing-off paths, e.g. for leaf $c_{m,1}$ and leaf $c_{m,l}$. The leaf corresponding to the contextual label of $s_m$ is $c_{m,1}$; another leaf belonging to middle node $b_{m,l}$ contains another style of expression. A backing-off path leads from $c_{m,1}$ toward the root.]

Figure 3: Decision tree backing-off concept.


By classifying all context-dependent models using tree-based clustering [14], a decision tree can be constructed (Figure 2). The structure of the tree elucidates how a case with some particular trend in the performance can be rendered from a combination of contextual information. The classified context-dependent model of each case is arranged individually at a terminal leaf node, while questions about the contextual information serve as classification criteria and are stored in the intermediate nodes. The leaf node of the case with the most similar feature quantities can be reached by tracing the intermediate nodes of the tree from the root, answering each question in turn. We believe this method effectively identifies known cases with appropriate expression for the contextual information of an unknown composition.
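A minimal sketch of this descent, assuming a simple node structure (question predicates at intermediate nodes, a context-dependent model at each leaf); the example question is taken from Figure 2:

```python
# Sketch of tracing the clustered decision tree (Figure 2). The node layout
# is an assumption for illustration, not the paper's data structure.

class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # predicate: context dict -> bool
        self.yes, self.no = yes, no
        self.model = model        # set only at leaf nodes

def find_leaf(root, context):
    """Descend from the root, answering each question, until a leaf is reached."""
    node = root
    while node.model is None:
        node = node.yes if node.question(context) else node.no
    return node

# Example question, mirroring Figure 2: "Is part of current note melody?"
is_melody = lambda ctx: ctx.get("part") == "melody"
```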

2.2 Selecting cases for rendering performance

Because the optimization criteria of the tree structure depend on the tendencies of the feature quantities and on the definition of the contextual information, acquiring the optimum tree structure for rendering the performance of an unseen composition is extremely difficult, and this is an issue for the proposal. The corresponding leaf node identified by descending the structure with reference to the contextual information is therefore not necessarily optimal for the performance to be rendered. From the analyses obtained in our prior study [12], the tendencies of nodes located near the root of the tree have relatively high versatility and can be explained in common, whereas nodes near each leaf are specialized to their particular trends. The target of the search for an optimum case should thus be a subset around the corresponding leaf, and that subset can be augmented by decision-tree backing-off [15]. Candidate cases for the search are gradually augmented from the leaf node $c_{m,1}$ that corresponds to the contextual information of the $m$th direction $s_m$ of $S$ (Figure 3).

The sequence $\hat{V}$ is assumed to be optimum for rendering the performance of $S$. Each $\hat{v}_m$ is selected from the candidate cases $C_m = (c_{m,1}, \dots, c_{m,l}, \dots, c_{m,L_m})$ augmented by the backing-off. If these selections are assumed to be permitted for each $s_m$, dynamic programming [16] can be applied to this search according to the principle of optimality (Figure 4).

[Figure 4 here: for the given score $S$, the candidate sets $C_{m-2}, C_{m-1}, C_m, C_{m+1}$ of context-dependent models augmented by decision-tree backing-off form the search space of dynamic programming, which selects the sequence of optimum cases $\hat{v}_{m-3}, \hat{v}_{m-2}, \hat{v}_{m-1}, \hat{v}_m$ for $S$.]

Figure 4: Dynamic programming to select cases that constitute the performance sequence $V$.

The likelihood based on the feature quantities $d_{m,l}$ found in $c_{m,l}$ and the statistics of the middle node $b_{m,l}$ is used to evaluate the suitability of selecting a case $c_{m,l}$ for $s_m$. First, the selection of a case $c_{1,l}$ for $s_1$ is evaluated by $h_1(c_{1,l})$. Next, the selection of a pair of cases $(c_{1,k}, c_{2,l})$ for $(s_1, s_2)$ is evaluated by $h_2(c_{1,k}, c_{2,l})$. This process continues up to the final evaluation $h_M(c_{M-1,k}, c_{M,l})$. The formulas used to obtain these evaluation values are shown below:

$$h_1(c_{1,l}) = P\!\left(\boldsymbol{d}_{1,l}^F \mid \boldsymbol{\mu}_{1,l}^F, \boldsymbol{\sigma}_{1,l}^F\right) = \prod_{f \in F} P\!\left(d_{1,l}^f \mid \mu_{1,l}^f, \sigma_{1,l}^f\right), \qquad (4)$$

$$h_m(c_{m-1,k}, c_{m,l}) = P\!\left(\boldsymbol{d}_{m,l}^F \mid \boldsymbol{\mu}_{m,l}^F, \boldsymbol{\sigma}_{m,l}^F\right) P\!\left(\Delta\boldsymbol{d}_{m,l}^F \mid \boldsymbol{\mu}_{m,l}^{\Delta F}, \boldsymbol{\sigma}_{m,l}^{\Delta F}\right) \quad (2 \le m \le M). \qquad (5)$$

$$\begin{cases} F = (\mathrm{AT}, \mathrm{GR}, \mathrm{DR}),\ \Delta F = (\Delta\mathrm{AT}, \Delta\mathrm{GR}, \Delta\mathrm{DR}), & \text{for note,} \\ F = \mathrm{BR},\ \Delta F = \Delta\mathrm{BR}, & \text{for beat.} \end{cases}$$

$\Delta d_{m,l}^F = d_{m,l}^F - d_{m-1,k}^F$ are the differential quantities of each $F$ obtained by assuming the selection of $(c_{m-1,k}, c_{m,l})$ for $(s_{m-1}, s_m)$. The search for the optimum cases can be viewed as the problem of maximizing the evaluation values for each direction of $S$ in the objective function described below:

$$J = h_1(c_{1,l}) + h_2(c_{1,k}, c_{2,l}) + \dots + h_M(c_{M-1,k}, c_{M,l}) \to \max. \qquad (6)$$
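A sketch of this search, assuming the candidate sets are plain lists and that the callbacks h1 and hm return the evaluation values of Equations (4) and (5); the accumulated score is then the objective $J$ of Equation (6):

```python
# Sketch of the dynamic-programming search maximizing Eq. (6).
# candidates[m] stands for the augmented candidate set C_{m+1}.

def select_cases(candidates, h1, hm):
    """Return the case sequence maximizing J = h1 + h2 + ... + hM."""
    best = [h1(c) for c in candidates[0]]   # best score ending at each c in C_1
    back = [[None] * len(candidates[0])]    # backpointers per stage
    for m in range(1, len(candidates)):
        scores, ptrs = [], []
        for c in candidates[m]:
            step = [best[k] + hm(prev, c)
                    for k, prev in enumerate(candidates[m - 1])]
            k_best = max(range(len(step)), key=step.__getitem__)
            scores.append(step[k_best])
            ptrs.append(k_best)
        best = scores
        back.append(ptrs)
    # Backtrack from the best final case to recover the optimum sequence.
    l = max(range(len(best)), key=best.__getitem__)
    path = []
    for m in range(len(candidates) - 1, 0, -1):
        path.append(candidates[m][l])
        l = back[m][l]
    path.append(candidates[0][l])
    return list(reversed(path))
```

Each stage costs on the order of $|C_{m-1}| \cdot |C_m|$ evaluations of $h_m$, which is why controlling the size of the candidate sets, discussed next, matters for computational cost.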

All cases included in the tree structure can become candidates for the search, since the backing-off finally reaches the root node. However, a search targeting all cases is not always necessary, because a case located significantly far from the correspondent leaf node in the tree structure is unlikely to be selected as the optimum. A more efficient search is therefore also considered, in terms of computational cost, by controlling the scale of the augmentation of candidate cases. The index value $\theta_{m,l}$ shown below is used to decide whether the backing-off continues or terminates, against a threshold defined in advance:

$$\theta_{m,l} = \left(b_m^{\max} - b_m^{\min}\right)^{-1} \left\{ P(\boldsymbol{d}_{m,1} \mid \boldsymbol{\mu}_{m,l}, \boldsymbol{\sigma}_{m,l}) - b_m^{\min} \right\}, \qquad (0 \le \theta_{m,l} \le 1). \qquad (7)$$

$b_m^{\max}$ and $b_m^{\min}$ are the maximum and minimum values among the likelihoods obtained for each intermediate node and the correspondent leaf node $c_{m,1}$. If the threshold is close to $\theta_{m,l} = 1$, the augmentation of candidates is restricted to only those cases that are very close to the correspondent leaf node.


Dataset | Performer | Data to train context-dependent models | Amount of data
G(1) | G. Gould | W. A. Mozart's Piano Sonatas: second and third movements of K. 279 and first movement of K. 310 | $N_{\mathrm{note}} = 2305$, $N_{\mathrm{beat}} = 396$
P(1) | M. J. Pires | W. A. Mozart's Piano Sonatas: second and third movements of K. 279 and first movement of K. 310 | $N_{\mathrm{note}} = 2292$, $N_{\mathrm{beat}} = 396$
P(2) | M. J. Pires | W. A. Mozart's Piano Sonata: second and third movements of K. 310 | $N_{\mathrm{note}} = 2475$, $N_{\mathrm{beat}} = 504$

Table 1: Datasets used for the training of context-dependent models.

[Figure 5 here: concordance rate of search with all cases (%) versus average number of candidate cases per note, for datasets G(1), P(1), and P(2)/P(1+2); points mark thresholds from $\theta > 0.9$ down to $\theta > 0.3$, the without-backing-off condition, and the all-cases search.]

Figure 5: Concordance rate of selected cases for note.

3. EVALUATION OF THE SYSTEM

We implemented a system for performance rendering based on the proposed method and evaluated the rendered performances from several perspectives. The datasets used for training the context-dependent models were obtained from a database¹ created by musical dictation of the waveforms of a number of virtuosi's piano solo performances on specific MIDI sound generators. The database contains the data related to note and beat converted into the format described above. The directions of the scores were converted into data associable with performance expression by using MusicXML.

3.1 Objective evaluation

In order to verify the effectiveness of the decision-tree backing-off method, a number of performances were rendered with cases selected under differently scaled backing-off. The scale of the cases augmented as search candidates was controlled by the criterion shown in Equation (7). The score used to render the performances here was fixed to "W. A. Mozart's Piano Sonata, the first movement of K. 279 (treble voice part)," which is unseen in all of the datasets listed in Table 1 that are used to train the context-dependent models.

As a reference representing a search over all cases of the training data, the concordance rates of the selected cases under varied search ranges were examined. The results for note are shown in Figure 5, and those for beat in Figure 6. These figures show that the selections in every dataset matched the results of the all-cases search exactly, even when the candidates searched were limited to only 20% to 50% of all cases closest to $c_{m,1}$ in the decision tree. Evidently, few cases are actually effective for any given direction; what matters is to select, from among them, the case with the optimum expression for that direction efficiently. Decision-tree backing-off enables this optimized search by reducing the number of candidates that must be examined to find the optimum case for rendering the performance of an unseen direction.

¹ CrestMusePEDB ver. 2, http://www.crestmuse.jp/pedb

[Figure 6 here: concordance rate of search with all cases (%) versus average number of candidate cases per beat, for G(1) ($\theta > 0.2$, $\theta > 0.17$), P(1) ($\theta > 0.3$, $\theta > 0.28$), and P(2)/P(1+2) ($\theta > 0.2$, $\theta > 0.15$), together with the without-backing-off and all-cases conditions.]

Figure 6: Concordance rate of selected cases for beat.

Figure 7 shows the trajectories of the feature quantities for each factor rendered by G(1). In general, similar trends are obtained for each search range. Under the without-backing-off condition, however, many cases fluctuate in the direction opposite to the other conditions, with variations that appear to ignore the trend of all the others. Although this comparison is made without a ground-truth sequence, such significant local variations lacking continuity are, in general, unlikely to lend naturalness to the performance. The efficacy of introducing decision-tree backing-off is also confirmed by the fact that these strange variations are corrected even with a relatively small augmentation of the search range, such as $\theta_{m,l} > 0.9$ for note and $\theta_{m,l} > 0.2$ for beat.

The trajectories of the feature quantities for each factor rendered by P(1), P(2), and P(1+2) are shown in Figure 8. These results also make clear the dependence on the combination of the composition and its performer used as training data for the context-dependent models. The tree structure of the models evidently captures the characteristics of the rendering process of the performance for such combinations. To obtain desirable results when rendering unknown compositions, one must consider not only the combination of compositions used to train the models but also the differences in characteristics between performers. However, a clear generalization of this combination and of the appropriate amount of training data cannot be obtained solely from the combinations of composition and performer available here. Validation using context-dependent models trained separately on combinations of cases obtained from a wider variety of performances is needed.

3.2 Subjective evaluation

In order to verify the musical aspects of the performances rendered by the system, they should be evaluated by human listeners. For this evaluation, three performers' models were trained with the data described below:

C-A: F. Chopin’s Etude Nos. 3, 4, 23; Mazurka No. 5;Nocturne Nos. 2, 10; Prelude Nos. 7, 20; and WaltzNos. 1, 3, 9, 10, performed by V. Ashkenazy. Nnote =12092, Nbeat = 2566.


[Figure 7 here, four panels plotting feature quantity against position of the note (or beat) in the measure (measures 1–16) for the all-cases, $\theta > 0.6$, $\theta > 0.9$, and without-backing-off conditions (for (d): all-cases, $\theta > 0.17$, $\theta > 0.2$, without backing-off): (a) Attack Timing (BPQN) in G(1); (b) Gatetime Ratio in G(1); (c) Dynamics Ratio in G(1); (d) Local BPM Ratio in G(1).]

Figure 7: Feature quantities in performances rendered by G(1), for each search range of cases.

M-G: W. A. Mozart’s Piano Sonata, all movements of K.279 and the first movement of K. 310, performed byG. Gould. Nnote = 3112, Nbeat = 537.

M-P: W. A. Mozart’s Piano Sonata, all movements of K.279, K. 310, and K. 545 and the second and thirdmovements of K. 331, performed by M. J. Pires. Nnote

= 13703, Nbeat = 2613.

Seven compositions that were not included in the training data and whose musicality is unrelated to it were used for rendering. Twenty participants, chosen without regard to professional experience in playing musical instruments, evaluated them on a five-point scale. The results for the overall evaluation, together with the rating criteria used, are shown in Figure 9(a). The results obtained by transferring only the feature quantities of notes, or only those of beats, are also shown for reference. Figure 9(b) shows the results evaluated for each composition.

In general, the results obtained are good: the overall evaluations shown in Figure 9(a) are approximately four. These values are generally higher than those for the condition in which only the note-related feature quantities are transferred, yet the trend also follows the results for the condition in which only the beat-related feature quantity is transferred. In the M-P model, there is a large bias between the limited transcription conditions in their contribution to the quality of the performance. Their contributions need not always be equal, but the tree structure of the context-dependent model for beat may not be as close to optimum for unknown compositions as that for notes.

[Figure 8 here, four panels plotting feature quantity against position of the note (or beat) in the measure (measures 1–16) for P(1), P(2), and P(1+2): (a) Attack Timing (BPQN); (b) Gatetime Ratio; (c) Dynamics Ratio; (d) Local BPM Ratio.]

Figure 8: Feature quantities in performances rendered by P(1), P(2), and P(1+2), search with all cases.


In Figure 9(b), more than half of the compositions for M-G have ratings above four. Simply using a large number of cases to construct the tree structure should not be done lightly, because extending the candidates with similar cases that make only a marginal difference to the selection is undesirable for search efficiency. The absolute amount of training data used in M-G is less than that of M-P, but the performances behind M-G exhibit a tendency of expression whose characteristics can be captured and transcribed efficiently. Figure 9(b) also shows large differences between the ratings of the compositions in C-A. The combinations of contextual information suited to describing the control of expression differ in some cases, since the tendency of expression varies with the characteristics of the composition even for the same performer. Constructing the tree structure by mixing a large number of such cases is unlikely to be expedient for rendering the performance of a particular composition. A simple comparison is difficult because the compositions and performers differ, but the Classical music used in M-G and M-P yields performances of more stable quality than the Romantic music used in C-A, even for compositions of unrelated musicality.


[Figure 9 here. Panel (a): averaged ratings on a five-point scale with 95% confidence intervals (rating criteria: 5 keeps characteristics of the original; 3 has some humanity; 1 seems unreasonable) for the beat-only ($F$ beat only), note-only ($F$ note only), and overall transcription conditions, for the model training datasets C-A, M-G, M-P, and their average. Panel (b): averaged ratings for each composition: J. S. Bach, WTC I No. 23 Prelude; B. Bartók, Roumanian Folk Dance No. 1; C. A. Debussy, Préludes Book 1 No. 10; S. Prokofiev, Piano Sonata No. 7; S. Rachmaninoff, Prelude Op. 3 No. 2; M. Ravel, Sonatine Mvt. 1; E. Satie, Gymnopédies No. 1.]

Figure 9: Subjective evaluation scores. (a) Averaged ratings for each (or both) model type. (b) Averaged ratings for each composition.

4. CONCLUSIONS

This paper proposed an autonomous system for automatic performance rendering with high reproducibility of the characteristics of the performer. It uses stochastic models that associate the tendencies of expression in existing performances with the directions notated in the given score. The structure of the automatically systematized models enables an efficient search for combinations of cases that are optimized for rendering performances.

The objective evaluations indicate that the decision-tree backing-off algorithm enables an efficient search for the optimum case series for rendering. The subjective evaluation showed that the system renders performances stably even for compositions with unconventional styles. Consequently, performances rendered by the proposed system won first prize in the autonomous section of a performance rendering contest for computer systems [17]. The quality of the system was also validated in a large-scale subjective evaluation with eighty participants and piano performance experts. The performances rendered on that occasion are available on the web site that summarizes the results². In addition, more samples rendered for a variety of other compositions are available on our web site³.

² Results | Rencon 2013, http://smac2013.renconmusic.org/results
³ Laminae Articulates Musicians' Intention 'N Artistic Expression, http://www.mmsp.nitech.ac.jp/~k09/laminae

Acknowledgments

This research was supported in part by JSPS KAKENHI (Grant-in-Aid for Scientific Research) Grant Number 26730182 and by the Telecommunications Advancement Foundation (TAF).

5. REFERENCES

[1] G. De Poli, "Methodologies for expressiveness modeling of and for music performance," Journal of New Music Research, vol. 33, no. 3, pp. 189–202, 2004.

[2] G. Widmer and W. Goebl, "Computational models of expressive music performance: The state of the art," Journal of New Music Research, vol. 33, no. 3, pp. 203–216, 2004.

[3] C. Palmer, "Music performance," Annual Review of Psychology, vol. 48, no. 1, pp. 115–138, 1997.

[4] A. Gabrielsson, "Music performance research at the millennium," Psychology of Music, vol. 31, no. 3, pp. 221–272, 2003.

[5] G. Grindlay and D. Helmbold, "Modeling, analyzing and synthesizing expressive piano performance with graphical models," Machine Learning Journal, vol. 65, no. 2-3, pp. 361–387, 2006.

[6] S. Flossmann, M. Grachten, and G. Widmer, "Expressive performance rendering: Introducing performance context," in Proceedings of the Sound and Music Computing (SMC) Conference, Porto, Portugal, July 2009, pp. 155–160.

[7] T. H. Kim, S. Fukayama, T. Nishimoto, and S. Sagayama, "Performance rendering for polyphonic piano music with a combination of probabilistic models for melody and harmony," in Proceedings of the Sound and Music Computing (SMC) Conference, Barcelona, Spain, July 2010.

[8] T. Suzuki, T. Tokunaga, and H. Tanaka, "A case based approach to the generation of musical expression," in Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, July 1999, pp. 642–648.

[9] K. Hirata and R. Hiraga, "Ha-Hi-Hun: Performance rendering system of high controllability," in Proceedings of the ICAD 2002 Rencon Workshop, Kyoto, Japan, July 2002, pp. 40–46.

[10] A. Tobudic and G. Widmer, "Relational IBL in classical music," Machine Learning Journal, vol. 64, no. 1-3, pp. 5–24, September 2006.

[11] J. A. Sloboda, "The acquisition of musical performance expertise: Deconstructing the 'talent' account of individual differences in musical expressivity," in The Road to Excellence: The Acquisition of Expert Performance in the Arts and Sciences, Sports, and Games, K. A. Ericsson, Ed. Mahwah, NJ: Lawrence Erlbaum Associates, December 1996, ch. 4, pp. 107–126.

[12] K. Okumura, S. Sako, and T. Kitamura, "Stochastic modeling of a musical performance with expressive representations from the musical score," in Proceedings of the 12th International Society for Music Information Retrieval Conference, Miami, FL, USA, October 2011, pp. 531–536.

[13] C. E. Seashore, Psychology of Music. Courier Dover Publications, 1938.

[14] J. J. Odell, "The use of context in large vocabulary speech recognition," Ph.D. dissertation, Cambridge University, 1995.

[15] S. Kataoka, N. Mizutani, K. Tokuda, and T. Kitamura, "Decision-tree backing-off in HMM-based speech synthesis," in Proceedings of the 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 2004, pp. 1205–1208.

[16] R. E. Bellman, Dynamic Programming. Princeton University Press, 1957.

[17] K. Okumura, S. Sako, and T. Kitamura, "A stochastic model of artistic deviation and its musical score for the elucidation of performance expression," Rencon Working Group. [Online]. Available: http://smac2013.renconmusic.org/participants
