
Journal of Mathematics and Music, Vol. 00, No. 00, Month 200x, 1–24

    Time-Frequency Representation of Musical Rhythm by

    Continuous Wavelets

    Leigh M. Smith and Henkjan Honing

Music Cognition Group/ILLC, Universiteit van Amsterdam, Plantage Muidergracht 24, 1018 TV, The Netherlands

    (Received 00 Month 200x; in final form 00 Month 200x)

A method is described that exhaustively represents the periodicities created by a musical rhythm. The continuous wavelet transform is used to decompose a musical rhythm into a hierarchy of short-term frequencies. This reveals the temporal relationships between events over multiple time-scales, including metrical structure and expressive timing. The analytical method is demonstrated on a number of typical rhythmic examples. It is shown to make explicit periodicities in musical rhythm that correspond to cognitively salient rhythmic strata such as the tactus. Rubato, including accelerations and retards, is represented as temporal modulation of single rhythmic figures, instead of timing noise. These time-varying frequency components are termed ridges in the time-frequency plane. The continuous wavelet transform is a general invertible transform and does not exclusively represent rhythmic signals alone. This clarifies the distinction between what perceptual mechanisms a pulse tracker must model, compared to what information any pulse induction process is capable of revealing directly from the signal representation of the rhythm. A pulse tracker is consequently modelled as a selection process, choosing the most salient time-frequency ridges to use as the tactus. This set of selected ridges is then used to compute an accompaniment rhythm by inverting the wavelet transform of a modified magnitude and original phase back to the time domain.

Keywords: Rhythm; Rhythmic Strata; Expressive Timing; Continuous Wavelet Transform; Time-Frequency Analysis; Beat Tracking

    1. Introduction

Despite a long history of computational modelling of musical rhythm [1, 2], the performance of these models is yet to match human performance. Humans can quickly and accurately interpret rhythmic structure, and can do so very flexibly; for example, they can easily distinguish between rhythmic, tempo and timing changes [3]. What are the representations and relevant features that humans so successfully use to interpret rhythm? To address this, in this paper we investigate the information content contained within the rhythmic signal when examined using a multiscale representation.

A musical rhythm can be distinguished, memorised and reproduced independently of the music's original pitch and timbre. Even using very short impulse-like clicks, a familiar rhythm can be recognised, or an unfamiliar rhythm comprehended and tapped along with. The rhythm is thus described from the inter-onset intervals (IOIs) between events alone, that is, the temporal structure.

The rhythmic interpretation of those temporal patterns has received considerable research; notable summaries include Fraisse [4], Clarke [5] and London [6].

Corresponding author. Email: [email protected]. This research was realized in the context of the EmCAP (Emergent Cognition through Active Perception) project funded by the European Commission (FP6-IST, contract 013123). Thanks go to Peter Kovesi, Robyn Owens, Olivia Ladinig and Bas de Haas for comments on drafts of this paper.

Yeston [7] characterised musical time as consisting of a hierarchy of time periods spanning individual notes, bars, and phrases, terming these periods rhythmic strata. Lerdahl and Jackendoff [8] made the important distinction between the process of grouping of events across time, and the induction of the musical meter which arises from the regular re-occurrence of accented events. These researchers both noted the role of the tactus, constituting the most prominent musical pulse rate and phase that is induced by the listener. The tactus typically appears as the rate and time points that listeners will tap to (typically around 600 ms [4]) when hearing a musical rhythm.

Such attending to a musical rhythm is proposed by Jones [9] to use two strategies, future-oriented and analytic processes. These project expectancies forward in time, and retrospectively assess experienced (i.e., heard) events, respectively. Both strategies are argued to occur simultaneously and to be judged with respect to the hierarchical time levels established by the rhythm. The perception of a rhythmic pulse and its ongoing attending can be characterised as composed of two resonant processes. These consist of a bottom-up beat induction process, and a schematic expectation process providing top-down mediation of the perception of new events [10, 11]. Gouyon and Dixon [12] illustrate and characterise computational models according to a similar machine-learning oriented architecture, describing the top-down process as pulse tracking.

This tracking task has remained an unsolved research problem in part due to the effect of expressive timing. Musicians purposefully use expressive timing to emphasise the structural aspects of the piece, such as the metrical and phrase structure [13]. A representation of musical rhythm must therefore account for the establishment of rhythmic hierarchy, the induction of pulse and the role and effect of expressive timing. In this paper, we describe a representation of musical rhythm using an established transformation, the continuous wavelet transform, applied to the inter-onset timing of events. This representation is demonstrated to make explicit the rhythmic hierarchy and expressive timing. Furthermore, it allows a number of beat tracking methods to be applied to determine the tactus implicit in the rhythmic signal.

    1.1. Previous Work

Considerable research has been directed at designing models of both pulse induction and tracking processes towards the final goal of producing useful and robust models of musical time. Existing models have used various approaches including grammars [14, 15], expectancy [16], statistics [17, 18], Minskyian agents [19], oscillator entrainment [2, 20] and other self-organising connectionist systems [21–23]. A recent review of rhythm description systems is provided by Gouyon and Dixon [12]. Common problems confronted and addressed in a diverse manner by these approaches are the representation of temporal context, order and hierarchy, and the role of expressive timing and tempo within the existing rhythmic structure.

Since musical rhythm can be induced from clicks alone, a rhythmic function for analysis is created by representing the time of each onset as a unit impulse function. The rhythm function for a piece of music is therefore a train of impulses with intervals matching the IOIs between onsets. A pulse-train function can be seen to be a minimal or critical sampling of the auditory amplitude envelope at the lowest sampling frequency which still accurately represents the rhythm function. This yields one sample at the point in time at which each musical event becomes audible to the listener. This is an onset-based representation of rhythm, and is typically recovered by a detection algorithm operating on the signal energy of the acoustic signal, |y|².

This is effectively a rectification of the audio signal to deconvolve the amplitude envelope from the auditory carrier signal [24, 25]. Alternatively, the onset times are obtained by transducing a musician's actions on a sensing instrument (e.g., MIDI). This is distinguished by Gouyon and Dixon [12] from a time-domain frame based system which aims to determine the rhythmic signal directly from the auditory signal. However, in practice, systems must rectify in order to deconvolve the rhythm from the auditory signal.

The explicit treatment of a rhythm as a signal, applicable to digital signal processing methods, has only recently become widely adopted as a computational approach to rhythm perception. Notable early examples have been the use of autocorrelation methods by Desain and de Vos [26] and Brown [27] to detect periodicities in MIDI data and audio recordings, respectively. Goto and Muraoka [28] developed a beat tracking system capable of processing musical audio in real time. Their system manages multiple hypotheses created by beat prediction agents. These agents track onset time finders which operate on spectral energy representations. The frequency and sensitivity parameters of the onset time finders are tuned by the agents. Domain knowledge is used to pattern match against characteristic spectral features (snare and bass drums) and typical popular music rhythms (strong-weak-strong-weak repetition). Beat intervals are determined by interval histograms weighted by reliability of tracking estimates. As the authors admit, the system makes strong assumptions on the music to be tracked: popular songs in 4/4 with tempo constrained in range and variability.

Formalising the dynamic attending model of Jones and Boltz [9], Jones and Large [2] have developed a model based on coupled self-oscillating attending rhythms which entrain to external (performed) rhythms. Phase and period (frequency) coupling of the attending rhythm to the external rhythm enable variations in the timing of the external rhythm to be tracked. Internal coupling between oscillators is designed to pull pairs into a 2:1 phase-locked relationship. The model relies on an attentional focus period during each attentional rhythm's oscillation that creates a period of enhanced responsiveness (attentional targetting or a temporal responsive field [29, p. 80]). The period of focus narrows with repetition, responding to onsets which are expected to fall within this attended period. Conversely, onsets which fall at half periods (twice the rate) lie outside the attentional focus. Hence, the model is insensitive to change at these points in time [29]. In the currently described model, to begin to track an external rhythm which then changes to double time, a new oscillator must begin tracking, starting from a wide attentional focus, or the currently attending rhythm must adapt to this new double rate rhythm. The dynamics of each oscillator form a time and frequency varying filter which can adapt to timing variations from the musical rhythm.

Scheirer [25] demonstrated that the amplitude envelope of a musical audio signal alone is sufficient to induce a rhythmic percept that matches the original musical signal. His model uses a six band filter bank to attempt to separate polyphonic input by frequency bin. This assumes that any polyphony occurring within that frequency band will not be rhythmically contradictory (i.e., play in polyrhythm). The band-pass envelope channel output is rectified, smoothed, differentiated and then half-wave rectified in order to sharpen the amplitude envelope of the sub-band to an approximation of an onset signal. This impulses a causal resonant comb filter bank which resonates at the periodicity of the sub-band. The comb filters only resonate with onsets at periods matching their delay times, so a sufficient number of resonators is required to track tempo deviations in real time. Therefore Scheirer tests against music which must exhibit a strong beat.

In a similar fashion, Sethares and Staley [30] used a signal decomposition method that extracts strictly periodic components from audio sub-bands.

They select basis elements according to several different criteria, including the best correlation of the elements to the sub-band signal. As the authors note, "When the tempo of the performance is unsteady, the periodicity methods fail", highlighting the method's reliance on a steady underlying pulse [30, p. 152]. Such unsteady tempo includes musically essential gestures such as ritardando and accelerando, so this approach and Scheirer's seem to be limited in their applicability for music that exhibits such expressive timing.

Klapuri et al. [31] generalise the preprocessing of the audio signal of Scheirer [25] and Goto and Muraoka [28] to produce many sub-bands which are then summed to a subset of four channels, termed accent signals. These accent signals are then subject to periodicity detection. This is performed by a bank of comb filter resonators (matching Scheirer's) whose outputs are summed to a single measure of current periodicities. Three hidden Markov models comprise a probabilistic model to estimate the periods and phases of the tactus, bar and shortest rhythmic interval (tatum) from the summed and individual resonator energies, respectively. While described probabilistically, the period and phase probabilities are not determined by training; they are derived from features chosen by hand, estimated to have a proportional influence. Phase of the bar is chosen by pattern matching against two expected energy time profiles derived by inspection. Using such an approach outperforms two reference systems, Scheirer [25] and Dixon [32], although pieces which exhibit substantial expressive timing were not tested.

Modelling tempo tracking as a stochastic dynamical system, Cemgil et al. [33] represent tempo as a hidden state variable estimated by Kalman filtering. The model of tempo is a strict periodicity with an additional noise term that describes expressive timing as a Gaussian distribution. The Kalman filter uses a multiscale representation of a real performance, the tempogram. This is used to provide a Bayesian estimation of the likelihood of a local tempo given a small set of onsets. The tempogram is calculated by convolving onset impulses with a Gaussian function which is then decomposed onto multiscale basis functions. These bases are similar to Scheirer's comb filters, forming a local constant tempo that is used to match against the incoming rhythm. By considering rhythm as locally constant, and timing deviations as noise, the system does not take advantage of the implicit structure of a performer's expressive timing.

Todd [24] applied banks of Mexican hat filters, analogous to the primal sketch theory of human vision [34], towards an auditory primal sketch of rhythm. This produces a rhythmogram representation of an audio signal. A recent publication by Todd [35] postulates a rhythm perception model based in the auditory periphery and controlling directly the musculoskeletal system. While such a neurobiologically inspired model may be plausible, it is difficult to measure the contribution of each component of the model against a base-line. For example, the performance of a model may be due to its accurate representation of neurobiology, or alternatively mostly due to the behaviour of the signal processing systems incorporated therein. A model that attempts to relate directly to components of the human auditory periphery may not provide the simplest explanation for the output, making verification difficult. Given the massive connectionism present in neurobiology, a model may not reflect neurological architecture in a sufficient manner to produce accurate results, or be simply too computationally expensive to test thoroughly.

Smith [36] and Smith and Kovesi [37] reported the use of the continuous wavelet transform as a means of analysis of rhythm consisting of onsets as an impulse train. The output of the transform is similar to Todd's rhythmogram, but more detailed. The representation reveals a hierarchy of rhythmic strata and the time of events by using a wavelet that has the best combined frequency and time resolution.

Such bottom-up data-oriented approaches, including the multiresolution method described in this paper, do not fully account for human rhythm cognition. In addition, rhythm perception is clearly influenced in a top-down manner by the listener's memory, developed by a combination of explicit training and learning through exposure. A goal of this paper is to clarify the information which is inherent (i.e., retrievable) in the temporal structure of a musical rhythm. This aims to establish a base-line measure to evaluate the explanatory contribution of different models of musical time. In short, the intention of this paper is to demonstrate how much information can be retrieved from the rhythmic signal itself.

    1.2. Proposed Method

The paper is organised as follows: The analytical technique of a continuous wavelet transform is reviewed in Section 2. The application of this transform to musical rhythm to produce a rhythmic hierarchy is described in Section 3. Musical rhythm is described in terms of signal processing theory and distinguished from the auditory spectrum in Sections 3.1 and 3.2. A simplified schema model is used to determine the tactus (foot tapping rate) in Section 3.3. This extracted tactus is then used to compute a beat tracking rhythm to accompany the original rhythm in Section 3.4. The analysis and representation of rhythms with expressive timing is demonstrated in Section 4.

    2. The Continuous Wavelet Transform

Expressive timing, agogic, dynamic and other objective accents produce complex, frequency and amplitude varying rhythmic signals which require a non-stationary signal analysis technique. Analytical wavelets are well suited to this task. The following section is a review of the continuous wavelet transform; further detail is provided in Smith [38].

Wavelet theory has historical roots in the analysis of time varying signals, the principle being to decompose a one-dimensional signal s(t) at time t into a non-unique two-dimensional time-frequency distribution W_s(t, f), representing frequencies changing over time [39, 40].

Earlier signal analysis approaches have used the short term Fourier transform (STFT), a time windowed version which decomposes a signal onto harmonically related basis functions composed of weighted complex exponentials. The STFT is

$$F_s(t, f) = \int_{-\infty}^{\infty} s(\tau)\, \bar{h}(\tau - t)\, e^{-i 2 \pi f \tau}\, d\tau,$$

where $\bar{h}(t)$ is the complex conjugate of the window function. Significantly, the window function's time scale is independent of f.

In contrast, the continuous wavelet transform (CWT) [40–42] decomposes the signal onto scaled and translated versions of a mother-wavelet or reproducing kernel basis g(t),

$$W_s(b, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} s(\tau)\, \bar{g}\!\left(\frac{\tau - b}{a}\right) d\tau, \quad a > 0, \qquad (1)$$

where a is the scale parameter and b is the translation parameter.

The scale parameter controls the dilation of the window function, effectively stretching the window geometrically over time. The translation parameter centres the window in the time domain. Each of the W_s(b, a) coefficients weights the contribution of each basis function to compose s(t). The geometric scale gives the wavelet transform a zooming capability over a logarithmic frequency range, such that high frequencies (small values of a) are localised by the window over short time scales, and low frequencies (large values of a) are dilated over longer time scales [43]. This forms an influence cone [44] which has a time interval, for each scale and translation, between $[a t_l + b;\ a t_r + b]$ for a mother-wavelet with time support over the interval $[t_l, t_r]$.
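To make Equation (1) concrete, the sketch below computes a discretised CWT by correlating a sampled rhythm signal with scaled, conjugated copies of a Gabor/Morlet-style kernel over geometrically spaced scales. This is a minimal numpy sketch rather than the authors' implementation: the kernel definition, the ω0 = 6.2 default, the 1/√a normalisation and the names `morlet_kernel` and `cwt` are assumptions standing in for the paper's Equation 4.

```python
import numpy as np

def morlet_kernel(scale, omega0=6.2, support=4.0):
    """Sampled Morlet-style kernel at scale a (in samples): a complex exponential
    under a Gaussian envelope, normalised by 1/sqrt(a)."""
    half_width = int(np.ceil(support * scale))
    t = np.arange(-half_width, half_width + 1, dtype=float)
    envelope = np.exp(-t ** 2 / (2.0 * scale ** 2))
    carrier = np.exp(1j * omega0 * t / scale)
    return envelope * carrier / np.sqrt(scale)

def cwt(signal, scales):
    """Continuous wavelet transform W_s(b, a): one row of complex coefficients per scale.
    Assumes the dilated kernel stays shorter than the signal at every scale."""
    coeffs = np.zeros((len(scales), len(signal)), dtype=complex)
    for i, a in enumerate(scales):
        g = morlet_kernel(a)
        # Correlation with the conjugated kernel, i.e. Equation (1) sampled at each translation b.
        coeffs[i] = np.convolve(signal, np.conj(g[::-1]), mode="same")
    return coeffs

# Geometrically spaced scales (in samples) give the logarithmic frequency axis of the scaleogram.
scales = 2.0 ** np.arange(2, 8, 0.25)
```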

Resynthesis from the transform domain back to the time domain signal is obtained by

$$s(t) = \frac{1}{c_g} \int_{0}^{\infty}\!\!\int_{-\infty}^{\infty} \frac{1}{\sqrt{a}}\, W_s(b, a)\, g\!\left(\frac{t - b}{a}\right) \frac{da\, db}{a^2}, \qquad (2)$$

where the constant $c_g$ is set according to the mother wavelet chosen,

$$c_g = \int_{-\infty}^{\infty} \frac{|\hat{g}(\omega)|^2}{|\omega|}\, d\omega.$$

Figure 1. Time domain plots of Morlet wavelet kernels, showing real and imaginary components for the mother wavelet and a version dilated by a = 2.

The Gaussian envelope provides the best possible simultaneous localisation in time and frequency, meeting the lower bound of the uncertainty relation

$$\Delta t\, \Delta f \geq \frac{1}{4\pi}.$$

This led Gabor [47] to propose its use for basis functions which incorporate both time and frequency. Subsequently, Kronland-Martinet, Morlet and Grossmann [43, 48, 49] applied such a wavelet to sound analysis; however, the research reported here (and in [36, 37]) differs from their approach in that it is the rhythm signal (the function that modulates the auditory carrier) that is analysed using so-called Morlet wavelets, not the entire sound signal. Here the rhythm is analysed independently (effectively deconvolved) of the auditory carrier component.

However, the kernel of Equation 4 does not meet the admissibility condition of a zero mean for exact reconstruction [38, 42]. The asymptotic tails of the Gaussian distribution envelope must be limited in time, such that the residual oscillations will produce a non-zero mean. Given that much analysis can be performed without requiring exact reconstruction, this is not a problem in practice, particularly for the application of musical rhythm analysis. Likewise, the Gaussian envelope renders Equation 4 close to a progressive support or analytic wavelet, nearly satisfying

$$\forall \omega < 0 : \hat{g}(\omega) = 0.$$

Equations 1 and 4 produce complex valued results and, due to their analytic nature, the real and imaginary components are the Hilbert transform of each other. The conservation of energy of progressive analytical wavelets allows the modulus of a wavelet transform to be interpreted as an energy density localized in the time/scale half-plane. An analytic (progressive) signal $Z_s(t)$ of s(t) can be defined in polar coordinate terms of modulus $A_s(t)$ and phase $\phi_s(t)$ as

$$Z_s(t) = A_s(t)\, e^{i \phi_s(t)}. \qquad (5)$$

The magnitude and phase of the wavelet coefficients W_s(b, a) can then be plotted on a linear time axis and logarithmic scale axis as scaleogram and phaseogram plots (see for example Figure 4), first proposed by Grossmann et al. [50]. The discretised version of the phase of the wavelet transform, $\phi(b, a) = \arg[W_s(b, a)]$ (the phaseogram), can be recovered due to the nature of the near-analytic mother wavelet (Equation 4). Phase values are mapped onto a colour wheel or grey-scale to visualise the regularity of the progression of phase. To improve clarity, phase values are clamped to 0 where they correspond to low magnitude values; otherwise

$$|W_s(b, a)| > m, \qquad (6)$$

where the magnitude threshold m = 0.005 registers the phase measure as valid to display.
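A minimal sketch of this magnitude/phase split, assuming the complex coefficients W_s(b, a) are held in a two-dimensional numpy array indexed by (scale, time); the clamping of Equation (6) is applied directly to the phase array. The function and argument names are illustrative, not the authors' code.

```python
import numpy as np

def scaleogram_and_phaseogram(coeffs, m=0.005):
    """Split complex CWT coefficients into a scaleogram (magnitude) and a phaseogram (phase)."""
    magnitude = np.abs(coeffs)      # energy density per (scale, time) cell
    phase = np.angle(coeffs)        # arg W_s(b, a)
    # Equation (6): only keep the phase where the magnitude is large enough to be meaningful.
    phase[magnitude <= m] = 0.0
    return magnitude, phase
```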

    2.2. Ridges

A group of researchers (well summarized by [42, Chapter 4]) have used points of stationary phase derived from wavelet analysis to determine ridges indicating the frequency modulation function of an acoustic signal. This determines the frequency variations over time of the fundamental and a finite number of partials. The chief motivation of this research was to reduce the computation of the transform to only the ridges, collectively termed a skeleton [42, 51–53]. In that application, the signal analysed was the sampled sound pressure profile.

The motivation here is to extract the frequency modulation function for the purpose of determining a principal rhythmic partial that corresponds to the tactus. Several methods of ridge extraction from the magnitude and phase of the time-frequency plane are combined to identify the ridges. The combination R is a simple averaging of the normalised ridge extraction methods

$$R = \frac{|W_s| + \Phi + C}{3}, \qquad (7)$$

where the ridge extraction methods are the maximum magnitude $|W_s|$, stationary phase $\Phi$, and local phase congruency $C$. The peak points $P(b, a)$ of $R(b, a)$ with respect to the dilation scale axis a are found at

$$\frac{\partial R}{\partial a} = 0, \qquad (8)$$

when

$$\frac{\partial^2 R}{\partial a^2} < 0. \qquad (9)$$

An implementation to reorder $P(b, a)$ into data structures of individual ridges is described in Section 3.3. A modified version of stationary phase is described in detail in Smith [38].
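On discrete (scale, time) arrays, Equations (7)–(9) can be realised along the following lines: the three measures are normalised, averaged, and local maxima along the scale axis stand in for the zero first derivative and negative second derivative conditions. The normalisation scheme and the names below are assumptions made for illustration only.

```python
import numpy as np

def normalise(x):
    """Rescale an array into the range 0-1 (a constant array maps to zeros)."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def combine_ridges(magnitude, stationary_phase, phase_congruency):
    """Equation (7): average of the three normalised ridge extraction measures."""
    return (normalise(magnitude) + normalise(stationary_phase) + normalise(phase_congruency)) / 3.0

def scale_peaks(R):
    """Equations (8) and (9): points that are local maxima of R along the scale axis."""
    peaks = np.zeros(R.shape, dtype=bool)
    peaks[1:-1, :] = (R[1:-1, :] > R[:-2, :]) & (R[1:-1, :] > R[2:, :])
    return peaks
```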

    2.2.1. Local Phase Congruency. Local phase congruency is a modified form of

phase congruency used in image processing research for feature detection [54, 55]. Local phase congruency considers the degree of alignment of phase angles over ranges of consecutive scales to indicate the presence of a frequency component in the rhythmic signal.

Due to the Gaussian shaped modulus profiles, a wavelet transform of an isochronous pulse will activate several consecutive scales, with a common phase revolution (Figure 4). Therefore adjacent phase angles which are most in synchrony are indicative of a spectral component. Local phase congruency is therefore proposed as the troughs (local minima) of the first derivative of the phase with respect to scale. These are found by

$$\frac{\partial^2 |\phi_s(b, a)|}{\partial a^2} = 0, \qquad (10)$$

which finds the local extrema. The trough points along the scale a are found by

$$\frac{\partial^3 |\phi_s(b, a)|}{\partial a^3} > 0. \qquad (11)$$

    Normalised local phase congruency is therefore

$$C = 1 - \frac{\partial \phi_s(b, a)}{\partial a}. \qquad (12)$$

The constant m = 0.01 for Equation 6 ensures congruency is only calculated for scales and times where the magnitude is large enough to allow the phase to be well conditioned.
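A rough numpy sketch of a local phase congruency measure in the spirit of Equation (12): the phase gradient across adjacent scales is computed with wrapped differences, inverted so that phase agreement scores highly, and masked where the magnitude falls below the threshold m. The normalisation by π (to keep values in 0-1) and the names are assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

def local_phase_congruency(phase, magnitude, m=0.01):
    """Score how well the phase agrees between adjacent scales at each time point."""
    dphi = np.diff(phase, axis=0)           # phase change from one scale to the next
    dphi = np.angle(np.exp(1j * dphi))      # wrap the differences into (-pi, pi]
    congruency = np.zeros_like(phase)
    congruency[:-1, :] = 1.0 - np.abs(dphi) / np.pi   # 1 where adjacent scales share a phase
    congruency[magnitude < m] = 0.0         # phase is ill-conditioned at low magnitudes
    return congruency
```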

    3. A Multiresolution Time-Frequency Representation of Musical Rhythm

    3.1. Non-Causality

As the time domain plots indicate (Figure 1), the Morlet wavelet is non-causal, running forward and backward in time. A causal system is one that depends on past and current inputs only, not future ones [56]. Non-causality implies the wavelet transformation must be performed on a recorded copy of the entire signal, and so is physically unrealisable in real-time. The wavelet is therefore considered in terms of an ideal theoretical analysis kernel, summarising a number of cognitive processes, rather than one existing in vivo as a listener's peripheral perceptual mechanism. However, it should be noted that Kohonen [57] has presented evidence for the self organisation of Gabor wavelet transforms, so such representations are not impossible to realise with a biological system.

However, there are reasons to entertain the idea that the mechanisms used in the process of rhythm induction are not solely dependent on past information. Mere exposure to rhythms from previous listening has been shown to construct a schema used to aid perception [58]. The use of temporal context for attentional energy has been argued for rhythm by Jones et al. [9, 59], and in terms of pulse sensations, by Parncutt [60]. New rhythms are perceived with respect to previously heard rhythms and are organised and anticipated within the harness of a particular schematisation. In that sense, the perception of a current beat has an expectancy weighting, projecting from the future back towards the present.

A performer will practice a phrase such that each beat is performed within the context of beats imagined and intended, but yet to be performed [61]. Even in improvised music, with the notion of learned riffs, Indian paltras [62] and other improvisation training methods [63], each beat is performed within the context of intended future beats as well as those beats already performed. Given the cross-cultural nature of most musical development, most listeners will share and understand the cultural implications as the phrase develops. They will predict subsequent beats, and draw meaning and emotion from the confirmation of such predictions, as Meyer [64] has argued.

Listeners use their corpus of learned rhythms as a reference point to subsume the new performance. A purely causal model will be limited in its success because it is not taking into account the prediction and retrospection possible of the performance as it proceeds. Gouyon and Dixon [12] illustrate the ambiguity of local vs. global tempo changes and timing changes which can only be resolved by retrospection. Rhythmic periodicities are disambiguated over a span of time which forms a moving window. For computational approaches, this requires providing the representation of a rhythmic schema, which may be considered as abstract structural regularities derived from the music the listener is exposed to.

The non-causal projection of the Morlet wavelet can be viewed as an idealistic aggregation of such predictive memories. Backwards projection of the filter is a model of completion of an implied rhythm. It functions as retrospective assessment of the rhythm, as argued by Jones and Boltz [9] and Huron [65]. Its use does not seek to apportion rhythm perception behaviour between biological and cultural processes. Clearly the Morlet wavelet is an oversimplification of the rhythm perception process. Despite the Morlet wavelet being a theoretic formalism, and being a basis for smooth functions, it has several positive attributes as a wavelet for rhythm analysis.

    3.2. Input Representation

The onset impulses are weighted by a measure of the phenomenal importance of each event. This summarises the influence from all forms of phenomenal accent which impinge upon the perception of the event, not only dynamic accents. This is denoted by $s(t) = c(v)\, \xi(t)$, where $\xi(t)$ is the rhythm function composed of sparse impulse values (0.0–1.0) and c(v) is the normalised phenomenal accent. Minimally, c(v) is the intensity of the amplitude of the onset. This is a simplifying assumption that there is a linear relationship between the perceptual salience of an individual dynamic accent and the intensity of a beat. This ignores the effect of masking of beats by temporal proximity and other non-linearities between intensity and its final perceptual salience. Masking [66], auditory streaming [67] and expectation (for example, from tonal structure and subjective rhythmisation [4]) can be modelled by a hypothetical non-linear transfer function c(v). This would summarise the effect of context on the perceptual impact of the note.

Alternatively, if a frequency representation is used which preserves energy (by Parseval's relation [56]), such perceptual effects can be modelled in the frequency domain. Masking and streaming occur with IOIs shorter than 200 ms, smaller than all but the shortest intervals found in performed rhythms. Therefore the impact of these effects on phenomenal accent is assumed here to be negligible. The sampling rate can be low (200 Hz) as the audible frequencies are not present. The multiple resolution analysis is therefore performed over the frequencies comprising musical rhythm, spanning 0.1 to 100 Hz, but for analysing performances the very shortest

scales (less than 4 samples, 20 milliseconds) do not need to be computed.
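As a concrete illustration of this input representation, the sketch below builds the sparse weighted impulse train at a 200 Hz sampling rate from onset times and their normalised phenomenal accents, under the simplifying assumption stated above that the accent is just the normalised onset amplitude. Function and variable names are illustrative.

```python
import numpy as np

def rhythm_signal(onset_times, accents, sample_rate=200.0):
    """Sparse impulse train: one weighted sample per onset, zero elsewhere."""
    duration = max(onset_times) + 1.0                    # leave a short tail after the last onset
    signal = np.zeros(int(np.ceil(duration * sample_rate)))
    for time, accent in zip(onset_times, accents):
        signal[int(round(time * sample_rate))] = accent  # normalised phenomenal accent, 0.0-1.0
    return signal

# Example: four isochronous beats at a 0.5 s IOI, with the first beat accented.
x = rhythm_signal([0.0, 0.5, 1.0, 1.5], [1.0, 0.6, 0.8, 0.6])
```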

The perception of a polyphonic rhythm (comprising different instruments or sound sources) is assumed to involve segregation into separate simultaneous rhythmic patterns by using common sound features or changes. Where the listener can interpret a rhythm as comprising multiple rhythmic lines, rather than variations in accentuation of a single rhythm, this is assumed to introduce two or more independent rhythms running parallel in time. Furthermore, each is assumed to be analysed separately by parallel processes.

A clearer model of musical time can be constructed in terms of the time-frequency representation of rhythm, rather than strictly in the time domain. The invariance of the Gaussian envelope between time and frequency domains of the Morlet wavelet provides the best simultaneous localisation of change in time and frequency. Other kernels will achieve better resolution in one domain at the expense of the other. Arguably, the Morlet wavelet therefore displays the time-frequency components inherent in a rhythmic signal, prior to the perceptual processes of the listener. Using such wavelets allows quantifying the representative abilities of other multiresolution approaches to rhythm models.

The explicit representation of multiple periodicities implied by the temporal structure of events can be considered as a pulse induction process that forms the pulse percept across an integrating period. The pulse induction process produces simultaneously attendable pulses [9], matching the concept of multiple hypotheses used in beat tracking systems [15, 28, 68], but in the case of the CWT, arising as a direct result of the basis representations (filter impulse responses). This is demonstrated in Section 4. The top-down persistent mental framework process (schema) is then responsible for selection from the time-frequency plane of one or more ridges which constitute the most appropriate pulse to attend to. This is described in Section 3.3.

    3.3. Tactus Determination

As a minimum demonstration of interpretation of a rhythm, the multiresolution model is used to determine the tactus. This tactus is verified by using it to compute a beat track to accompany the original rhythm. The tactus can be considered to function as the carrier in classical frequency modulation (FM) theory. An isochronous beat is a monochromatic rhythmic signal, and the performer's rubato constitutes a frequency modulation of this idealised frequency. In performance, the tactus of a rhythm is modulated but still elastically retains semi-periodic behaviour. A means of extracting the rubato frequency modulation (the ridge) of the tactus is required. The schematic diagram of Figure 2 describes this process.

A process which can be considered a pulse tracker extracts the ridge determined to be the tactus. This tactus ridge is then transformed from the time-frequency plane back to the time domain and its phase measure is used to compute the time points of a beat track.
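One way such tap times could be derived from the phase of the reconstructed tactus signal is sketched below: a tap is emitted each time the unwrapped phase advances by a further full cycle beyond its value at a nominated reference onset (Figure 11 describes phase-locking to the second note in this spirit). This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def tap_times(tactus_phase, reference_index, sample_rate=200.0):
    """Tap whenever the unwrapped tactus phase has advanced by another 2*pi
    beyond its value at the reference onset."""
    unwrapped = np.unwrap(tactus_phase)
    next_target = unwrapped[reference_index]
    taps = []
    for i in range(reference_index, len(unwrapped)):
        if unwrapped[i] >= next_target:
            taps.append(i / sample_rate)
            next_target += 2.0 * np.pi
    return taps
```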

3.3.1. Ridge Extraction. Section 2.2 described the process of combining magnitude and phase measures and identifying the peak points $P(b, a)$ in the time-frequency plane. These peak points are then grouped by time-frequency proximity into a skeleton of ridges. The skeleton of ridges is implemented as a list of Common Lisp objects, each ridge defining its starting location in time and the scale at each subsequent time point. Representing such ridges as objects (equivalently Minskyian frames) aims to establish a knowledge framework that bridges the divide between symbolic and sub-symbolic representations of rhythm.

Figure 2. Schematic diagram of the multiresolution rhythm interpretation system: the rhythmic signal (sparse impulses) is passed through the continuous wavelet transform; its magnitude and phase (time-frequency coefficients) feed the modulus maxima, stationary phase and local phase congruency measures; these are combined into a single ridge measure, from which the tactus is extracted and its phase used to compute the foot-tap time points.

An algorithm to reorder $P(b, a)$ into ridges ϱ(b) = a is shown in Figure 3. With a tolerance t = 1, this extracts ridges with a rate of change of at most one scale per time sample.

Such a skeleton represents alternative accompaniment or attending strategies and is query-able as such. This representation serves as a foundation to develop, in a modular fashion, alternative pulse tracking or schematic processes. This is to allow schemas to be developed which decouple the model of ridge selection (e.g., pulse tracking) from pulse induction. This decoupling seems necessary in order to adequately model schematic knowledge which may contradict the veridical expectancy derived from the surface material. For example, the tactus of a reggae rhythm being at half time to the quaver (eighth note) pulse.

3.3.2. Ridge Selection. The simplest schema process to select a ridge as tactus is a heuristic which selects the longest, lowest frequency ridge that spans the analysis window. In effect this can be considered the rhythmic fundamental frequency. While such an axiom or principle may seem problematically reductionist, the proposed heuristic describes a broad behaviour that applies to the entire rhythm. This represents that there is an inherent coherence in the temporal structure of the entire rhythmic sequence analysed.

Algorithm: Extract-Ridges
1: for b ← 0 … B                              ; Time extent of analysis window.
2:   S_b = {a : P(b, a) > 0}                   ; Those peaks at each time point b.
3:   (S'_b, S'_{b-1}) = Match-Sets(S_b, S_{b-1}) ; Match current peaks and those at the previous time point.
4:   if ϱ_r(b-1) ∈ S'_{b-1}                    ; Update the existing ridges with scales that match
5:     ϱ_r(b) ← S'_b                           ;   those at the previous time point.
6:   ϱ_{r+1}(b) ← S_b − S'_b                   ; Create new ridges for scales that don't match.

The sets (J', K') = Match-Sets(J, K), containing those elements of J and K matching within a tolerance t, are defined as:

Algorithm: Match-Sets
0: Ensure J and K are both sorted in numerical order.
1: n ← 0                                      ; The next unmatched index.
2: for j ∈ J
3:   for k ← n … len(K)                        ; Find where j is located in K.
4:     if K_k > j + t
5:       exit-for
6:     if |j − K_k| ≤ t
7:       J' = {j} ∪ J'                         ; Collect those elements from J that match.
8:       K' = {K_k} ∪ K'                       ; Also collect those elements from K that match.
9:       n ← k + 1
10:      exit-for
11:    n ← k                                   ; Remove all earlier elements from future tests.

Figure 3. Algorithm for the extraction and reordering of peak points in the combined magnitude and phase measures into ridge structures ϱ ordered by time.
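A runnable reading of Figure 3, assuming the peaks are supplied as a list giving, for each time sample b, the sorted scale indices at which P(b, a) exceeded zero. Each ridge is returned as its starting time plus the scale at every subsequent time point, mirroring the data structure described in Section 3.3.1; the Python rendering and the names are illustrative, not the authors' Common Lisp code.

```python
def match_sets(J, K, t=1):
    """Match-Sets: elements of the sorted lists J and K that agree within the tolerance t."""
    J_matched, K_matched = [], []
    n = 0                                   # next unmatched index into K
    for j in J:
        for k in range(n, len(K)):
            if K[k] > j + t:                # every later element of K is larger still
                n = k
                break
            if abs(j - K[k]) <= t:          # collect the matching pair
                J_matched.append(j)
                K_matched.append(K[k])
                n = k + 1
                break
            n = k                           # K[k] is too small for this and any later j
    return J_matched, K_matched

def extract_ridges(peaks, t=1):
    """Extract-Ridges: group per-time peak scales into ridges of (start_time, [scale, ...])."""
    active, finished = [], []
    previous_scales = []
    for b, scales in enumerate(peaks):
        current, previous = match_sets(scales, previous_scales, t)
        still_active = []
        for start, trace in active:
            if trace[-1] in previous:       # the ridge continues with its matched current scale
                trace.append(current[previous.index(trace[-1])])
                still_active.append((start, trace))
            else:                           # otherwise the ridge ends here
                finished.append((start, trace))
        for a in scales:
            if a not in current:            # unmatched peaks start new ridges
                still_active.append((b, [a]))
        active = still_active
        previous_scales = scales
    return finished + active

# Example: two steady periodicities, one of which drifts upwards by one scale step.
skeleton = extract_ridges([[5, 12], [5, 12], [5, 13], [5, 13]])
```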

Indeed, the low frequency components are derived from the contribution of all events analysed. The selection of such a ridge is shown on the phaseograms of Figures 6 to 10.
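Given the ridge skeleton, this heuristic reduces to a short query over the (start_time, scales) structures of the previous sketch: keep the ridges that span the whole analysis window and take the one with the largest (lowest-frequency) scale. The names and the use of the mean scale are illustrative choices.

```python
def select_tactus(ridges, n_times):
    """Heuristic tactus: the longest, lowest-frequency ridge spanning the analysis window."""
    spanning = [r for r in ridges if r[0] == 0 and len(r[1]) == n_times]
    # A larger scale means a longer period, i.e. a lower rhythmic frequency.
    return max(spanning, key=lambda r: sum(r[1]) / len(r[1])) if spanning else None
```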

Other simple schemas establish tempo constraints in choosing a ridge, as Parncutt [60] adopts with 1.38 Hz, and Todd [35] with 2 Hz band-pass filters. More complex schemas can then be introduced in the future and compared to this obviously overly simplistic model. The query-able nature of the skeleton also enables manual selection and testing of ridges for evaluation of their role as modulated rhythmic strata by using reconstruction.

    3.4. Reconstruction of the Tactus Amplitude Modulation

Once the tactus has been extracted from the rhythm, it can be used to compute tap times. When sampling a tactus which undergoes rubato, it is not sufficient to simply sample the instantaneous frequency of the tactus ridge, due to the accumulation of error. Therefore the tactus ridge is transformed from the time-frequency domain into an FM sinusoidal beat track signal in the time domain. Only the tactus ridge itself will contribute to the beat track signal. The sinusoidal nature of the resulting signal enables it to act as an amplitude envelope, that is, as a rhythm frequency that modulates over time. This signal is reconstructed from both the scaleogram and phaseogram coefficients.
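A minimal sketch of this reconstruction step, reusing `morlet_kernel` and the coefficient layout of the earlier CWT sketch: every magnitude outside the tactus ridge is zeroed, the original phase is kept, and Equation (2) is approximated by summing the per-scale convolutions, with the discrete constants (scale spacing, kernel normalisation) absorbed into the calibration constant c_g described below. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def tactus_signal(coeffs, tactus, scales, c_g=1.7):
    """Invert the CWT keeping only the tactus ridge's magnitudes and the original phase."""
    start, ridge_scales = tactus                           # ridge as (start_time, [scale index per time])
    magnitude = np.zeros(coeffs.shape)
    for offset, scale_index in enumerate(ridge_scales):
        b = start + offset
        magnitude[scale_index, b] = np.abs(coeffs[scale_index, b])
    clamped = magnitude * np.exp(1j * np.angle(coeffs))    # modified magnitude, original phase
    signal = np.zeros(coeffs.shape[1], dtype=complex)
    for i, a in enumerate(scales):
        g = morlet_kernel(a)                               # same kernel as the forward transform
        signal += np.convolve(clamped[i], g, mode="same") / a ** 2
    return np.real(signal) / c_g
```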

All scaleogram coefficients other than those of the tactus ridge are clamped to zero, while the original phase is retained. This altered magnitude and the original phase components are converted back to W_s(b, a) coefficients and reconstructed back to a time domain signal, using Equation 2.

The constant c_g = 1.7 was determined by calibrating the original time signal with its reconstruction s(t) to achieve energy conservation. Due to the asymptotic tails of the Gaussian, the reconstruction cannot be perfect, but the reconstruction was determined to still accurately resynthesise the frequency and phase of signals. The real component of the reconstruction

    As(t) =

Figure 4. An isochronous rhythm (fixed IOI of 1.28 seconds) shown in the top graph as a time-amplitude plot of impulses, then represented as a scaleogram and phaseogram (magnitude and phase of the continuous wavelet transform coefficients), respectively. The most activated ridge (candidate for tactus) is centered on the scale corresponding to an interval of 1.28 seconds and is indicated by the horizontal black line on the phaseogram. A lower energy ridge is centered on the interval of 0.64 seconds. This secondary ridge occurs from interactions between secondary lobes of the wavelet.

An analysis of an isochronous rhythm (with a fixed IOI of 1.28 seconds) is shown in Figure 4. The abscissa axis represents time in seconds; the ordinate axis is logarithmic, represented here by the time extent of each wavelet voice, again in seconds. The scale with the highest modulus represents energy density and corresponds to the frequency of the beat. This frequency is the reciprocal of the IOI, as indicated in Figure 4. The timing of the onset intervals between beats will be reflected by the energised scales.

From Figure 4, the relative energy levels of each scale are indicated. In addition to the most highly activated scale corresponding to an IOI of 1.28 seconds, there is a secondary lobe of half amplitude energy at the first harmonic of the beat rate (0.64 seconds). This is caused by coincidence of the half-amplitude second oscillations of the kernels in the time domain (Figure 5, by ω0 = 6.2 of Equation 4). The forward time projection of the nth beat will positively add with the backward time projection of the (n + 1)th beat at the first and second oscillations of the kernel, producing energy at the first and second harmonics of the beat rate. These artifacts arise from the Morlet kernel and are dependent on the ω0 value, more oscillations producing further low energy harmonics. Therefore a very slight signal energy at the third harmonic can be discerned in Figure 4.

While artifactual in nature, these harmonics can be considered as representing a listener's lower propensity to perceive an isochronous rhythm as actually at double the rate of the events. From another perspective, second and third harmonics from respective rhythms at half and one third rates will contribute to the total signal energy measured at a given rhythmic frequency. This effect occurs in the Morlet wavelet as a consequence of the nature of the Gaussian envelope of the Gabor kernel, which is modelling the Heisenberg inequality of time and frequency representation.

Figure 5. Time domain plots of the overlap of the real and imaginary components of Morlet wavelet kernels for beats n and n + 1. These demonstrate the cause of the reduced energy second harmonic in the scaleogram.

This implies that secondary preferences for doubling or, to a lesser extent, tripling a rhythm are inherent in this model of rhythm, rather than learned. Categorisation of rhythmic ratios towards 2:1 in reproduction tasks [4, 71, 72] does indeed show that these ratios are privileged. Musical performance practice and other activities involving doubling a motor behaviour are easily accomplished by humans. That practice is reflected in the ubiquity of duple and triple representations in musical notation of rhythm. It is quite possible that this motor production optimisation is matched by an inherent perceptual process favouring simple subdivisions of time.

    4.2. Non-isochronous Rhythm

The representation of a typical musical rhythm is shown in Figure 6. The resolution of the CWT makes the short term frequencies of the rhythm apparent. Scales are reinforced from time intervals which overlap, and fade where onsets no longer contribute impulsive energy. This produces a set of simultaneous ridges in the scaleogram as the rhythm progresses across time and creates different intersecting intervals. The most highly activated ridges are the ones which receive the most reinforcing contributions from re-occurring intervals. The phase diagram illustrates the presence of periodicities at individual scales by the steady progression of phase (i.e., the spectrum) across time.

The impulse signal is padded with repetitions of the signal up to the next dyadic length to account for edge conditions, so that the last interval is part of the analysed signal. Therefore the entire signal is analysed in the context of it being repeated. This does not compromise the time-frequency resolution of the CWT, since signals are not assumed to be periodic within the analysis window, as is the case for a Fourier representation. An alternative representation is to pad the analysed signal with silence on both ends. This has minimal impact on representations of onsets within the signal, but reduces the contribution of the first and last beats.
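A small sketch of this padding step, assuming the rhythm signal is a one-dimensional numpy array: the signal is tiled and truncated to the next power-of-two length, so the transform effectively analyses the rhythm in the context of its own repetition. The function name is illustrative.

```python
import numpy as np

def pad_to_dyadic(signal):
    """Repeat the rhythm signal up to the next dyadic (power-of-two) length."""
    n = len(signal)
    target = 1 << int(np.ceil(np.log2(n)))
    repeats = int(np.ceil(target / n)) + 1
    return np.tile(signal, repeats)[:target]
```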

    4.3. Dynamic Accents

Figure 7 illustrates an analysis of a rhythm composed of a meter changing from an intensified beat every three beats to an accent every four beats and then returning to every three beats. The IOIs remain equal across the pulse train; only the beat which is intensified is changed. As can be seen, a band of frequency scales corresponding to the interval between accented beats is established during the 3/4 meter period, dips downwards for the 4/4 and returns to the previous scale.

Figure 6. Scaleogram and phaseogram plots of an analysis of the score time of the musical cliché [73]. The multiple periodicities implied by the event times appear as time-frequency ridges. The lowest frequency ridge that extends across the entire analysis window is indicated on the phaseogram by the black line and is visible as energy on the scaleogram.

This demonstrates the zooming of the multiresolution model and its ability to track a short term change in frequency. The phaseogram in Figure 7 indicates congruence over ranges of scales corresponding to the rhythmic band. Additionally, the phase highlights the points of change in the signal, where a frequency (meter) change occurs. The non-causal nature of the convolution operator used in the CWT pinpoints the rhythmic alternation.

    4.4. Timing

A key feature is that this pulse induction process can identify a time modulated (retarding/accelerating) rhythmic pulse. It does not consider pulses as required to be isochronous and therefore can avoid representing local deviations in rhythmic frequency as noise. As a demonstration, two scaleograms of rhythms are displayed in Figures 8 and 9. Both analyse the same number of events, all impulses are equal in amplitude and the size of deviations in expressive timing are also equal. The variations in the grouping, in four (Figure 8) and in three (Figure 9), are made visible by the multiresolution representation. The multiple timescales reflect the different arrangement of the timing groups, and the periodicity of these groups appears as a low energy (but still visible and detectable) ridge. The ability to discriminate quasi-periodicity means that grouping from expressive timing which is not strictly periodic is also able to be represented. However, this timing shift is subtle when supported by timing accentuation alone. Co-occurring dynamic accentuation would make the grouping more apparent to the listener and in the scaleogram.

Figure 7. An isochronous rhythm changing in meter by variations in amplitude. The upper plot shows the impulse amplitudes, with the meter changing from 3/4 to 4/4 over the period of 4.2 to 11.2 seconds. The scaleogram (middle) and phaseogram (lower) plots display a continuous wavelet transform of the rhythmic impulse function. The intensity variations of the impulses are discernable in the scaleogram at short IOI scales, and the time-frequency ridge with the most energy is at 0.35 seconds, matching the IOI. A lower energy ridge is visible on the scaleogram, and more clearly on the phaseogram, changing in its period from 1.05 seconds to 1.4 seconds, matching the duration of the bar. It is marked on the phaseogram as a black line.

Such accentuation is reflected by the CWT as demonstrated in Figure 7.

    4.5. A Performed Rhythm: Greensleeves

Figure 10 demonstrates the CWT applied to a performed version of a well-known rhythm example with multiple IOIs grouped in musically typical proportions. The rhythm timing is obtained from a performance played by tapping the rhythm on a single MIDI drum-pad. The scaleogram and phaseogram indicate the hierarchy of frequencies implied at each time point due to the intervals between notes falling within each scaled kernel's support. Repeated rhythmic figures, such as the shortened semi-quaver (sixteenth note) followed by a crotchet (quarter note), produce energy ridges starting at 1.74, 3.6, 9.07, and 10.91 seconds in the scaleogram. These and other timing characteristics produce temporal structure which is revealed in the lower frequency scales. The phaseogram in Figure 10 indicates higher frequency periodicities and the enduring lower frequency periodicity, with an IOI mostly around 0.91 seconds, nominated in the next section as the tactus. The original rhythm, the computed beat track and the reconstructed phase φ_s(t) of Equation 14 are displayed in Figure 11.

Figure 8. A rhythm that has a repeating IOI pattern of 0.305, 0.375, 0.375, 0.445 seconds. The period of the pattern (1.5 seconds) is shown on the scaleogram and is marked on the phaseogram as a black line.

    5. Conclusion

This paper proposes and demonstrates phase preserving Morlet wavelets as a means of analysing musical rhythm by revealing the rhythmic strata implicit in a rhythmic signal. The transform represents the rhythmic effects generated by dynamic and temporal accents in establishing hierarchies of rhythmic frequencies. Such a hierarchical representation compares with the metrical and grouping structure theory of Lerdahl and Jackendoff [8]. However, it does not require explicit generative rules to produce such an interpretation. Hence the proposed method examines the information contained within a musical rhythm before any perceptual or cognitive processing is performed. This attempts to make explicit the structure inherent in a rhythm signal. It can be viewed as a formalisation of rhythmic decomposition.

In addition, the preservation of energy, and therefore invertibility, of the CWT (Section 2.1) enables cognitive models to be built in the time-frequency domain, as an alternative to purely time domain models. This allows representations to be built that directly address the change over time that musical rhythm undergoes. The degree to which the time-frequency ridges produced by this model match human cognition suggests the degree to which musical rhythm is an example of emergent cognition, arising from mechanisms that attune to features of environmental stimuli.

Nevertheless, the system still has a number of limitations. The heuristic of choosing a tactus on the basis of the lowest frequency, longest extent is obviously simplistic. However, it does demonstrate that the CWT can identify the tactus as a time-frequency ridge. Work is currently underway to improve the tactus selection method to include sensitivity to phase and global tempo. Further work is also

Figure 9. A rhythm that has a repeating IOI pattern of 0.305, 0.375, 0.445 seconds. The slight timing variation differences between this rhythm and that shown in Figure 8 are indicated in the different patterns of the high frequency scales, while a low energy ridge at the lower frequency scales, matching the different periods (1.125 seconds and 1.5 seconds, respectively) of the repeated patterns, is visible on the magnitude diagram and is plotted on the phase diagram.

required to extend the tactus selection to one that is determined by exposure to a musical environment, and to verify this approach with a large test set that uses a wide range of musical rhythms. Finally, the non-causal implementation currently prevents a direct online application, and the frequency resolution used produces a relatively high computational burden.

Despite these current issues, we think the multiresolution analysis model contributes to an understanding of how much information can be obtained from the rhythmic signal itself, including both categorical (temporal structure) and continuous (expressive timing) information. Since it does not use additional top-down modelling, it may serve as a baseline for cognitive models of rhythm perception.

References

[1] Desain, P. and Honing, H., 2004, Final Report NWO-PIONIER Project "Music, Mind, Machine". Technical report X-2004-02, Institute for Logic, Language and Computation, Universiteit van Amsterdam.
[2] Large, E.W. and Jones, M.R., 1999, The Dynamics of Attending: How People Track Time-Varying Events. Psychological Review, 106(1), 119–59.
[3] Honing, H., 2002, Structure and Interpretation of Rhythm and Timing. Tijdschrift voor Muziektheorie (Dutch Journal for Music Theory), 7(3), 227–32.
[4] Fraisse, P., 1982, Rhythm and Tempo. In: D. Deutsch (Ed.) The Psychology of Music (1st edn) (New York: Academic Press), pp. 149–80.
[5] Clarke, E.F., 1999, Rhythm and Timing in Music. In: D. Deutsch (Ed.) The Psychology of Music (2nd edn) (San Diego, CA: Academic Press), chap. 13, pp. 473–500.
[6] London, J., 2001, Rhythm. In: S. Sadie and J. Tyrrell (Eds) The New Grove Dictionary of Music and Musicians (2nd edn) (London, UK: Macmillan), pp. 277–308.
[7] Yeston, M., 1976, The Stratification of Musical Rhythm (New Haven: Yale University Press) (155 p.).
[8] Lerdahl, F. and Jackendoff, R., 1983, A Generative Theory of Tonal Music (Cambridge, Mass.: MIT Press) (368 p.).

Figure 10. Rhythmic impulse function, scaleogram and phaseogram of a performed version of Greensleeves possessing expressive timing modulations from the notated rhythm. The varying amplitudes of each beat are indicated by the density of events at the highest scales on the scaleogram. The tactus of the rhythm (derived using the algorithm described in Section 3.3) is shown as a black line on the phaseogram.

[9] Jones, M.R. and Boltz, M., 1989, Dynamic Attending and Responses to Time. Psychological Review, 96(3), 459–91.
[10] Desain, P. and Honing, H., 1999, Computational Models of Beat Induction: The Rule-Based Approach. Journal of New Music Research, 28(1), 29–42.
[11] Desain, P. and Honing, H., 2001, Modeling the Effect of Meter in Rhythmic Categorization: Preliminary Results. Japanese Journal of Music Perception and Cognition, 7(2), 145–56.
[12] Gouyon, F. and Dixon, S., 2005, A Review of Automatic Rhythm Description Systems. Computer Music Journal, 29(1), 34–54.
[13] Clarke, E.F., 1987, Levels of Structure in the Organization of Musical Time. Contemporary Music Review, 2(1), 211–38.
[14] Longuet-Higgins, H.C. and Lee, C.S., 1982, The Perception of Musical Rhythms. Perception, 11, 115–28.
[15] Rosenthal, D.F., 1992, Machine Rhythm: Computer Emulation of Human Rhythm Perception. PhD thesis, MIT Media Lab, Cambridge, Mass.
[16] Desain, P., 1992, A (De)Composable Theory of Rhythm Perception. Music Perception, 9(4), 439–54.
[17] Palmer, C. and Krumhansl, C.L., 1990, Mental Representations for Musical Meter. Journal of Experimental Psychology: Human Perception and Performance, 16(4), 728–41.
[18] Parncutt, R., 1994, A Model of Beat Induction Accounting for Perceptual Ambiguity by Continuously Variable Parameters. In: Proceedings of the International Computer Music Conference, pp. 83–4.
[19] Chung, J., 1989, An Agency for the Perception of Musical Beats, or, If I Only Had a Foot.... MIT Media Laboratory report, Massachusetts Institute of Technology.
[20] Large, E.W. and Kolen, J.F., 1994, Resonance and the Perception of Musical Meter. Connection Science, 6(2+3), 177–208.
[21] Desain, P. and Honing, H., 1989, The Quantization of Musical Time: A Connectionist Approach. Computer Music Journal, 13(3), 56–66.
[22] Roberts, S.C. and Greenhough, M., 1995, Rhythmic Pattern Processing Using A Self-Organising Neural Network. In: Proceedings of the International Computer Music Conference, pp. 412–9.
[23] Page, M.P., 1994, Modelling the Perception of Musical Sequences with Self-Organizing Neural Networks. Connection Science, 6(2+3), 223–46.
[24] Todd, N.P., 1994, The Auditory Primal Sketch: A Multiscale Model of Rhythmic Grouping. Journal of New Music Research, 23(1), 25–70.

Figure 11. Demonstration of beat tracking to the tactus of the performance of "Greensleeves" analysed in Figure 10. The original time-points of the notes of the rhythm are shown, together with the reconstructed phase s(t) and the computed beat track determined from phase-locking to the second note, the nominated event to begin tapping to. In this diagram the computed beat track is artificially scaled to a value of 1.2, and the phase to (0.0–1.0), to improve readability.
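To make the phase-locking step concrete, here is a hedged illustration rather than the paper's algorithm: assuming the reconstructed phase of the selected tactus ridge is available as a sampled, wrapped signal, a beat is emitted each time the unwrapped phase advances by a full cycle, starting from the phase value at the nominated onset. The function name beat_track, the 0.6 s tactus period and the 0.5 s starting time below are hypothetical.

    # Sketch only: derive beat times from a wrapped phase signal by phase-locking
    # to a nominated starting time (assumed inputs, not the paper's procedure).
    import numpy as np

    def beat_track(phase, dt, start_time):
        """Return beat times locked to the phase value at start_time."""
        unwrapped = np.unwrap(phase)                  # remove the 2*pi discontinuities
        start_idx = int(round(start_time / dt))
        cycles = (unwrapped - unwrapped[start_idx]) / (2.0 * np.pi)
        beats, next_cycle = [], 0.0
        for i in range(start_idx, len(cycles)):
            if cycles[i] >= next_cycle:               # crossed the next whole cycle
                beats.append(i * dt)
                next_cycle += 1.0
        return np.array(beats)

    # Example: a steady 0.6 s tactus period, locking onto the second note at t = 0.5 s.
    dt = 0.01
    t = np.arange(0, 4.0, dt)
    phase = np.angle(np.exp(2j * np.pi * t / 0.6))    # wrapped phase of the tactus ridge
    print(beat_track(phase, dt, start_time=0.5))      # beats at roughly 0.5, 1.1, 1.7, ...

With an expressively timed performance such as Figure 11, the phase would not advance at a constant rate, so the emitted beats would follow the tempo modulations of the selected ridge.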

[25] Scheirer, E.D., 1998, Tempo and Beat Analysis of Acoustic Musical Signals. Journal of the Acoustical Society of America, 103(1), 588–601.
[26] Desain, P. and de Vos, S., 1990, Autocorrelation and the Study of Musical Expression. In: Proceedings of the International Computer Music Conference, pp. 357–360.
[27] Brown, J.C., 1993, Determination of the Meter of Musical Scores by Autocorrelation. Journal of the Acoustical Society of America, 94(4), 1953–7.
[28] Goto, M. and Muraoka, Y., 1995, Music Understanding At The Beat Level: Real-time Beat Tracking For Audio Signals. In: Proceedings of the IJCAI-95 Workshop on Computational Auditory Scene Analysis, pp. 68–75.
[29] Honing, H., 2005, Is there a perception-based alternative to kinematic models of tempo rubato? Music Perception, 23(1), 79–85.
[30] Sethares, W.A. and Staley, T.W., 2001, Meter and Periodicity in Musical Performance. Journal of New Music Research, 30(2), 149–58.
[31] Klapuri, A.P., Eronen, A.J. and Astola, J.T., 2006, Analysis of the Meter of Acoustic Musical Signals. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 342–55.
[32] Dixon, S., 2001, Automatic Extraction of Tempo and Beat from Expressive Performances. Journal of New Music Research, 30(1), 39–58.
[33] Cemgil, A.T., Kappen, B., Desain, P. and Honing, H., 2000, On tempo tracking: Tempogram Representation and Kalman filtering. Journal of New Music Research, 29(4), 259–73.
[34] Marr, D. and Hildreth, E., 1980, Theory of Edge Detection. Proceedings of the Royal Society of London B, 207, 187–217.
[35] Todd, N.P.M., O'Boyle, D.J. and Lee, C.S., 1999, A Sensory-Motor Theory of Rhythm, Time Perception and Beat Induction. Journal of New Music Research, 28(1), 5–28.
[36] Smith, L.M., 1996, Modelling Rhythm Perception by Continuous Time-Frequency Analysis. In: Proceedings of the International Computer Music Conference, pp. 392–5.
[37] Smith, L.M. and Kovesi, P., 1996, A Continuous Time-Frequency Approach To Representing Rhythmic Strata. In: Proceedings of the Fourth International Conference on Music Perception and Cognition, August, Montreal, Quebec, pp. 197–202.
[38] Smith, L.M., 1999, A Multiresolution Time-Frequency Analysis and Interpretation of Musical Rhythm. PhD thesis, Department of Computer Science, University of Western Australia.
[39] Rioul, O. and Vetterli, M., 1991, Wavelets and Signal Processing. IEEE Signal Processing Magazine, 8(4), 14–38.

[40] Daubechies, I., 1992 Ten Lectures on Wavelets (Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics) (357p.).
[41] Grossmann, A. and Morlet, J., 1984, Decomposition of Hardy Functions into Square Integrable Wavelets of Constant Shape. SIAM Journal on Mathematical Analysis, 15, 723–36.
[42] Mallat, S., 1998 A Wavelet Tour of Signal Processing (Academic Press) (577p.).
[43] Kronland-Martinet, R. and Grossmann, A., 1991, Application of Time-Frequency and Time-Scale Methods (Wavelet Transforms) to the Analysis, Synthesis, and Transformation of Natural Sounds. In: G.D. Poli, A. Piccialli and C. Roads (Eds) Representations of Musical Signals (Cambridge, Mass: MIT Press), pp. 45–85.
[44] Holschneider, M., 1995 Wavelets: An Analysis Tool (Clarendon Press) (423p.).
[45] Vetterli, M. and Herley, C., 1992, Wavelets and Filter Banks: Theory and Design. IEEE Transactions on Signal Processing, 40(9), 2207–32.
[46] Boashash, B., 1990, Time-Frequency Signal Analysis. In: S. Haykin (Ed.) Advances in Spectrum Analysis and Array Processing (Englewood Cliffs, NJ: Prentice-Hall), pp. 418–517.
[47] Gabor, D., 1946, Theory of Communication. IEE Proceedings, 93(3), 429–57.
[48] Kronland-Martinet, R., Morlet, J. and Grossmann, A., 1987, Analysis of Sound Patterns through Wavelet Transforms. International Journal of Pattern Recognition and Artificial Intelligence, 1(2), 273–302.
[49] Kronland-Martinet, R., 1988, The Wavelet Transform for Analysis, Synthesis and Processing of Speech and Music Sounds. Computer Music Journal, 12(4), 11–20 (sound examples on soundsheet with 13(1), 1989).
[50] Grossmann, A., Kronland-Martinet, R. and Morlet, J., 1989, Reading and Understanding Continuous Wavelet Transforms. In: J. Combes, A. Grossmann and P. Tchamitchian (Eds) Wavelets (Berlin: Springer-Verlag), pp. 2–20.
[51] Tchamitchian, P. and Torresani, B., 1992, Ridge and Skeleton Extraction From The Wavelet Transform. In: M.B. Ruskai (Ed.) Wavelets and Their Applications (Boston, Mass.: Jones and Bartlett Publishers), pp. 123–51.
[52] Delprat, N., Escudie, B., Guillemain, P., Kronland-Martinet, R., Tchamitchian, P. and Torresani, B., 1992, Asymptotic Wavelet and Gabor Analysis: Extraction of Instantaneous Frequencies. IEEE Transactions on Information Theory, 38(2), 644–64.
[53] Guillemain, P. and Kronland-Martinet, R., 1996, Characterization of Acoustic Signals Through Continuous Linear Time-Frequency Representations. Proceedings of the IEEE, 84(4), 561–85.
[54] Morrone, M.C. and Owens, R.A., 1987, Feature Detection From Local Energy. Pattern Recognition Letters, 6, 303–313.
[55] Kovesi, P., 1995, Image Features From Phase Congruency. Technical report 95/4, Computer Science Department, University of Western Australia.
[56] Proakis, J.G. and Manolakis, D.G., 1992 Digital Signal Processing: Principles, Algorithms and Applications (2nd edn) (New York: Macmillan Publishing Co.) (969p).
[57] Kohonen, T., 1995, Emergence of Invariant Feature Detectors in Self-Organization. In: M. Palaniswami, Y. Attikiouzel, R.J. Marks II, D. Fogel and T. Fukuda (Eds) Computational Intelligence: A Dynamic System Perspective (New York: IEEE Press), chap. 2, pp. 17–31.
[58] Honing, H. and Ladinig, O., 2006, The Effect Of Exposure And Expertise On Timing Judgments: Preliminary Results. In: M. Baroni, A.R. Addessi, R. Caterina and M. Costa (Eds) Proceedings of the 9th International Conference on Music Perception and Cognition (ICMPC), Bologna, Italy, pp. 80–5.
[59] Jones, M.R., 1986, Attentional Rhythmicity in Human Perception. In: J.R. Evans and M. Clynes (Eds) Rhythm in Psychological, Linguistic, and Musical Processes (Springfield, Ill.: Charles Thomas Publishers), chap. 2, pp. 13–40.
[60] Parncutt, R., 1994, A Perceptual Model of Pulse Salience and Metrical Accent in Musical Rhythms. Music Perception, 11(4), 409–64.
[61] Todd, N.P., 1985, A Model of Expressive Timing in Tonal Music. Music Perception, 3(1), 33–58.
[62] Sorrell, N. and Narayan, R., 1980 Indian Music In Performance: A Practical Introduction (Manchester University Press) (190p.).
[63] Bailey, D., 1992 Improvisation: Its Nature And Practice In Music (2nd edn) (New York: Da Capo Press) (146p.).
[64] Meyer, L.B., 1956 Emotion and Meaning in Music (University of Chicago Press) (307p.).
[65] Huron, D., 2006 Sweet Anticipation: Music and the Psychology of Expectation (Cambridge, Mass: MIT Press).
[66] Massaro, D.W., 1972, Preperceptual Images, Processing Time, and Perceptual Units in Auditory Perception. Psychological Review, 79(2), 124–45.
[67] Bregman, A.S., 1990 Auditory Scene Analysis (Cambridge, Mass: MIT Press) (773p.).
[68] Allen, P.E. and Dannenberg, R.B., 1990, Tracking Musical Beats in Real Time. In: Proceedings of the International Computer Music Conference, pp. 140–3.
[69] Grossmann, A., Holschneider, M., Kronland-Martinet, R. and Morlet, J., 1987, Detection of Abrupt Changes in Sound Signals with the Help of Wavelet Transforms. In: P.C. Sabatier (Ed.) Inverse Problems: An Interdisciplinary Study; Advances in Electronics and Electron Physics, Supplement 19 (New York: Academic Press), pp. 289–306.
[70] Solbach, L., Wohrmann, R. and Kliewer, J., 1995, The Complex-Valued Continuous Wavelet Transform as a Preprocessor for Auditory Scene Analysis. In: Working Notes of the Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence, August, Montreal, Canada.
[71] Handel, S., 1989 Listening: An Introduction To The Perception Of Auditory Events (Cambridge, Mass: MIT Press) (597p.).
[72] Povel, D.J., 1981, Internal Representation of Simple Temporal Patterns. Journal of Experimental Psychology: Human Perception and Performance, 7(1), 3–18.

[73] Longuet-Higgins, H.C., 1979, The Perception of Music. Proceedings of the Royal Society of London, 205, 307–22 (reprinted in [74]).
[74] Longuet-Higgins, H.C., 1987 Mental Processes: Studies in Cognitive Science (Cambridge, Mass: MIT Press) (424p.).