
Chung Hua University (中華大學)

Master's Thesis

Title: 應用頻譜及倒頻譜特徵之調變頻譜分析於音樂風格之自動分類

Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features

Department: Master's Program, Department of Computer Science and Information Engineering (資訊工程學系碩士班)

Student ID and Name: M09502029 林懷三

Advisor: Dr. 李建興

Abstract (Chinese)

This thesis proposes using modulation spectral analysis to observe long-term feature variations and to extract contrast features from them. First, a feature vector representing each frame is extracted over the whole song (the per-frame features used in this thesis are MFCC, OSC, and MPEG-7 NASE). Modulation spectrum analysis is then used to analyze the variation of the features across frames; the energy of each modulation band is obtained by partitioning over different modulation bands, and the contrast feature of each band is extracted. In the experiments, after a test music signal is input and the required features are extracted, linear normalization and LDA dimensionality reduction are applied, and the Euclidean distance between the test signal and each music genre is computed; the genre with the shortest distance is taken as the classification result. The experimental results clearly show that the features extracted by modulation spectral analysis outperform the traditional approach of using the mean and standard deviation vectors of all frames as features; the best classification accuracy is 85.32%.


Abstract

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. Since a typical music database often contains millions of music tracks, it is very difficult to manage such a large music database; it is therefore helpful in managing a vast amount of music tracks when they are properly categorized. Accordingly, a novel feature set derived from modulation spectrum analysis of Mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and normalized audio spectral envelope (NASE) is proposed for music genre classification. The extracted features derived from modulation spectrum analysis can capture the time-varying behavior of music signals. The experimental results show that the feature vector derived from modulation spectrum analysis achieves better performance than that obtained by taking the mean and standard deviation operations. In addition, applying statistical analysis to the feature values of the modulation subbands can reduce the feature dimension efficiently. The classification accuracy can be further improved by using linear discriminant analysis (LDA), while the feature dimension is reduced.


Acknowledgments

During my graduate studies I gained some understanding of the field of audio signal processing; more importantly, I learned the attitude and methods of research: perseverance, determination, and a truth-seeking spirit. The person who led me to this deeper understanding is my advisor, Prof. 李建興. From him I learned the research process and how to look at problems, and I experienced the joy of the moment a bottleneck is broken through. In the writing of this thesis, I sincerely thank him for reading and revising it many times so that it could be completed smoothly. I am also deeply grateful for his help in proofreading my journal submission late into the night; his hard work is engraved in my heart. I also thank Prof. 連振昌, Prof. 韓欽銓, Prof. 石昭玲, and Prof. 周智勳 for their guidance in coursework and presentations; to them I offer my deepest thanks.

I would also like to especially thank 吳翠霞 and Lilian; without their teaching and help I would not be who I am today. I also thank my seniors 忠茂, 炳佑, 清乾, 正達, 昭偉, 建程, 家銘, and 靈逸 for their guidance, my classmates 銘輝, 岳岷, 佐民, 雅麟, and 佑維 for their mutual support and help, and my juniors 勝斌, 正崙, 偉欣, 明修, 信吉, 琮瑋, 仁政, 蘇峻, 雅婷, 佩蓉, 永坤, 致娟, and 堯文 for their company; whether in studies or in daily life, these are unforgettable memories of a wonderful research life.

Finally, I thank the mentors of my life: my father 林文鈴, who quietly supported me, my mother 韓樹珍, who cared for me attentively, and my brother 懷志, who grew up with me. Under the pressure of research, your encouragement and support enabled me to keep going. I dedicate this thesis to you with my deepest gratitude.


CONTENTS

ABSTRACT .......... II
CONTENTS .......... IV
CHAPTER 1 INTRODUCTION .......... 1
  1.1 Motivation .......... 1
  1.2 Review of music genre classification systems .......... 2
    1.2.1 Feature Extraction .......... 5
      1.2.1.1 Short-term features .......... 5
        1.2.1.1.1 Timbral features .......... 5
        1.2.1.1.2 Rhythmic features .......... 7
        1.2.1.1.3 Pitch features .......... 7
      1.2.1.2 Long-term features .......... 8
        1.2.1.2.1 Mean and standard deviation .......... 8
        1.2.1.2.2 Autoregressive model .......... 9
        1.2.1.2.3 Modulation spectrum analysis .......... 9
    1.2.2 Linear discriminant analysis (LDA) .......... 10
    1.2.3 Feature Classifier .......... 10
  1.3 Outline of Thesis .......... 13
CHAPTER 2 THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM .......... 13
  2.1 Feature Extraction .......... 14
    2.1.1 Mel-Frequency Cepstral Coefficients (MFCC) .......... 14
    2.1.2 Octave-based Spectral Contrast (OSC) .......... 17
    2.1.3 Normalized Audio Spectral Envelope (NASE) .......... 19
    2.1.4 Modulation Spectral Analysis .......... 23
      2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC) .......... 23
      2.1.4.2 Modulation Spectral Contrast of OSC (MOSC) .......... 25
      2.1.4.3 Modulation Spectral Contrast of NASE (MASE) .......... 27
    2.1.5 Statistical Aggregation of Modulation Spectral Feature Values .......... 30
      2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC) .......... 30
      2.1.5.2 Statistical Aggregation of MOSC (SMOSC) .......... 32
      2.1.5.3 Statistical Aggregation of MASE (SMASE) .......... 33
    2.1.6 Feature vector normalization .......... 35
  2.2 Linear discriminant analysis .......... 36
  2.3 Music Genre Classification phase .......... 38
CHAPTER 3 EXPERIMENT RESULTS .......... 39
  3.1 Comparison of row-based modulation spectral feature vectors .......... 40
  3.2 Comparison of column-based modulation spectral feature vectors .......... 42
  3.3 Combination of row-based and column-based modulation spectral feature vectors .......... 45
CHAPTER 4 CONCLUSION .......... 48
REFERENCES .......... 49


Chapter 1

Introduction

1.1 Motivation

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks; hence, it is very difficult to manage such a large digital music database. For this reason, it is helpful in managing a vast amount of music tracks when they are properly categorized in advance. In general, retail or online music stores organize their collections of music tracks by categories such as genre, artist, and album. Usually the category information of a music track is labeled manually by experienced managers, but determining the music genre of a music track manually is laborious and time-consuming work. Therefore, a number of supervised classification techniques have been developed for automatic classification of unlabeled music tracks [1-11]. Thus, this study focuses on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification therefore plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies have examined different sets of classifiers to improve the classification performance; however, the improvement is limited. In fact, employing effective feature sets has a much greater effect on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy that music pieces will be classified into. However, it is hard to clearly define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) People often prefer to search music by browsing hierarchical catalogs. (2) Taxonomy structures identify the relationships or dependences between the music genres; thus, hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves the classification efficiency and accuracy. (3) The classification errors become more acceptable by using a taxonomy than by direct music genre classification, since the coarse-to-fine approach can make the classification errors concentrate on a given level of the hierarchy.

Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and different features can be employed at each branch point of the taxonomy, and the hierarchical classification approach allows the managers to trace at which level the classification errors occur frequently. Barbedo and Lopes [14] also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed bottom-up instead of top-down, because it is easy to merge leaf classes into the same parent class in the bottom-up structure; the upper layers can therefore be constructed easily. In their experimental results, the classification accuracy of the hierarchical bottom-up approach outperforms the top-down approach by about 3%-5%.

Li and Ogihara [15] investigated the effect of two different taxonomy structures for music genre classification. They also proposed an approach to the automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach can reduce the time-consuming and expensive task of constructing taxonomies manually; it also helps for music collections in which there are no natural taxonomies [16]. Given a genre taxonomy, many different approaches have been proposed to classify the music genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy


Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system


1.2.1 Feature Extraction

1211 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, covering timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentations or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows:

(1) Low-Energy Feature: defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

$$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\operatorname{sign}(x_t[n]) - \operatorname{sign}(x_t[n-1])\right|$$

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

$$C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth measures the frequency bandwidth of the signal:

$$SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency bin R_t below which 85% of the magnitude distribution is concentrated:

$$\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]$$

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

$$SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2$$

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone: the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately, and can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Each ASE coefficient is then normalized by the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.
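Under the definitions above, the scalar timbral features (2)-(6) can be sketched in NumPy. This is an illustrative implementation, not code from the thesis; the function name and the (T, N) frame layout are assumptions:

```python
import numpy as np

def timbral_features(frames, rolloff=0.85):
    """Frame-level timbral features (2)-(6): ZCR, spectral centroid,
    bandwidth, roll-off bin, and flux. `frames` is a (T, N) array of
    time-domain frames."""
    # sign() as defined in the text: 1 for positive input, 0 otherwise
    sgn = (frames > 0).astype(int)
    zcr = 0.5 * np.abs(np.diff(sgn, axis=1)).sum(axis=1)

    mag = np.abs(np.fft.rfft(frames, axis=1))            # M_t[n]
    n = np.arange(1, mag.shape[1] + 1)
    centroid = (n * mag).sum(axis=1) / mag.sum(axis=1)   # C_t
    bandwidth = (((n - centroid[:, None]) ** 2) * mag).sum(axis=1) / mag.sum(axis=1)

    # smallest bin R_t whose cumulative magnitude reaches 85% of the total
    cum = np.cumsum(mag, axis=1)
    roll = (cum >= rolloff * cum[:, -1:]).argmax(axis=1)

    # flux between successive L2-normalized spectra; the first frame has
    # no predecessor, so its flux is set to 0
    norm = mag / np.linalg.norm(mag, axis=1, keepdims=True)
    flux = np.concatenate([[0.0], (np.diff(norm, axis=0) ** 2).sum(axis=1)])
    return zcr, centroid, bandwidth, roll, flux
```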

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and sub-beats, and the relative strength of sub-beats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers; the main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find a representative feature vector for a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, the autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most widely used method to integrate short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1$$

$$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d]-\mu[d]\right)^2\right]^{1/2}, \quad 0 \le d \le D-1$$

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
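A minimal NumPy sketch of this aggregation (the helper name is illustrative); note the population (1/T) form of the standard deviation, matching the formula above:

```python
import numpy as np

def aggregate_mean_std(X):
    """Integrate a (T, D) matrix of short-term feature vectors into one
    2D-dimensional long-term vector [mu; sigma]."""
    mu = X.mean(axis=0)    # mu[d] = (1/T) * sum_i x_i[d]
    sigma = X.std(axis=0)  # NumPy's default ddof=0 matches the 1/T form
    return np.concatenate([mu, sigma])
```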

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used an AR model to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model; the extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by one MAR model. The difference between the MAR model and the AR model is that MAR considers the relationships between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the dimension of a short-term feature vector.

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition; it has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification, and showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.
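The underlying idea, an FFT taken along the frame (time) axis of each feature trajectory, can be sketched as follows. This is only an illustration of the concept, not the thesis's implementation; the helper name and the frame rate are assumptions:

```python
import numpy as np

def modulation_spectrum(X, frame_rate=100.0):
    """Modulation spectrum of short-term feature trajectories.
    X is (T, D): T frames, D features per frame. Returns the modulation
    magnitudes (one column per feature) and the modulation frequencies
    in Hz."""
    T = X.shape[0]
    # FFT along the frame axis; remove each feature's DC component first
    M = np.abs(np.fft.rfft(X - X.mean(axis=0), axis=0))
    mod_freqs = np.fft.rfftfreq(T, d=1.0 / frame_rate)
    return M, mod_freqs
```

With a trajectory sampled at 100 frames/s that contains a 4 Hz amplitude modulation, the magnitude peaks at the 4 Hz modulation bin, the region cited above as most salient to human audition.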

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data; the means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes, and the optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution; in fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not consider class-wise differences.
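As an illustration of the role LDA plays here, the following sketch uses scikit-learn's `LinearDiscriminantAnalysis` to project synthetic feature vectors down to d = 2 dimensions (at most n_classes − 1, as noted above). The data and parameters are invented for the example:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Three well-separated synthetic "genres", 50 samples each, 10-D features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(50, 10)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

# Project to d = 2 dimensions (d <= n_classes - 1) while separating classes
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)
print(Z.shape)
```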

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. Within Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet; within Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote decides the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree classifier, of a Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and decision tree classifier achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system in which some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames that cannot be correctly classified, and the GMM model of each music genre is updated for each correctly classified frame. Moreover, another GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-Hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and to extract features from the high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are then used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT obtained by recursively convolving the input signal with a pair of low-pass and high-pass filters; unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, the conclusion is given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.3. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) and cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal; the detailed steps are given below.

Step 1: Pre-emphasis

$$\hat{s}[n] = s[n] - a \times s[n-1] \quad (1)$$

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

$$\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n \le N-1 \quad (2)$$

where the Hamming window function w[n] is defined as

$$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \quad (3)$$


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1 \quad (4)$$

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (5)$$

where B is the total number of filters (B is 25 in this study), and I_{b,l} and I_{b,h} denote, respectively, the low-frequency index and the high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

$$I_{b,l} = \frac{f_{b,l}}{f_s}\, N, \quad I_{b,h} = \frac{f_{b,h}}{f_s}\, N \quad (6)$$

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and the high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC are obtained by applying the DCT to the logarithm of E(b):

$$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\left(1 + E_i(b)\right) \cos\left(\frac{\pi\, l\,(b + 0.5)}{B}\right), \quad 0 \le l < L \quad (7)$$

where L is the length of the MFCC feature vector (L is 20 in this study).


Therefore, the MFCC feature vector can be represented as follows:

xMFCC = [MFCC(0), MFCC(1), ..., MFCC(L-1)]^T (8)

Fig. 2.1 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number | Frequency interval (Hz)
0  | (0, 200]
1  | (100, 300]
2  | (200, 400]
3  | (300, 500]
4  | (400, 600]
5  | (500, 700]
6  | (600, 800]
7  | (700, 900]
8  | (800, 1000]
9  | (900, 1149]
10 | (1000, 1320]
11 | (1149, 1516]
12 | (1320, 1741]
13 | (1516, 2000]
14 | (1741, 2297]
15 | (2000, 2639]
16 | (2297, 3031]
17 | (2639, 3482]
18 | (3031, 4000]
19 | (3482, 4595]
20 | (4000, 5278]
21 | (4595, 6063]
22 | (5278, 6964]
23 | (6063, 8000]
24 | (6964, 9190]
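Steps 4-6 above can be sketched for a single frame as follows. This is a simplified illustration: the bands are summed over plain index ranges (Eq. (5)) rather than triangularly weighted, and the `band_edges` argument standing in for Table 2.1 is an assumption:

```python
import numpy as np

def mfcc_frame(frame, band_edges, L=20):
    """One-frame sketch of Steps 4-6: FFT energy in B subbands,
    log compression, then the DCT of Eq. (7). `band_edges` is a list of
    (low_bin, high_bin) index pairs, inclusive on both ends."""
    A = np.abs(np.fft.fft(frame)) ** 2                   # A_i[k] = |X_i[k]|^2
    E = np.array([A[lo:hi + 1].sum() for lo, hi in band_edges])
    B = len(E)
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    # MFCC(l) = sum_b log10(1 + E(b)) * cos(pi * l * (b + 0.5) / B)
    return (np.log10(1.0 + E)[None, :] * np.cos(np.pi * l * (b + 0.5) / B)).sum(axis=1)
```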

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to the non-harmonic components or noise in music signals; therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (9)$$

where B is the number of subbands, and I_{b,l} and I_{b,h} denote, respectively, the low-frequency index and the high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

$$I_{b,l} = \frac{f_{b,l}}{f_s}\, N, \quad I_{b,h} = \frac{f_{b,h}}{f_s}\, N \quad (10)$$

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and the high frequency of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in a decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows


Peak(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i} \right)   (11)

Valley(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1} \right)   (12)

where α is a neighborhood factor (α is 0.2 in this study) The spectral

contrast is given by the difference between the spectral peak and the spectral

valley

SC(b) = Peak(b) - Valley(b)   (13)

The feature vector of an audio frame consists of the spectral contrasts and the

spectral valleys of all subbands Thus the OSC feature vector of an audio frame can

be represented as follows

x_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T   (14)
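The OSC steps above can be sketched as follows. This is a minimal NumPy sketch assuming the Table 22 octave edges (the degenerate DC band [0, 0] is skipped for simplicity); the helper name `osc_frame` is illustrative.

```python
import numpy as np

# Octave-scale subband edges in Hz, following Table 2.2 (44.1 kHz sampling)
OCTAVE_EDGES = [0, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 22050]

def osc_frame(frame, sr=44100, alpha=0.2):
    """Per-frame OSC: valley and peak-valley contrast in each octave subband."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    valleys, contrasts = [], []
    for lo, hi in zip(OCTAVE_EDGES[:-1], OCTAVE_EDGES[1:]):
        band = np.sort(mag[(freqs > lo) & (freqs <= hi)])[::-1]  # decreasing order
        n = max(1, int(round(alpha * len(band))))
        peak = np.log(band[:n].mean() + 1e-10)    # mean of the alpha*Nb largest bins
        valley = np.log(band[-n:].mean() + 1e-10)  # mean of the alpha*Nb smallest bins
        valleys.append(valley)
        contrasts.append(peak - valley)            # Eq. (13): SC = Peak - Valley
    return np.array(valleys + contrasts)  # [Valley(0..B-1), SC(0..B-1)]

frame = np.random.default_rng(0).standard_normal(1024)
feat = osc_frame(frame)
```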

Fig 22 The flowchart for computing OSC (Input Signal → Framing → FFT → Octave scale filtering → Peak/Valley Selection → Spectral Contrast → OSC)


Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0   [0, 0]
1   (0, 100]
2   (100, 200]
3   (200, 400]
4   (400, 800]
5   (800, 1600]
6   (1600, 3200]
7   (3200, 6400]
8   (6400, 12800]
9   (12800, 22050)

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification The NASE descriptor

provides a representation of the power spectrum of each audio frame Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband Fig 23 shows the block diagram for extracting the

NASE feature For a given music piece the main steps for computing NASE are

described as follows

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames and each audio frame is multiplied by a Hamming window function

and analyzed using FFT to derive its spectrum, notated X(k), 1 \le k \le N, where N is the size of FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k)

P(k) = \begin{cases} \dfrac{1}{N E_w} |X(k)|^2, & k = 0, \ N/2 \\ \dfrac{2}{N E_w} |X(k)|^2, & 0 < k < N/2 \end{cases}   (15)


where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w-1} |w(n)|^2   (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig 24). The NASE scale filtering operation can be described as follows (see Table 23)

ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B, \ 0 \le k \le N/2 - 1   (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves}, \quad -4 \le j \le 3   (18)

I_{b_l} and I_{b_h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_{b_l} = (f_{b_l} / f_s) N, \qquad I_{b_h} = (f_{b_h} / f_s) N   (19)

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power


spectrum coefficients within this subband

ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1   (20)

Each ASE coefficient is then converted to the decibel scale

ASE_{dB}(b) = 10 \log_{10}(ASE(b)), \quad 0 \le b \le B+1   (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1   (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1} (ASE_{dB}(b))^2}   (23)

In MPEG-7 the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension of NASE is B+3. Thus the NASE feature vector of an audio frame can be represented as follows

x_{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T   (24)
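The NASE computation of Eqs. (15)-(24) can be sketched as follows. This is a simplified NumPy sketch in which the band edges are regenerated from loEdge = 62.5 Hz with r = 1/2 rather than taken verbatim from Table 23, and `nase_frame` is an illustrative helper name.

```python
import numpy as np

def nase_frame(frame, sr=44100, lo_edge=62.5, r=0.5):
    """Per-frame NASE: normalized power spectrum summed in log-spaced subbands."""
    w = np.hamming(len(frame))
    Ew = np.sum(w ** 2)                       # Eq. (16): window energy
    X = np.fft.rfft(frame * w)
    P = (np.abs(X) ** 2) / (len(frame) * Ew)  # Eq. (15)
    P[1:-1] *= 2.0                            # double all bins except DC and Nyquist
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    n_bands = int(8 / r)                      # B = 8/r bands over 8 octaves
    edges = lo_edge * 2.0 ** (r * np.arange(n_bands + 1))
    # one band below loEdge, B log-spaced bands, one band above hiEdge
    bounds = [0.0] + list(edges) + [sr / 2.0]
    ase = np.array([P[(freqs > lo) & (freqs <= hi)].sum() + 1e-12
                    for lo, hi in zip(bounds[:-1], bounds[1:])])
    ase_db = 10.0 * np.log10(ase)             # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))          # Eq. (23): RMS-norm gain value
    return np.concatenate(([R], ase_db / R))  # Eq. (24): [R, NASE(0..B+1)]

frame = np.random.default_rng(1).standard_normal(1024)
feat = nase_frame(frame)
```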

23

Fig 23 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 coefficients in the log-spaced bands between 62.5 Hz and 16 kHz, and one coefficient above hiEdge = 16 kHz)


Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0    (0, 62]
1    (62, 88]
2    (88, 125]
3    (125, 176]
4    (176, 250]
5    (250, 353]
6    (353, 500]
7    (500, 707]
8    (707, 1000]
9    (1000, 1414]
10   (1414, 2000]
11   (2000, 2828]
12   (2828, 4000]
13   (4000, 5656]
14   (5656, 8000]
15   (8000, 11313]
16   (11313, 16000]
17   (16000, 22050]

214 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis


Let MFCC_i(l), 0 \le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W/2 + n}(l) \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \ 0 \le l < L   (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \ 0 \le l < L   (26)

where T is the total number of texture windows in the music track
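The texture-window analysis of Eqs. (25)-(26) (FFT along each feature trajectory, then time-averaging the magnitudes) can be sketched as follows; `modulation_spectrogram` is an illustrative helper name.

```python
import numpy as np

def modulation_spectrogram(features, W=512, hop=None):
    """features: (n_frames, L) per-frame feature trajectories.
    Returns the time-averaged magnitude modulation spectrogram, shape (W//2+1, L)."""
    if hop is None:
        hop = W // 2  # 50% overlap between successive texture windows
    n_frames, L = features.shape
    windows = []
    for start in range(0, n_frames - W + 1, hop):
        seg = features[start:start + W]                   # one texture window
        windows.append(np.abs(np.fft.rfft(seg, axis=0)))  # FFT along time, Eq. (25)
    return np.mean(windows, axis=0)                       # Eq. (26): average over t

traj = np.random.default_rng(2).standard_normal((2048, 20))  # toy MFCC trajectories
M = modulation_spectrogram(traj)
```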

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^{MFCC}(j, l) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{MFCC}(m, l)   (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{MFCC}(m, l)   (28)

where \Phi_{j_l} and \Phi_{j_h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)   (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
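The peak/valley search of Eqs. (27)-(29) over the Table 24 modulation subbands can be sketched as:

```python
import numpy as np

# Modulation-frequency index ranges from Table 2.4 (J = 8 subbands)
SUBBAND_EDGES = [0, 2, 4, 8, 16, 32, 64, 128, 256]

def msc_msv(M):
    """M: averaged modulation spectrogram, shape (n_mod_bins, L).
    Returns the J x L MSC and MSV matrices."""
    J, L = len(SUBBAND_EDGES) - 1, M.shape[1]
    msp = np.empty((J, L))
    msv = np.empty((J, L))
    for j in range(J):
        band = M[SUBBAND_EDGES[j]:SUBBAND_EDGES[j + 1]]
        msp[j] = band.max(axis=0)  # Eq. (27): modulation spectral peak
        msv[j] = band.min(axis=0)  # Eq. (28): modulation spectral valley
    return msp - msv, msv          # Eq. (29): MSC = MSP - MSV

M = np.abs(np.random.default_rng(3).standard_normal((256, 20)))
msc, msv = msc_msv(M)
```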

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i(d), 0 \le d < D, be the d-th OSC of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W/2 + n}(d) \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \ 0 \le d < D   (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \ 0 \le d < D   (31)

(31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated


MSP^{OSC}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{OSC}(m, d)   (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{OSC}(m, d)   (33)

where \Phi_{j_l} and \Phi_{j_h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)   (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i(d), 0 \le d < D, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W/2 + n}(d) \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \ 0 \le d < D   (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \ 0 \le d < D   (36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands (see Table 24)


In the study the number of modulation subbands is 8 (J = 8) The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

MSP^{NASE}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{NASE}(m, d)   (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{NASE}(m, d)   (38)

where \Phi_{j_l} and \Phi_{j_h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)   (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.


Fig 27 The flowchart for extracting MASE (music signal s_i[n] → Framing → NASE extraction NASE_i[d] → DFT along each feature trajectory M_{t,d}[m] → Windowing/Average of the modulation spectra → Contrast/Valley Determination)

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0   [0, 2)      [0, 0.33)
1   [2, 4)      [0.33, 0.66)
2   [4, 8)      [0.66, 1.32)
3   [8, 16)     [1.32, 2.64)
4   [16, 32)    [2.64, 5.28)
5   [32, 64)    [5.28, 10.56)
6   [64, 128)   [10.56, 21.12)
7   [128, 256)  [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral

feature value of variant modulation frequency which reflects the beat interval of a

music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to

the same modulation subband of different spectralcepstral feature values(See Fig 29)

To reduce the dimension of the feature space the mean and standard deviation along

32

each row (and each column) of the MSC and MSV matrices will be computed as the

feature values

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l lt L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

\mu^{MFCC}_{MSC,row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)   (40)

\sigma^{MFCC}_{MSC,row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu^{MFCC}_{MSC,row}(l) \right)^2 \right)^{1/2}   (41)

\mu^{MFCC}_{MSV,row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)   (42)

\sigma^{MFCC}_{MSV,row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu^{MFCC}_{MSV,row}(l) \right)^2 \right)^{1/2}   (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L

and can be represented as

f^{MFCC}_{row} = [\mu^{MFCC}_{MSC,row}(0), \sigma^{MFCC}_{MSC,row}(0), \mu^{MFCC}_{MSV,row}(0), \sigma^{MFCC}_{MSV,row}(0), \ldots, \mu^{MFCC}_{MSC,row}(L-1), \sigma^{MFCC}_{MSC,row}(L-1), \mu^{MFCC}_{MSV,row}(L-1), \sigma^{MFCC}_{MSV,row}(L-1)]^T   (44)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

\mu^{MFCC}_{MSC,col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)   (45)

\sigma^{MFCC}_{MSC,col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu^{MFCC}_{MSC,col}(j) \right)^2 \right)^{1/2}   (46)

\mu^{MFCC}_{MSV,col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)   (47)

\sigma^{MFCC}_{MSV,col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu^{MFCC}_{MSV,col}(j) \right)^2 \right)^{1/2}   (48)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f^{MFCC}_{col} = [\mu^{MFCC}_{MSC,col}(0), \sigma^{MFCC}_{MSC,col}(0), \mu^{MFCC}_{MSV,col}(0), \sigma^{MFCC}_{MSV,col}(0), \ldots, \mu^{MFCC}_{MSC,col}(J-1), \sigma^{MFCC}_{MSC,col}(J-1), \mu^{MFCC}_{MSV,col}(J-1), \sigma^{MFCC}_{MSV,col}(J-1)]^T   (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T]^T   (50)

In summary, the row-based modulation spectral feature vector is of size 4L = 4×20 = 80 and the column-based one is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
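The row/column aggregation of Eqs. (40)-(49) can be sketched as follows. Note the concatenation order differs from the interleaved ordering of Eq. (44), but the resulting 4L + 4J = 112 values are the same; `aggregate` is an illustrative helper name.

```python
import numpy as np

def aggregate(msc, msv):
    """Mean and standard deviation along each row and each column of the
    J x L MSC and MSV matrices, giving a (4L + 4J)-dimensional feature vector."""
    parts = []
    for axis in (0, 1):  # axis 0: per-feature (row) stats; axis 1: per-subband (column)
        for mat in (msc, msv):
            parts += [mat.mean(axis=axis), mat.std(axis=axis)]
    return np.concatenate(parts)

rng = np.random.default_rng(4)
msc = np.abs(rng.standard_normal((8, 20)))  # J = 8 subbands, L = 20 coefficients
msv = np.abs(rng.standard_normal((8, 20)))
f = aggregate(msc, msv)
```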

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 le d lt D) row of

the MSC and MSV matrices of MOSC can be computed as follows

\mu^{OSC}_{MSC,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)

\sigma^{OSC}_{MSC,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu^{OSC}_{MSC,row}(d) \right)^2 \right)^{1/2}   (52)

\mu^{OSC}_{MSV,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)

\sigma^{OSC}_{MSV,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu^{OSC}_{MSV,row}(d) \right)^2 \right)^{1/2}   (54)


Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{OSC}_{row} = [\mu^{OSC}_{MSC,row}(0), \sigma^{OSC}_{MSC,row}(0), \mu^{OSC}_{MSV,row}(0), \sigma^{OSC}_{MSV,row}(0), \ldots, \mu^{OSC}_{MSC,row}(D-1), \sigma^{OSC}_{MSC,row}(D-1), \mu^{OSC}_{MSV,row}(D-1), \sigma^{OSC}_{MSV,row}(D-1)]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu^{OSC}_{MSC,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)

\sigma^{OSC}_{MSC,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu^{OSC}_{MSC,col}(j) \right)^2 \right)^{1/2}   (57)

\mu^{OSC}_{MSV,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)

\sigma^{OSC}_{MSV,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu^{OSC}_{MSV,col}(j) \right)^2 \right)^{1/2}   (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{OSC}_{col} = [\mu^{OSC}_{MSC,col}(0), \sigma^{OSC}_{MSC,col}(0), \mu^{OSC}_{MSV,col}(0), \sigma^{OSC}_{MSV,col}(0), \ldots, \mu^{OSC}_{MSC,col}(J-1), \sigma^{OSC}_{MSC,col}(J-1), \mu^{OSC}_{MSV,col}(J-1), \sigma^{OSC}_{MSV,col}(J-1)]^T   (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T   (61)

In summary, the row-based modulation spectral feature vector is of size 4D = 4×20 = 80 and the column-based one is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu^{NASE}_{MSC,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma^{NASE}_{MSC,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu^{NASE}_{MSC,row}(d) \right)^2 \right)^{1/2}   (63)

\mu^{NASE}_{MSV,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma^{NASE}_{MSV,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu^{NASE}_{MSV,row}(d) \right)^2 \right)^{1/2}   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

f^{NASE}_{row} = [\mu^{NASE}_{MSC,row}(0), \sigma^{NASE}_{MSC,row}(0), \mu^{NASE}_{MSV,row}(0), \sigma^{NASE}_{MSV,row}(0), \ldots, \mu^{NASE}_{MSC,row}(D-1), \sigma^{NASE}_{MSC,row}(D-1), \mu^{NASE}_{MSV,row}(D-1), \sigma^{NASE}_{MSV,row}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu^{NASE}_{MSC,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma^{NASE}_{MSC,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu^{NASE}_{MSC,col}(j) \right)^2 \right)^{1/2}   (68)

\mu^{NASE}_{MSV,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma^{NASE}_{MSV,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu^{NASE}_{MSV,col}(j) \right)^2 \right)^{1/2}   (70)


Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f^{NASE}_{col} = [\mu^{NASE}_{MSC,col}(0), \sigma^{NASE}_{MSC,col}(0), \mu^{NASE}_{MSV,col}(0), \sigma^{NASE}_{MSV,col}(0), \ldots, \mu^{NASE}_{MSC,col}(J-1), \sigma^{NASE}_{MSC,col}(J-1), \mu^{NASE}_{MSV,col}(J-1), \sigma^{NASE}_{MSV,col}(J-1)]^T   (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f^{NASE}_{row})^T, (f^{NASE}_{col})^T]^T   (72)

In summary, the row-based modulation spectral feature vector is of size 4D = 4×19 = 76 and the column-based one is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


Fig 28 The row-based statistical aggregation: the mean \mu_{row,d} and standard deviation \sigma_{row,d} are computed along each row (feature dimension) of the MSC and MSV matrices, across modulation frequency

Fig 29 The column-based statistical aggregation: the mean \mu_{col,j} and standard deviation \sigma_{col,j} are computed along each column (modulation subband) of the MSC and MSV matrices, across feature dimension


216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C   (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
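The linear min-max normalization of Eqs. (74)-(75) can be sketched as follows (the small epsilon guarding against a zero feature range is an added assumption, not part of the original formulation):

```python
import numpy as np

def fit_minmax(train):
    """train: (n_samples, n_features). Per-feature min and max, Eq. (75)."""
    return train.min(axis=0), train.max(axis=0)

def normalize(f, fmin, fmax):
    """Eq. (74); the epsilon avoids division by zero for constant features."""
    return (f - fmin) / (fmax - fmin + 1e-12)

train = np.random.default_rng(6).uniform(-5, 5, size=(30, 10))
fmin, fmax = fit_minmax(train)
norm = normalize(train, fmin, fmax)
```

At test time the same `fmin`/`fmax` learned from the training set are reused.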

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification

accuracy in a lower-dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing

the between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)

where \bar{x} is the mean vector of all training vectors The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion JF defined as the ratio of between-class scatter to within-class scatter

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right)   (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space In this study a whitening procedure is integrated with LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the


orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the

corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then

whitening transformed by ΦΛ-12

(79) )( T21 xΦΛx minus=w

It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}) derived from all the whitened training vectors will become an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi   (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x   (81)
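The whitened LDA procedure of Eqs. (76)-(81) can be sketched as follows. This is an illustrative NumPy sketch that assumes S_W is non-singular (which holds when there are enough training vectors relative to the feature dimension):

```python
import numpy as np

def whitened_lda(X, y):
    """X: (n, H) training vectors, y: integer class labels.
    Returns the H x (C-1) whitened LDA transformation matrix."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)            # Eq. (76)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                # Sw = Phi diag(lam) Phi^T
    W = Phi @ np.diag(lam ** -0.5)               # whitening: W^T Sw W = I
    Sb_w = W.T @ Sb @ W                          # whitened between-class scatter
    ev, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(ev)[::-1][:len(classes) - 1]]  # top C-1 eigenvectors
    return W @ Psi                               # Eq. (80)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, size=(20, 5)) for c in range(3)])
y = np.repeat(np.arange(3), 20)
A = whitened_lda(X, y)
Y = X @ A  # Eq. (81): reduced (C-1)-dimensional features
```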

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA

transformed feature vector In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
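The nearest-centroid decision rule of Eqs. (82)-(83) can be sketched as follows; the helper names are illustrative.

```python
import numpy as np

def fit_centroids(Y, labels):
    """Eq. (82): per-genre mean of the transformed training vectors."""
    return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(y_vec, centroids):
    """Eq. (83): pick the genre whose centroid has minimum Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(y_vec - centroids[c]))

rng = np.random.default_rng(8)
Y = np.vstack([rng.normal(loc=3 * c, scale=0.5, size=(10, 2)) for c in range(3)])
labels = np.repeat(np.arange(3), 10)
cents = fit_centroids(Y, labels)
pred = classify(np.array([3.0, 3.0]), cents)  # nearest to the class-1 centroid
```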

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz 128 kbps 16 bits per sample stereo MP3 files In this

study each MP3 audio file is first converted into raw digital audio before

classification These music tracks are classified into six classes (that is C = 6)

Classical Electronic Jazz/Blues Metal/Punk Rock/Pop and World In summary the


music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre
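The weighted overall accuracy of Eq. (84) can be sketched as follows; since P_c = N_c / N_total, the weighted sum reduces to the trace of the confusion matrix divided by the total track count:

```python
def overall_accuracy(confusion):
    """confusion[c][k]: number of class-c tracks classified as class k.
    Eq. (84): sum over c of P_c * CA_c."""
    total = sum(sum(row) for row in confusion)
    acc = 0.0
    for c, row in enumerate(confusion):
        n_c = sum(row)
        acc += (n_c / total) * (row[c] / n_c)  # P_c * CA_c
    return acc

cm = [[50, 10], [5, 35]]  # toy 2-class confusion matrix
ca = overall_accuracy(cm)
```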

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC1                      77.50
SMOSC1                       79.15
SMASE1                       77.78
SMMFCC1+SMOSC1+SMASE1        84.64


Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1

(a) (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        275        0        2       0         1       19
Electronic       0       91        0       1         7        6
Jazz             6        0       18       0         0        4
MetalPunk        2        3        0      36        20        4
PopRock          4       12        5       8        70       14
World           33        8        1       0         4       75
Total          320      114       26      45       102      122

(a) (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94     0.00      7.69     0.00      0.98    15.57
Electronic    0.00    79.82      0.00     2.22      6.86     4.92
Jazz          1.88     0.00     69.23     0.00      0.00     3.28
MetalPunk     0.63     2.63      0.00    80.00     19.61     3.28
PopRock       1.25    10.53     19.23    17.78     68.63    11.48
World        10.31     7.02      3.85     0.00      3.92    61.48

(b) (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        292        1        1       0         2       10
Electronic       1       89        1       2        11       11
Jazz             4        0       19       1         1        6
MetalPunk        0        5        0      32        21        3
PopRock          0       13        3      10        61        8
World           23        6        2       0         6       84
Total          320      114       26      45       102      122

(b) (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25     0.88      3.85     0.00      1.96     8.20
Electronic    0.31    78.07      3.85     4.44     10.78     9.02
Jazz          1.25     0.00     73.08     2.22      0.98     4.92
MetalPunk     0.00     4.39      0.00    71.11     20.59     2.46
PopRock       0.00    11.40     11.54    22.22     59.80     6.56
World         7.19     5.26      7.69     0.00      5.88    68.85


(c) (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        286        3        1       0         3       18
Electronic       0       87        1       1         9        5
Jazz             5        4       17       0         0        9
MetalPunk        0        4        1      36        18        4
PopRock          1       10        3       7        68       13
World           28        6        3       1         4       73
Total          320      114       26      45       102      122

(c) (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38     2.63      3.85     0.00      2.94    14.75
Electronic    0.00    76.32      3.85     2.22      8.82     4.10
Jazz          1.56     3.51     65.38     0.00      0.00     7.38
MetalPunk     0.00     3.51      3.85    80.00     17.65     3.28
PopRock       0.31     8.77     11.54    15.56     66.67    10.66
World         8.75     5.26     11.54     2.22      3.92    59.84

(d) (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        1       0         0        9
Electronic       0       96        1       1         9        9
Jazz             2        1       21       0         0        1
MetalPunk        0        1        0      34         8        1
PopRock          1        9        2       9        80       16
World           17        7        1       1         5       86
Total          320      114       26      45       102      122

(d) (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      3.85     0.00      0.00     7.38
Electronic    0.00    84.21      3.85     2.22      8.82     7.38
Jazz          0.63     0.88     80.77     0.00      0.00     0.82
MetalPunk     0.00     0.88      0.00    75.56      7.84     0.82
PopRock       0.31     7.89      7.69    20.00     78.43    13.11
World         5.31     6.14      3.85     2.22      4.90    70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector achieves the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC2                      70.64
SMOSC2                       68.59
SMASE2                       71.74
SMMFCC2+SMOSC2+SMASE2        78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2.

(a) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22
Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19
MetalPunk 2 7 0 39 30 4
PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54
Total 320 114 26 45 102 122


(a) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26

(b) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33
Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20
MetalPunk 1 5 0 33 21 2
PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51
Total 320 114 26 45 102 122

(b) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80

(c) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122


(c) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26

(d) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote the combined feature vectors of MFCC, OSC, and NASE, respectively. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Average classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set CA (%)
SMMFCC3 80.38
SMOSC3 81.34
SMASE3 81.21
SMMFCC3+SMOSC3+SMASE3 85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3.

(a) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57


(b) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39


(d) (counts) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
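The percentage matrices above are simply the count matrices normalized by the per-genre totals (rows are predicted genres, columns are true genres). As an illustration, the per-class and overall accuracies of the combined feature vector can be recomputed from the raw counts of matrix (d):

```python
# Per-class and overall accuracy from the confusion-matrix counts of
# Table 3.6(d): rows are predicted genres, columns are true genres.
genres = ["Classic", "Electronic", "Jazz", "MetalPunk", "PopRock", "World"]
counts = [
    [300,  2,  0,  0,  0,  8],   # predicted Classic
    [  2, 95,  0,  2,  7,  9],   # predicted Electronic
    [  1,  1, 20,  0,  0,  0],   # predicted Jazz
    [  0,  0,  0, 35, 10,  1],   # predicted MetalPunk
    [  1, 10,  3,  7, 79, 11],   # predicted PopRock
    [ 16,  6,  3,  1,  6, 93],   # predicted World
]

totals = [sum(row[j] for row in counts) for j in range(len(genres))]
per_class = {g: counts[j][j] / totals[j] for j, g in enumerate(genres)}
overall = sum(counts[j][j] for j in range(len(genres))) / sum(totals)

for g in genres:
    print(f"{g:>10}: {100 * per_class[g]:6.2f}%")
print(f"   overall: {100 * overall:6.2f}%")   # 85.32%, as reported above
```

The overall accuracy, 622 correct tracks out of 729, is exactly the 85.32% figure quoted for the proposed method.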

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC, respectively.


Table 3.7 Comparison of the average classification accuracy (%) using MSCs & MSVs versus modulation subband energy (MSE) as the feature values

Feature Set MSCs & MSVs MSE
SMMFCC1 77.50 72.02
SMMFCC2 70.64 69.82
SMMFCC3 80.38 79.15
SMOSC1 79.15 77.50
SMOSC2 68.59 70.51
SMOSC3 81.34 80.11
SMASE1 77.78 76.41
SMASE2 71.74 71.06
SMASE3 81.21 79.15
SMMFCC1+SMOSC1+SMASE1 84.64 85.08
SMMFCC2+SMOSC2+SMASE2 78.60 79.01
SMMFCC3+SMOSC3+SMASE3 85.32 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10(3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of the ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7(2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32(1) (2003) 83-93.
[8] U. Bağcı, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14(8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15(5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "'The way it sounds': timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, 7(6) (2005) 1028-1035.
[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, 32(1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representations," IEEE Trans. on Circuits and Systems for Video Technology, 14(5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Trans. on Speech and Audio Processing, 13(2) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histograms in audio and symbolic music information retrieval," Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, 8(6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, 102(3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, 23(2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, 25(1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, 52(10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, 13(3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.
[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, 15(4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop on Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65(2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1) (1997) 119-139.



Abstract

This thesis proposes using modulation spectral analysis to observe long-term feature variations and to extract contrast features from them. First, a feature vector representing each frame is extracted over the whole song (the per-frame features used in this thesis are MFCC, OSC, and MPEG-7 NASE). Modulation spectrum analysis is then applied to characterize the variation of the features across frames; the energy of each modulation subband is obtained by partitioning the modulation spectrum into different modulation bands, and the contrast feature of each band is extracted. In the experiments, after a test music signal is input and the required features are extracted, linear normalization and LDA dimensionality reduction are applied; the Euclidean distance between the test signal and each music class is then computed, and the class with the shortest distance is taken as the classification result. The experimental results clearly show that the features extracted by modulation spectral analysis outperform the traditional features formed by the mean and standard-deviation vectors over all frames, with a best classification accuracy of 85.32%.


Abstract

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. Since a typical music database often contains millions of music tracks, it is very difficult to manage such a large music database; it is therefore helpful for managing a vast amount of music tracks if they are properly categorized. A novel feature set derived from modulation spectrum analysis of Mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and normalized audio spectral envelope (NASE) is proposed for music genre classification. The extracted features derived from modulation spectrum analysis can capture the time-varying behavior of music signals. The experimental results show that the feature vectors derived from modulation spectrum analysis achieve better performance than those obtained by taking the mean and standard deviation. In addition, applying statistical aggregation to the feature values of the modulation subbands can reduce the feature dimension efficiently. The classification accuracy can be further improved by using linear discriminant analysis (LDA), while the feature dimension is reduced.


Acknowledgments

During my graduate studies I gained some understanding of the field of audio signal processing. More importantly, I learned the attitude and method of research: perseverance, determination, and the spirit of seeking truth from facts. The person who taught me this deeper meaning is my advisor, Professor 李建興. From him I learned how to carry out the research process and how to look at problems, and I experienced the joy of the moment when a bottleneck is finally broken through. I sincerely thank my advisor for reading and revising this thesis many times so that it could be completed smoothly, and I am deeply grateful to him for correcting and proofreading my journal submission until late into the night; his efforts are engraved in my heart. I also thank Professors 連振昌, 韓欽銓, 石昭玲, and 周智勳 for their guidance in my coursework and reports; to them I express my deepest thanks.

I would also like to give special thanks to Professor 吳翠霞 and Professor Lilian; without their teaching and help, I would not be who I am today. I also thank my seniors 忠茂, 炳佑, 清乾, 正達, 昭偉, 建程, 家銘, and 靈逸 for their guidance; my classmates 銘輝, 岳岷, 佐民, 雅麟, and 佑維 for their mutual support and help; and my juniors 勝斌, 正崙, 偉欣, 明修, 信吉, 琮瑋, 仁政, 蘇峻, 雅婷, 佩蓉, 永坤, 致娟, and 堯文 for their company. Whether in study or in daily life, these are unforgettable memories of a wonderful research life.

Finally, I thank the mentors of my life: my father 林文鈴, who silently supported me; my mother 韓樹珍, who cared for me attentively; and my brother 懷志, who grew up with me. Under the pressure of research, your encouragement and support kept me going. I dedicate this thesis to you with my deepest gratitude.


CONTENTS

ABSTRACT
CONTENTS
CHAPTER 1  INTRODUCTION
  1.1  Motivation
  1.2  Review of music genre classification systems
       1.2.1  Feature extraction
              1.2.1.1  Short-term features
                       1.2.1.1.1  Timbral features
                       1.2.1.1.2  Rhythmic features
                       1.2.1.1.3  Pitch features
              1.2.1.2  Long-term features
                       1.2.1.2.1  Mean and standard deviation
                       1.2.1.2.2  Autoregressive model
                       1.2.1.2.3  Modulation spectrum analysis
       1.2.2  Linear discriminant analysis (LDA)
       1.2.3  Feature classifier
  1.3  Outline of thesis
CHAPTER 2  THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM
  2.1  Feature extraction
       2.1.1  Mel-frequency cepstral coefficients (MFCC)
       2.1.2  Octave-based spectral contrast (OSC)
       2.1.3  Normalized audio spectral envelope (NASE)
       2.1.4  Modulation spectral analysis
              2.1.4.1  Modulation spectral contrast of MFCC (MMFCC)
              2.1.4.2  Modulation spectral contrast of OSC (MOSC)
              2.1.4.3  Modulation spectral contrast of NASE (MASE)
       2.1.5  Statistical aggregation of modulation spectral feature values
              2.1.5.1  Statistical aggregation of MMFCC (SMMFCC)
              2.1.5.2  Statistical aggregation of MOSC (SMOSC)
              2.1.5.3  Statistical aggregation of MASE (SMASE)
       2.1.6  Feature vector normalization
  2.2  Linear discriminant analysis
  2.3  Music genre classification phase
CHAPTER 3  EXPERIMENTAL RESULTS
  3.1  Comparison of row-based modulation spectral feature vectors
  3.2  Comparison of column-based modulation spectral feature vectors
  3.3  Combination of row-based and column-based modulation spectral feature vectors
CHAPTER 4  CONCLUSION
REFERENCES


Chapter 1

Introduction

1.1 Motivation

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, it is helpful for managing a vast amount of music tracks if they are properly categorized in advance. In general, retail or online music stores often organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers, but determining the music genre of a music track manually is laborious and time-consuming. Therefore, a number of supervised classification techniques have been developed for the automatic classification of unlabeled music tracks [1-11]. In this study, we focus on the music genre classification problem, which is defined as the genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. Many studies also examine sets of classifiers to improve the classification performance; however, the improvement is limited. In fact, employing effective feature sets has a much greater effect on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to clearly define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) People often prefer to search music by browsing hierarchical catalogs. (2) Taxonomy structures identify the relationships or dependences between music genres; hierarchical taxonomy structures thus provide a coarse-to-fine classification approach that improves classification efficiency and accuracy. (3) Classification errors become more acceptable with a taxonomy than with direct music genre classification, because the coarse-to-fine approach concentrates the classification errors at a given level of the hierarchy.

Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given piece of music into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and varied features can be employed at each branch point of the taxonomy. The hierarchical classification approach therefore allows managers to trace the level at which classification errors occur frequently. Barbedo and Lopes [14] also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed bottom-up instead of top-down, because it is easy to merge leaf classes into the same parent class in a bottom-up structure, so the upper layers can be constructed easily. In their experimental results, the classification accuracy of the hierarchical bottom-up approach outperforms the top-down approach by about 3% to 5%.

Li and Ogihara [15] investigated the effect of two different taxonomy structures on music genre classification. They also proposed an approach for the automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach can reduce the time-consuming and expensive task of manually constructing taxonomies. It also helps in searching music collections for which there are no natural taxonomies [16]. Given a genre taxonomy, many different approaches have been proposed to classify the music genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy


Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system


1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, covering timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentation or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows:

(1) Low-Energy Feature: the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

ZCR_t = \frac{1}{2} \sum_{n=1}^{N-1} \left| \mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1]) \right|

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
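A minimal sketch of this definition, with sign(x) = 1 for positive input and 0 otherwise as stated above:

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame per the definition above: sign(x) is 1 for
    positive input and 0 otherwise; the absolute first differences of
    the sign sequence are summed and halved."""
    s = (np.asarray(frame) > 0).astype(int)
    return 0.5 * np.abs(np.diff(s)).sum()

# A noisier, faster-oscillating frame yields a higher ZCR.
t = np.arange(8000) / 8000.0                 # 1 s at 8 kHz
low = np.sin(2 * np.pi * 100 * t)            # 100 Hz tone
high = np.sin(2 * np.pi * 1000 * t)          # 1 kHz tone
print(zero_crossing_rate(low) < zero_crossing_rate(high))   # True
```

Note that with the {0, 1} sign convention each actual zero crossing contributes 1/2 to the sum, so the value is half the raw crossing count.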

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth measures the spread of the magnitude spectrum around the centroid:

SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \, M_t[n]}{\sum_{n=1}^{N} M_t[n]}

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated, i.e., the largest bin index satisfying

\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]

where S[k] is the magnitude of the k-th frequency bin of the current frame.

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th and (t-1)-th frames, respectively.
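Features (3) through (6) can be computed directly from a frame's magnitude spectrum; the following sketch follows the definitions above (with the formulas' one-based frequency-bin indexing):

```python
import numpy as np

def spectral_shape(mag, prev_norm=None, rolloff=0.85):
    """Centroid, bandwidth, roll-off bin, and flux of one frame's
    magnitude spectrum, following definitions (3)-(6) above."""
    mag = np.asarray(mag, dtype=float)
    n = np.arange(1, len(mag) + 1)            # 1-based bin index
    centroid = (n * mag).sum() / mag.sum()
    bandwidth = ((n - centroid) ** 2 * mag).sum() / mag.sum()
    cumulative = np.cumsum(mag)
    rolloff_bin = int(np.searchsorted(cumulative, rolloff * cumulative[-1])) + 1
    norm = mag / np.linalg.norm(mag)          # normalized spectrum, for flux
    flux = None if prev_norm is None else ((norm - prev_norm) ** 2).sum()
    return centroid, bandwidth, rolloff_bin, flux, norm

# A spectrum with all energy in bin 3: centroid 3, zero bandwidth,
# roll-off at bin 3, and zero flux against an identical previous frame.
c, bw, ro, _, prev = spectral_shape([0, 0, 1, 0, 0])
_, _, _, flux, _ = spectral_shape([0, 0, 1, 0, 0], prev_norm=prev)
print(c, bw, ro, flux)   # 3.0 0.0 3 0.0
```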

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone: the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.
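One widely used closed form of this Hz-to-mel mapping (the text does not commit to a specific variant, so the constants here are illustrative) is mel(f) = 2595 * log10(1 + f/700):

```python
import math

def hz_to_mel(f):
    # One common analytic form of the Hz-to-mel mapping; approximately
    # linear below ~1 kHz and logarithmic above it.
    return 2595.0 * math.log10(1.0 + f / 700.0)

for f in (100, 500, 1000, 4000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

Under this formula 1000 Hz maps to roughly 1000 mel, while 4000 Hz maps to far less than 4000 mel, reflecting the logarithmic compression at higher frequencies.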


(8) Octave-Based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately, and can roughly reflect the distribution of harmonic and non-harmonic components.
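The peak/valley idea can be sketched as follows; the band edges and the neighborhood fraction alpha = 0.2 are illustrative assumptions, not the exact octave-band configuration of [3] or of this thesis:

```python
import numpy as np

def subband_peak_valley(mag, lo, hi, alpha=0.2):
    """Spectral peak, valley, and contrast of one subband [lo, hi):
    the log-average of the largest / smallest alpha-fraction of the
    sorted magnitudes (alpha and the band edges are illustrative)."""
    x = np.sort(np.asarray(mag[lo:hi], dtype=float))[::-1]
    k = max(1, int(alpha * len(x)))
    peak = np.log(x[:k].mean() + 1e-12)
    valley = np.log(x[-k:].mean() + 1e-12)
    return peak, valley, peak - valley

# A band with strong harmonic peaks over a weak floor shows a much
# larger contrast than a flat, noise-like band.
flat = np.full(64, 1.0)
peaky = np.full(64, 0.1)
peaky[::8] = 5.0                      # 8 strong "harmonic" bins
_, _, c_flat = subband_peak_valley(flat, 0, 64)
_, _, c_peaky = subband_peak_valley(peaky, 0, 64)
print(c_flat < c_peaky)               # True
```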

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined with reference to the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then, each ASE coefficient is normalized by the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple-pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term features

To find a representative feature vector for a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, autoregressive models [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

Taking the mean and standard deviation is the most commonly used method to integrate short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

\mu[d] = \frac{1}{T} \sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1

\sigma[d] = \left[ \frac{1}{T} \sum_{i=0}^{T-1} \left( x_i[d] - \mu[d] \right)^2 \right]^{1/2}, \quad 0 \le d \le D-1

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
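The two equations above amount to the following aggregation (a minimal sketch):

```python
import numpy as np

def aggregate_mean_std(frames):
    """Collapse a (T, D) matrix of short-term feature vectors into one
    2D-dimensional long-term vector [mu; sigma], per the equations
    above (population standard deviation, matching the 1/T factor)."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Three 2-dimensional frame vectors -> one 4-dimensional track vector:
# mean [3, 0] and std [sqrt(8/3), 0].
v = aggregate_mean_std([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
print(v.shape)   # (4,)
```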

12122 Autoregressive model (AR model)

Meng et al. [9] used AR models to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model; the extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationships between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the coefficient dimension is p × D × D, where D is the dimension of a short-term feature vector.
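A p-order MAR model of this kind can be fitted by ordinary least squares; the sketch below is a plain numpy illustration of where the p × D × D coefficient dimension comes from, not necessarily the exact estimator used in [9]:

```python
import numpy as np

def fit_mar(X, p=1):
    """Least-squares fit of a p-order MAR model x_t = sum_k A_k x_{t-k} + e.

    X: (T, D) array of short-term feature vectors.
    Returns the stacked coefficient matrices, shape (p, D, D), matching the
    p x D x D coefficient dimension quoted above.
    """
    T, D = X.shape
    # Regressor matrix: each row stacks the p lagged feature vectors
    Z = np.hstack([X[p - k - 1:T - k - 1] for k in range(p)])  # (T-p, p*D)
    Y = X[p:]                                                  # (T-p, D)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)               # (p*D, D)
    # Reorder into one D x D matrix per lag
    return coef.T.reshape(D, p, D).transpose(1, 0, 2)          # (p, D, D)

rng = np.random.default_rng(0)
A = fit_mar(rng.normal(size=(200, 3)), p=2)  # two 3x3 coefficient matrices
```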

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification; they showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes, and the optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.
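The scatter-matrix formulation behind these steps can be sketched in numpy as follows; this is a minimal illustration (the pseudo-inverse and the toy two-class data are choices made here, not details from the thesis):

```python
import numpy as np

def lda_transform(X, y, d):
    """Fit an LDA projection from n dims down to d dims.

    Maximizes between-class scatter Sb while minimizing within-class
    scatter Sw, via the leading eigenvectors of pinv(Sw) @ Sb.
    """
    mean_all = X.mean(axis=0)
    n_feat = X.shape[1]
    Sw = np.zeros((n_feat, n_feat))       # within-class scatter
    Sb = np.zeros((n_feat, n_feat))       # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-eigvals.real)     # keep the d strongest directions
    return eigvecs[:, order[:d]].real     # n x d transformation matrix

# Two well-separated 3-D classes projected onto one discriminant axis
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
W = lda_transform(X, y, 1)
z = X @ W   # 1-D projections; the two classes stay far apart
```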

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all classes, which does not take class-wise differences into account.

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote over the frames decides the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum &amp; Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and a decision tree achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used several low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system in which invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which cannot be correctly classified, and the GMM model of each music genre is updated with each correctly classified frame. Moreover, an additional GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate) as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genres: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy can reach 88.60% when the frame length is 30 s and each GMM is modeled with 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extracted features from these high-dissimilarity LDB nodes. First, they use a wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel measures, the energy distribution over frequencies (D1) and a nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, comprising the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is applied, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed music genre classification method is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, conclusions are given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral

(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is

proposed for music genre classification

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal; the detailed steps are given below.

Step 1 Pre-emphasis

\hat{s}[n] = s[n] - a \times s[n-1] (1)

where s[n] is the current sample and s[n−1] is the previous sample; a typical value for a is 0.95.

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3 Windowing

Each frame is multiplied by a Hamming window

\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n < N (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n < N (3)


Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using FFT

X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j 2\pi n k / N}, \quad 0 \le k < N (4)

where k is the frequency index.

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N/2 - 1 (5)

where B is the total number of filters (B = 25 in this study); I_{b,l} and I_{b,h} denote the low-frequency and high-frequency indices of the b-th band-pass filter, respectively; and A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2.

I_{b,l} and I_{b,h} are given as

I_{b,l} = \left\lfloor \frac{f_{b,l}}{f_s} N \right\rfloor, \quad I_{b,h} = \left\lfloor \frac{f_{b,h}}{f_s} N \right\rfloor (6)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low and high frequencies of the b-th band-pass filter, as shown in Table 2.1.

Step 6 Discrete cosine transform (DCT)

MFCC can be obtained by applying DCT on the logarithm of E(b):

MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\bigl(1 + E_i(b)\bigr)\cos\!\left(\frac{l\,(b+0.5)\,\pi}{B}\right), \quad 0 \le l < L (7)

where L is the length of the MFCC feature vector (L = 20 in this study).


Therefore the MFCC feature vector can be represented as follows

x^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T (8)
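Steps 1 through 6 can be condensed into a short numpy sketch. The rectangular band summation and the four band edges below are simplifications chosen for illustration; the thesis uses all 25 Mel-scale filters of Table 2.1:

```python
import numpy as np

def mfcc_frame(s, fs=22050, a=0.95, L=3,
               band_edges=((0, 200), (100, 300), (200, 400), (300, 500))):
    """Toy single-frame MFCC: pre-emphasis, window, FFT, band energies, DCT."""
    s = np.asarray(s, dtype=float)
    s_hat = np.append(s[0], s[1:] - a * s[:-1])       # Step 1: pre-emphasis
    N = len(s_hat)
    frame = s_hat * np.hamming(N)                     # Step 3: Hamming window
    A = np.abs(np.fft.fft(frame)) ** 2                # Step 4: A[k] = |X[k]|^2
    B = len(band_edges)
    E = np.empty(B)
    for b, (fl, fh) in enumerate(band_edges):         # Step 5: subband energies
        kl, kh = int(fl / fs * N), int(fh / fs * N)   # Eq. (6) index mapping
        E[b] = A[kl:kh + 1].sum()
    # Step 6: DCT of the log energies, exactly as written in Eq. (7)
    b_idx = np.arange(B)
    return np.array([np.sum(np.log10(1 + E) * np.cos(l * (b_idx + 0.5) * np.pi / B))
                     for l in range(L)])

c = mfcc_frame(np.sin(2 * np.pi * 220 * np.arange(512) / 22050))
```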

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 200]
1   (100, 300]
2   (200, 400]
3   (300, 500]
4   (400, 600]
5   (500, 700]
6   (600, 800]
7   (700, 900]
8   (800, 1000]
9   (900, 1149]
10  (1000, 1320]
11  (1149, 1516]
12  (1320, 1741]
13  (1516, 2000]
14  (1741, 2297]
15  (2000, 2639]
16  (2297, 3031]
17  (2639, 3482]
18  (3031, 4000]
19  (3482, 4595]
20  (4000, 5278]
21  (4595, 6063]
22  (5278, 6964]
23  (6063, 8000]
24  (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature; the detailed steps are described below.


Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2 Octave Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B (9)

where B is the number of subbands; I_{b,l} and I_{b,h} denote the low-frequency and high-frequency indices of the b-th band-pass filter, respectively; and A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2.

I_{b,l} and I_{b,h} are given as

I_{b,l} = \left\lfloor \frac{f_{b,l}}{f_s} N \right\rfloor, \quad I_{b,h} = \left\lfloor \frac{f_{b,h}}{f_s} N \right\rfloor (10)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low and high frequencies of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:


Peak(b) = \log\!\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right) (11)

Valley(b) = \log\!\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b - i + 1}\right) (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b) (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus the OSC feature vector of an audio frame can be represented as follows:

x^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T (14)
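Equations (11) through (13) for a single subband can be sketched as follows (the magnitude values in the example are made up for illustration):

```python
import numpy as np

def osc_subband(mag, alpha=0.2):
    """Spectral peak, valley, and contrast for one subband, per Eqs. (11)-(13).

    mag: magnitudes of the FFT bins falling inside the subband.
    alpha: neighborhood factor (0.2 in the thesis).
    """
    m = np.sort(np.asarray(mag, dtype=float))[::-1]  # M_{b,1} >= ... >= M_{b,Nb}
    n = max(1, int(round(alpha * len(m))))           # alpha * Nb bins at each end
    peak = np.log(1 + m[:n].mean())                  # Eq. (11): strongest bins
    valley = np.log(1 + m[-n:].mean())               # Eq. (12): weakest bins
    return peak, valley, peak - valley               # Eq. (13): contrast SC(b)

p, v, sc = osc_subband([10.0, 8.0, 1.0, 0.5, 0.2, 6.0, 0.1, 3.0, 2.0, 0.3])
```

Averaging the α-neighborhood rather than taking a single extreme bin makes the peak and valley estimates robust to isolated noisy bins.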

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number  Frequency interval (Hz)
0  [0, 0]
1  (0, 100]
2  (100, 200]
3  (200, 400]
4  (400, 800]
5  (800, 1600]
6  (1600, 3200]
7  (3200, 6400]
8  (6400, 12800]
9  (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0,\ k = N/2 \\ \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases} (15)


where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2 (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE scale filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves}, \quad -4 \le j \le 3 (18)

I_{b,l} and I_{b,h} are the low-frequency and high-frequency indices of the b-th band-pass filter, given as

I_{b,l} = \left\lfloor \frac{f_{b,l}}{f_s} N \right\rfloor, \quad I_{b,h} = \left\lfloor \frac{f_{b,h}}{f_s} N \right\rfloor (19)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low and high frequencies of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1 (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10\log_{10}(ASE(b)), \quad 0 \le b \le B+1 (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 (22)

where the RMS-norm gain value R is defined as

R = \left(\sum_{b=0}^{B+1} \bigl(ASE_{dB}(b)\bigr)^2\right)^{1/2} (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B + 3. Thus the NASE feature vector of an audio frame can be represented as follows:

x^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T (24)
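A minimal per-frame sketch of Eqs. (15) through (24), using only a handful of hypothetical band edges instead of the full Table 2.3 set:

```python
import numpy as np

def nase_frame(frame, fs=22050, edges=(62, 125, 250, 500, 1000)):
    """Toy NASE for one frame; `edges` is a small subset of band edges,
    chosen here for illustration only."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                                  # Eq. (16)
    X = np.fft.fft(frame * w)
    P = np.abs(X) ** 2 / (N * Ew)                        # Eq. (15) ...
    P[1:N // 2] *= 2                                     # ... one-sided doubling
    ase = np.array([P[int(lo / fs * N):int(hi / fs * N) + 1].sum()
                    for lo, hi in zip(edges[:-1], edges[1:])])  # Eqs. (17)/(20)
    ase_db = 10 * np.log10(ase + 1e-12)                  # Eq. (21), guarded log
    R = np.sqrt(np.sum(ase_db ** 2))                     # Eq. (23): RMS-norm gain
    nase = ase_db / R                                    # Eq. (22)
    return np.concatenate([[R], nase])                   # Eq. (24): [R, NASE(.)]

x = nase_frame(np.sin(2 * np.pi * 440 * np.arange(1024) / 22050))
```

By construction the NASE part of the vector has unit L2 norm, since each decibel coefficient is divided by its own RMS-norm gain R.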


Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 coefficients between loEdge and hiEdge = 16 kHz, and one coefficient above hiEdge)


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 62]
1   (62, 88]
2   (88, 125]
3   (125, 176]
4   (176, 250]
5   (250, 353]
6   (353, 500]
7   (500, 707]
8   (707, 1000]
9   (1000, 1414]
10  (1414, 2000]
11  (2000, 2828]
12  (2828, 4000]
13  (4000, 5656]
14  (5656, 8000]
15  (8000, 11313]
16  (11313, 16000]
17  (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along its time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t\times(W/2)+n}[l]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le l < L (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study W = 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\ 0 \le l < L (26)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) (28)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) (29)

As a result, all MSCs (or MSVs) form an L×J matrix containing the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
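The texture-window FFT, time averaging, and per-subband max/min of Eqs. (25) through (29) can be sketched as follows. W and the subband edges are scaled down so the toy example stays small; the thesis uses W = 512 and the Table 2.4 subbands:

```python
import numpy as np

def modulation_msc_msv(feat, W=64,
                       subbands=((0, 2), (2, 4), (4, 8), (8, 16), (16, 32))):
    """MSC and MSV matrices from frame-level feature trajectories.

    feat: (num_frames, L) array of short-term feature values.
    Returns (msc, msv), each of shape (J, L).
    """
    T_frames, L = feat.shape
    hop = W // 2                                        # 50% texture-window overlap
    mags = []
    for start in range(0, T_frames - W + 1, hop):
        seg = feat[start:start + W]                     # one texture window
        mags.append(np.abs(np.fft.fft(seg, axis=0)))    # Eq. (25) per feature dim
    M = np.mean(mags, axis=0)                           # Eq. (26): time average
    J = len(subbands)
    msp = np.empty((J, L))
    msv = np.empty((J, L))
    for j, (lo, hi) in enumerate(subbands):             # Eqs. (27)-(28)
        msp[j] = M[lo:hi].max(axis=0)
        msv[j] = M[lo:hi].min(axis=0)
    return msp - msv, msv                               # Eq. (29): MSC, plus MSV

rng = np.random.default_rng(1)
msc, msv = modulation_msc_msv(rng.normal(size=(256, 20)))
```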

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along its time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t\times(W/2)+n}[d]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le d < D (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study W = 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D (31)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) (34)

As a result, all MSCs (or MSVs) form a D×J matrix containing the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along its time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t\times(W/2)+n}[d]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le d < D (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study W = 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D (36)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) (39)

As a result, all MSCs (or MSVs) form a D×J matrix containing the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → modulation spectrum (DFT) → windowing/average → contrast/valley determination → MASE)

Table 2.4 Frequency interval of each modulation subband

Filter number  Modulation frequency index range  Modulation frequency interval (Hz)
0  [0, 2)     [0, 0.33)
1  [2, 4)     [0.33, 0.66)
2  [4, 8)     [0.66, 1.32)
3  [8, 16)    [1.32, 2.64)
4  [16, 32)   [2.64, 5.28)
5  [32, 64)   [5.28, 10.56)
6  [64, 128)  [10.56, 21.12)
7  [128, 256) [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u^{MFCC}_{MSC\text{-}row}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) (40)

\sigma^{MFCC}_{MSC\text{-}row}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{MFCC}(j, l) - u^{MFCC}_{MSC\text{-}row}(l)\bigr)^2\right)^{1/2} (41)

u^{MFCC}_{MSV\text{-}row}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) (42)

\sigma^{MFCC}_{MSV\text{-}row}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{MFCC}(j, l) - u^{MFCC}_{MSV\text{-}row}(l)\bigr)^2\right)^{1/2} (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f^{MFCC}_{row} = [u^{MFCC}_{MSC\text{-}row}(0), \sigma^{MFCC}_{MSC\text{-}row}(0), u^{MFCC}_{MSV\text{-}row}(0), \sigma^{MFCC}_{MSV\text{-}row}(0), \ldots, u^{MFCC}_{MSC\text{-}row}(L-1), \sigma^{MFCC}_{MSC\text{-}row}(L-1), u^{MFCC}_{MSV\text{-}row}(L-1), \sigma^{MFCC}_{MSV\text{-}row}(L-1)]^T (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{MFCC}_{MSC\text{-}col}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) (45)

\sigma^{MFCC}_{MSC\text{-}col}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSC^{MFCC}(j, l) - u^{MFCC}_{MSC\text{-}col}(j)\bigr)^2\right)^{1/2} (46)

u^{MFCC}_{MSV\text{-}col}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) (47)

\sigma^{MFCC}_{MSV\text{-}col}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSV^{MFCC}(j, l) - u^{MFCC}_{MSV\text{-}col}(j)\bigr)^2\right)^{1/2} (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{MFCC}_{col} = [u^{MFCC}_{MSC\text{-}col}(0), \sigma^{MFCC}_{MSC\text{-}col}(0), u^{MFCC}_{MSV\text{-}col}(0), \sigma^{MFCC}_{MSV\text{-}col}(0), \ldots, u^{MFCC}_{MSC\text{-}col}(J-1), \sigma^{MFCC}_{MSC\text{-}col}(J-1), u^{MFCC}_{MSV\text{-}col}(J-1), \sigma^{MFCC}_{MSV\text{-}col}(J-1)]^T (49)

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size 4L + 4J can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T]^T (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L + 4J. That is, the overall feature dimension of SMMFCC is 80 + 32 = 112.
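The row- and column-wise aggregation of Eqs. (40) through (49) reduces each J×L MSC (or MSV) matrix to 2L + 2J values; doing this for both matrices gives the 4L + 4J = 112-dimensional SMMFCC vector. A numpy sketch (the ordering of the concatenated values is simplified relative to the interleaved layout of Eqs. (44) and (49)):

```python
import numpy as np

def aggregate(msc, msv):
    """Mean/std along each row and column of the J x L MSC and MSV matrices.

    Returns a vector of length 4L + 4J.
    """
    def stats(M):
        return [M.mean(axis=0), M.std(axis=0),   # per feature value (over J): 2L values
                M.mean(axis=1), M.std(axis=1)]   # per modulation subband (over L): 2J values
    return np.concatenate(stats(msc) + stats(msv))

J, L = 8, 20                                     # 8 modulation subbands, 20 MFCCs
rng = np.random.default_rng(0)
f = aggregate(rng.random((J, L)), rng.random((J, L)))  # length 4*20 + 4*8 = 112
```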

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u^{OSC}_{MSC\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) (51)

\sigma^{OSC}_{MSC\text{-}row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j, d) - u^{OSC}_{MSC\text{-}row}(d)\bigr)^2\right)^{1/2} (52)

u^{OSC}_{MSV\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) (53)

\sigma^{OSC}_{MSV\text{-}row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j, d) - u^{OSC}_{MSV\text{-}row}(d)\bigr)^2\right)^{1/2} (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{OSC}_{row} = [u^{OSC}_{MSC\text{-}row}(0), \sigma^{OSC}_{MSC\text{-}row}(0), u^{OSC}_{MSV\text{-}row}(0), \sigma^{OSC}_{MSV\text{-}row}(0), \ldots, u^{OSC}_{MSC\text{-}row}(D-1), \sigma^{OSC}_{MSC\text{-}row}(D-1), u^{OSC}_{MSV\text{-}row}(D-1), \sigma^{OSC}_{MSV\text{-}row}(D-1)]^T (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{OSC}_{MSC\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) (56)

\sigma^{OSC}_{MSC\text{-}col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j, d) - u^{OSC}_{MSC\text{-}col}(j)\bigr)^2\right)^{1/2} (57)

u^{OSC}_{MSV\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) (58)

\sigma^{OSC}_{MSV\text{-}col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j, d) - u^{OSC}_{MSV\text{-}col}(j)\bigr)^2\right)^{1/2} (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{OSC}_{col} = [u^{OSC}_{MSC\text{-}col}(0), \sigma^{OSC}_{MSC\text{-}col}(0), u^{OSC}_{MSV\text{-}col}(0), \sigma^{OSC}_{MSV\text{-}col}(0), \ldots, u^{OSC}_{MSC\text{-}col}(J-1), \sigma^{OSC}_{MSC\text{-}col}(J-1), u^{OSC}_{MSV\text{-}col}(J-1), \sigma^{OSC}_{MSV\text{-}col}(J-1)]^T (60)

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size 4D + 4J can be obtained:

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_MSC-row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSC^NASE(d, j)    (62)

σ_MSC-row^NASE(d) = [(1/J) Σ_{j=0}^{J−1} (MSC^NASE(d, j) − u_MSC-row^NASE(d))²]^{1/2}    (63)

u_MSV-row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSV^NASE(d, j)    (64)

σ_MSV-row^NASE(d) = [(1/J) Σ_{j=0}^{J−1} (MSV^NASE(d, j) − u_MSV-row^NASE(d))²]^{1/2}    (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_row^NASE = [u_MSC-row^NASE(0), σ_MSC-row^NASE(0), u_MSV-row^NASE(0), σ_MSV-row^NASE(0), …, u_MSC-row^NASE(D−1), σ_MSC-row^NASE(D−1), u_MSV-row^NASE(D−1), σ_MSV-row^NASE(D−1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_MSC-col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSC^NASE(d, j)    (67)

σ_MSC-col^NASE(j) = [(1/D) Σ_{d=0}^{D−1} (MSC^NASE(d, j) − u_MSC-col^NASE(j))²]^{1/2}    (68)

u_MSV-col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSV^NASE(d, j)    (69)

σ_MSV-col^NASE(j) = [(1/D) Σ_{d=0}^{D−1} (MSV^NASE(d, j) − u_MSV-col^NASE(j))²]^{1/2}    (70)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^NASE = [u_MSC-col^NASE(0), σ_MSC-col^NASE(0), u_MSV-col^NASE(0), σ_MSV-col^NASE(0), …, u_MSC-col^NASE(J−1), σ_MSC-col^NASE(J−1), u_MSV-col^NASE(J−1), σ_MSV-col^NASE(J−1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^NASE = [(f_row^NASE)^T, (f_col^NASE)^T]^T    (72)

In summary, the row-based MSC (or MSV) feature vector is of size 4D = 4×19 = 76 and the column-based one is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 28 The row-based modulation spectral features: for each feature dimension d, the MSC and MSV values across all modulation-frequency subbands of a texture window are aggregated into a mean μ_d-row and a standard deviation σ_d-row.

Fig. 29 The column-based modulation spectral features: for each modulation-frequency subband j, the MSC and MSV values across all feature dimensions of a texture window are aggregated into a mean μ_j-col and a standard deviation σ_j-col.

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may vary, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)),  1 ≤ c ≤ C    (74)

where C is the number of classes, f̄_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
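The min-max normalization of Eqs. (74)-(75) can be sketched as follows. This is a minimal illustration with hypothetical inputs: `train_feats` stands for an N×M matrix with one feature vector per training piece; the returned (min, max) pair would be reused to map test features with the same transform.

```python
import numpy as np

def normalize_features(train_feats):
    """Min-max normalize each feature value to [0, 1] (cf. Eqs. (74)-(75)).

    train_feats: N x M matrix, one row per training music piece.
    Returns the normalized matrix and (f_min, f_max) per feature.
    """
    f_min = train_feats.min(axis=0)                   # f_min(m) over all training pieces
    f_max = train_feats.max(axis=0)                   # f_max(m)
    span = np.where(f_max > f_min, f_max - f_min, 1)  # guard against constant features
    return (train_feats - f_min) / span, (f_min, f_max)

X = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
Xn, (lo, hi) = normalize_features(X)
print(Xn.min(axis=0), Xn.max(axis=0))  # each column spans [0, 1]
```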

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T    (77)

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr((A^T S_W A)^{−1} (A^T S_B A))    (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues. Thus, S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{−1/2}:

x_w = (ΦΛ^{−1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_{W,w} = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_{B,w} = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B,w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ    (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x    (81)
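The whitening-plus-LDA steps of Eqs. (76)-(81) can be sketched with numpy as below. This is an illustrative sketch, not the thesis implementation: it assumes S_W is nonsingular (true when there are enough training vectors per dimension), and variable names are chosen to mirror the symbols above.

```python
import numpy as np

def whitened_lda(X, labels):
    """Compute the whitened LDA transform A_WLDA (cf. Eqs. (76)-(81)).

    X: N x H matrix of training vectors; labels: length-N class ids.
    Returns A_WLDA of shape H x (C-1).
    """
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                    # within-class scatter, Eq. (76)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                        # between-class scatter, Eq. (77)

    lam, Phi = np.linalg.eigh(Sw)                        # Sw Phi = Phi Lambda
    W = Phi @ np.diag(lam ** -0.5)                       # whitening matrix Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                                  # whitened between-class scatter
    ev, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(ev)[::-1][:len(classes) - 1]]  # top C-1 eigenvectors
    return W @ Psi                                       # A_WLDA = Phi Lambda^{-1/2} Psi, Eq. (80)

# Toy data: three 5-dimensional classes with shifted means -> a 5 x 2 transform
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(30, 5)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)
A = whitened_lda(X, y)
print(A.shape)  # (5, 2)
```

Projecting a feature vector is then simply `A.T @ x`, as in Eq. (81).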

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}    (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c)    (83)
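The nearest centroid decision of Eqs. (82)-(83) amounts to very little code. A minimal sketch with hypothetical toy vectors:

```python
import numpy as np

def train_centroids(Y, labels):
    """Per-genre centroids of the LDA-transformed training vectors, Eq. (82)."""
    classes = np.unique(labels)
    return classes, np.array([Y[labels == c].mean(axis=0) for c in classes])

def classify(y, classes, centroids):
    """Nearest-centroid decision under Euclidean distance, Eq. (83)."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return classes[np.argmin(dists)]

# Two toy genres in a 2-D transformed space
classes, cents = train_centroids(np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]),
                                 np.array([0, 0, 1, 1]))
print(classify(np.array([5.5, 4.0]), classes, cents))  # 1
```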

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{1≤c≤C} P_c · CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
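Concretely, Eq. (84) weights each per-class accuracy by that class's share of the test set. The sketch below uses the ISMIR2004 test-set class sizes stated above; the per-class accuracies are placeholders for illustration, not results from this thesis.

```python
# Overall accuracy as the appearance-probability-weighted sum of
# per-class accuracies, Eq. (84), with the ISMIR2004 test-set sizes.
counts = {"Classical": 320, "Electronic": 114, "Jazz/Blues": 26,
          "Metal/Punk": 45, "Rock/Pop": 102, "World": 122}
# placeholder per-class accuracies (CA_c), for illustration only
per_class_acc = {"Classical": 0.90, "Electronic": 0.80, "Jazz/Blues": 0.70,
                 "Metal/Punk": 0.75, "Rock/Pop": 0.65, "World": 0.70}

total = sum(counts.values())  # 729 test tracks
ca = sum(counts[c] / total * per_class_acc[c] for c in counts)  # P_c = counts[c] / total
print(round(ca, 4))  # 0.7995
```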

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC1                       77.50
SMOSC1                        79.15
SMASE1                        77.78
SMMFCC1+SMOSC1+SMASE1         84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix lists track counts (columns: ground-truth genre; rows: classification results) and the second the corresponding percentages (%) of each ground-truth class.

(a) SMMFCC1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Pop/Rock          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Pop/Rock          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Pop/Rock          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Pop/Rock          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49
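In the confusion tables above, each percentage matrix is the count matrix normalized by its ground-truth column totals. A minimal sketch of that conversion (values may differ from the printed tables in the last digit because of rounding conventions):

```python
import numpy as np

def column_percentages(counts):
    """Normalize a confusion matrix by its column (ground-truth) totals,
    giving the per-genre percentage matrices shown in the tables."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=0, keepdims=True)

# first column of a counts matrix: 275 of 320 Classical tracks classified as Classical
col = [[275], [0], [6], [2], [4], [33]]
pct = column_percentages(col)
print(pct.ravel())  # 275/320 -> 85.9375 %, etc.
```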

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based results, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for column-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC2                       70.64
SMOSC2                        68.59
SMASE2                        71.74
SMMFCC2+SMOSC2+SMASE2         78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix lists track counts (columns: ground-truth genre; rows: classification results) and the second the corresponding percentages (%) of each ground-truth class.

(a) SMMFCC2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1     1           0         6     22
Electronic        0          84     0           2         8      4
Jazz             13           1    19           1         2     19
Metal/Punk        2           7     0          39        30      4
Pop/Rock          0          11     3           3        47     19
World            33          10     3           0         9     54
Total           320         114    26          45       102    122

(a) SMMFCC2, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2     0           0         3     33
Electronic        0          83     0           1         9      6
Jazz             17           1    20           0         6     20
Metal/Punk        1           5     0          33        21      2
Pop/Rock          0          17     4          10        51     10
World            40           6     2           1        12     51
Total           320         114    26          45       102    122

(b) SMOSC2, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0     0           0         2     29
Electronic        0          83     0           1         5      2
Jazz              9           3    17           1         2     15
Metal/Punk        1           5     1          35        24      7
Pop/Rock          2          13     1           8        57     15
World            31          10     7           0        12     54
Total           320         114    26          45       102    122

(c) SMASE2, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5     0           0         3     18
Electronic        0          89     0           2         4      4
Jazz              2           3    19           0         1     10
Metal/Punk        2           2     0          38        21      2
Pop/Rock          0          12     5           4        61     11
World            27           3     2           1        12     77
Total           320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC3                       80.38
SMOSC3                        81.34
SMASE3                        81.21
SMMFCC3+SMOSC3+SMASE3         85.32

Table 3.6 Confusion matrices of the combined row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix lists track counts (columns: ground-truth genre; rows: classification results) and the second the corresponding percentages (%) of each ground-truth class.

(a) SMMFCC3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     1           0         3     19
Electronic        0          86     0           1         7      5
Jazz              2           0    18           0         0      3
Metal/Punk        1           4     0          35        18      2
Pop/Rock          1          16     4           8        67     13
World            16           6     3           1         7     80
Total           320         114    26          45       102    122

(a) SMMFCC3, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     0           0         1     13
Electronic        0          90     1           2         9      6
Jazz              0           0    21           0         0      4
Metal/Punk        0           2     0          31        21      2
Pop/Rock          0          11     3          10        64     10
World            20          11     1           2         7     87
Total           320         114    26          45       102    122

(b) SMOSC3, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2     1           0         0     17
Electronic        1          91     0           1         4      3
Jazz              0           2    19           0         0      5
Metal/Punk        0           2     1          34        20      8
Pop/Rock          2          13     4           8        71      8
World            21           4     1           2         7     81
Total           320         114    26          45       102    122

(c) SMASE3, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     0           0         0      8
Electronic        2          95     0           2         7      9
Jazz              1           1    20           0         0      0
Metal/Punk        0           0     0          35        10      1
Pop/Rock          1          10     3           7        79     11
World            16           6     3           1         6     93
Total           320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3, percentages (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional method when row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) using MSC & MSV features versus modulation subband energy (MSE) features

Feature Set                  MSCs & MSVs    MSE
SMMFCC1                            77.50  72.02
SMMFCC2                            70.64  69.82
SMMFCC3                            80.38  79.15
SMOSC1                             79.15  77.50
SMOSC2                             68.59  70.51
SMOSC3                             81.34  80.11
SMASE1                             77.78  76.41
SMASE2                             71.74  71.06
SMASE3                             81.21  79.15
SMMFCC1+SMOSC1+SMASE1              84.64  85.08
SMMFCC2+SMOSC2+SMASE2              78.60  79.01
SMMFCC3+SMOSC3+SMASE3              85.32  85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.

[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, 2003, pp. 8-11.

[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007 (2006) 1-12.

[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 197-200.

[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histogram in audio and symbolic music information retrieval, Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-8.

[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech, and Language Processing 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.


Abstract

With the development of computer networks, it becomes more and more popular to purchase and download digital music from the Internet. Since a typical music database often contains millions of music tracks, it is very difficult to manage such a large music database; properly categorizing music tracks in advance is therefore helpful in managing a vast amount of them. In this thesis, a novel feature set derived from modulation spectrum analysis of Mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and normalized audio spectral envelope (NASE) is proposed for music genre classification. The extracted features derived from modulation spectrum analysis can capture the time-varying behavior of music signals. The experimental results show that the feature vectors derived from modulation spectrum analysis achieve better performance than those taking the mean and standard deviation operations. In addition, applying statistical aggregation to the feature values of the modulation subbands can reduce the feature dimension efficiently. The classification accuracy can be further improved by using linear discriminant analysis (LDA), while the feature dimension is reduced at the same time.

Acknowledgments

During my graduate studies I gained some understanding of the field of audio signal processing, and, more importantly, I learned the attitude and methods of research: perseverance, determination, and a truth-seeking spirit. The person who led me to appreciate this deeper meaning is my advisor, Prof. 李建興. From him I learned the research process and how to look at problems, and I experienced the joy of the moment a bottleneck is finally broken through. I sincerely thank him for reading and revising my thesis many times so that it could be completed smoothly, and for helping me proofread my journal submission late into the night; I will always remember his efforts. I also thank Prof. 連振昌, Prof. 韓欽銓, Prof. 石昭玲, and Prof. 周智勳 for their guidance in my coursework and presentations, and I express my deep gratitude to them here.

I would especially like to thank Prof. 吳翠霞 and Prof. Lilian; without their teaching and help, I would not be who I am today. I also thank my seniors 忠茂, 炳佑, 清乾, 正達, 昭偉, 建程, 家銘, and 靈逸 for their guidance; my classmates 銘輝, 岳岷, 佐民, 雅麟, and 佑維 for their mutual support and help; and my juniors 勝斌, 正崙, 偉欣, 明修, 信吉, 琮瑋, 仁政, 蘇峻, 雅婷, 佩蓉, 永坤, 致娟, and 堯文 for their company. Whether in my studies or in daily life, these are unforgettable memories of a wonderful research life.

Finally, I thank the mentors of my life: my father 林文鈴, who silently supported me; my mother 韓樹珍, who took care of me attentively; and my younger brother 懷志, who grew up with me. Under the pressure of research, your encouragement and support allowed me to keep going. I dedicate this thesis to you with my deepest gratitude.

CONTENTS

ABSTRACT .......................................................................... II
CONTENTS .......................................................................... IV
CHAPTER 1 INTRODUCTION ............................................................ 1
  1.1 Motivation .................................................................. 1
  1.2 Review of music genre classification system ................................. 2
    1.2.1 Feature Extraction ...................................................... 5
      1.2.1.1 Short-term features ................................................. 5
        1.2.1.1.1 Timbral features ................................................ 5
        1.2.1.1.2 Rhythmic features ............................................... 7
        1.2.1.1.3 Pitch features .................................................. 7
      1.2.1.2 Long-term features .................................................. 8
        1.2.1.2.1 Mean and standard deviation ..................................... 8
        1.2.1.2.2 Autoregressive model ............................................ 9
        1.2.1.2.3 Modulation spectrum analysis .................................... 9
    1.2.2 Linear discriminant analysis (LDA) ..................................... 10
    1.2.3 Feature Classifier ..................................................... 10
  1.3 Outline of Thesis .......................................................... 13
CHAPTER 2 THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM ........................ 13
  2.1 Feature Extraction ......................................................... 14
    2.1.1 Mel-Frequency Cepstral Coefficient (MFCC) .............................. 14
    2.1.2 Octave-based Spectral Contrast (OSC) ................................... 17
    2.1.3 Normalized Audio Spectral Envelope (NASE) .............................. 19
    2.1.4 Modulation Spectral Analysis ........................................... 23
      2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC) ....................... 23
      2.1.4.2 Modulation Spectral Contrast of OSC (MOSC) ......................... 25
      2.1.4.3 Modulation Spectral Contrast of NASE (MASE) ........................ 27
    2.1.5 Statistical Aggregation of Modulation Spectral Feature Values .......... 30
      2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC) .......................... 30
      2.1.5.2 Statistical Aggregation of MOSC (SMOSC) ............................ 32
      2.1.5.3 Statistical Aggregation of MASE (SMASE) ............................ 33
    2.1.6 Feature vector normalization ........................................... 35
  2.2 Linear discriminant analysis ............................................... 36
  2.3 Music Genre Classification phase ........................................... 38
CHAPTER 3 EXPERIMENTAL RESULTS .................................................. 39
  3.1 Comparison of row-based modulation spectral feature vectors ................ 40
  3.2 Comparison of column-based modulation spectral feature vectors ............. 42
  3.3 Combination of row-based and column-based modulation spectral feature vectors .. 45
CHAPTER 4 CONCLUSION ............................................................ 48
REFERENCES ...................................................................... 49


Chapter 1

Introduction

1.1 Motivation

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, managing a vast amount of music tracks becomes much easier when they are properly categorized in advance. In general, retail and online music stores organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers, but determining the music genre of a music track in this way is laborious and time-consuming. Therefore, a number of supervised classification techniques have been developed for automatic classification of unlabeled music tracks [1-11]. In this study, we focus on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies have examined various sets of classifiers to improve the classification performance; however, the resulting improvement is limited. In fact, employing an effective feature set has a much greater impact on classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to define a universally agreed-upon structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has several merits: (1) people often prefer to search for music by browsing hierarchical catalogs; (2) taxonomy structures identify the relationships or dependencies between music genres, so hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves classification efficiency and accuracy; (3) classification errors become more acceptable with a taxonomy than with direct music genre classification, since the coarse-to-fine approach concentrates the classification errors at a given level of the hierarchy.

Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and distinct features can be employed at each branch point of the taxonomy. The hierarchical classification approach therefore allows the managers to trace at which level the classification errors occur most frequently. Barbedo and Lopes [14] also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed bottom-up instead of top-down, because it is easy to merge leaf classes into a common parent class in a bottom-up construction, so the upper layers can be built easily. In their experiments, the classification accuracy of the hierarchical bottom-up approach outperformed the top-down approach by about 3%-5%.

Li and Ogihara [15] investigated the effect of two different taxonomy structures on music genre classification. They also proposed an approach for automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach can reduce the time-consuming and expensive task of manual taxonomy construction. It also helps to explore music collections for which there is no natural taxonomy [16]. Given a genre taxonomy, many different approaches have been proposed to classify the genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy

Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system


1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features generally characterize properties related to instrumentation or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows:

(1) Low-Energy Feature: the percentage of analysis windows that have an RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\,\right|$

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
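As a quick illustration, the ZCR definition above can be sketched in Python (an illustrative helper, not code from the thesis; `frame` is an assumed list of time-domain samples for one frame):

```python
def sign(x):
    # As defined in the text: 1 for positive input, 0 otherwise.
    return 1 if x > 0 else 0

def zero_crossing_rate(frame):
    # ZCR_t = (1/2) * sum_{n=1}^{N-1} |sign(x[n]) - sign(x[n-1])|
    return 0.5 * sum(abs(sign(frame[n]) - sign(frame[n - 1]))
                     for n in range(1, len(frame)))

# A signal that alternates in sign crosses zero at every sample step,
# so every term of the sum contributes 1.
zcr = zero_crossing_rate([1.0, -1.0, 1.0, -1.0, 1.0])  # -> 2.0
```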

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

$C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth measures the frequency spread of the signal around the spectral centroid:

$SB_t = \left(\frac{\sum_{n=1}^{N}(n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}\right)^{1/2}$

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

$\sum_{k=0}^{R_t} S[k] = 0.85 \times \sum_{k=0}^{N-1} S[k]$

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectra:

$SF_t = \sum_{k=0}^{N-1}\left(N_t[k] - N_{t-1}[k]\right)^2$

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
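The four spectral-shape measures above can be sketched as follows. This is a hedged illustration under the stated definitions (bins indexed 1..N map to array indices 0..N-1; the function names are my own, not the thesis code):

```python
import numpy as np

def spectral_centroid(M):
    # Center of gravity of the magnitude spectrum M.
    M = np.asarray(M, dtype=float)
    n = np.arange(1, len(M) + 1)
    return np.sum(n * M) / np.sum(M)

def spectral_bandwidth(M):
    # Magnitude-weighted spread around the centroid.
    M = np.asarray(M, dtype=float)
    n = np.arange(1, len(M) + 1)
    c = spectral_centroid(M)
    return np.sqrt(np.sum(((n - c) ** 2) * M) / np.sum(M))

def spectral_rolloff(M, ratio=0.85):
    # Smallest bin index below which `ratio` of the magnitude lies.
    cum = np.cumsum(np.asarray(M, dtype=float))
    return int(np.searchsorted(cum, ratio * cum[-1]))

def spectral_flux(M_cur, M_prev):
    # Squared difference of the normalized magnitude spectra.
    a = np.asarray(M_cur, dtype=float)
    b = np.asarray(M_prev, dtype=float)
    a, b = a / np.sum(a), b / np.sum(b)
    return float(np.sum((a - b) ** 2))
```

For a spectrum with all its energy in one bin, the centroid sits at that bin and the bandwidth is zero, which is a handy sanity check.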

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone; the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. MFCC have proven very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each sub-band separately, and can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained as the sum of the log power spectrum in each logarithmic subband. Each ASE coefficient is then normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE, called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and sub-beats, and the relative strength of sub-beats to the main beat. Many beat-tracking algorithms [18, 19] that provide an estimate of the main beat and its strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers; the main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find a representative feature vector for a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, autoregressive models [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

Taking the mean and standard deviation is the most common method for integrating short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \qquad 0 \le d < D$

$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d] - \mu[d]\right)^2\right]^{1/2}, \qquad 0 \le d < D$

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
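A minimal sketch of this mean/standard-deviation aggregation (illustrative only; `X` is an assumed T x D matrix of short-term feature vectors, one row per frame):

```python
import numpy as np

def aggregate_mean_std(X):
    # mu[d] = (1/T) sum_i x_i[d]; sigma[d] uses the same 1/T normalization
    # (population standard deviation), matching the formulas above.
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return np.concatenate([mu, sigma])  # 2D-dimensional long-term vector

# Two frames of 2-D features -> a 4-D long-term feature vector.
feat = aggregate_mean_std([[0.0, 2.0],
                           [2.0, 2.0]])  # -> [1.0, 2.0, 1.0, 0.0]
```

Note that the resulting vector discards frame order entirely, which is exactly the weakness the modulation spectral features proposed in this thesis aim to address.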

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used AR models to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model; the extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. For a p-order MAR model, the coefficient dimension is p x D x D, where D is the dimension of a short-term feature vector.

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition; it has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification, showing that modulation-scale features combined with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract tempo features for music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data; the means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with discrimination between classes rather than representation of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d <= n. The transformation should enhance the separability among different classes, and the optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled well by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all classes, which does not take class-wise differences into account.
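A compact sketch of the scatter-matrix formulation of LDA described above (an assumption-laden illustration, not the procedure of Chapter 2: `lda_fit` and its inputs are hypothetical names, and `pinv` is used in place of a plain inverse for robustness):

```python
import numpy as np

def lda_fit(X, y, d):
    # Find a projection W maximizing between-class scatter Sb
    # relative to within-class scatter Sw.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean_all = X.mean(axis=0)
    n_feat = X.shape[1]
    Sw = np.zeros((n_feat, n_feat))        # within-class scatter
    Sb = np.zeros((n_feat, n_feat))        # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Solve Sw^-1 Sb w = lambda w and keep the d leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:d]]      # n_feat x d transformation matrix

# Two well-separated 2-D classes projected down to one dimension.
X = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9]]
y = [0, 0, 1, 1]
W = lda_fit(X, y, 1)
```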

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. Within Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet; within Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote decides the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiments, the combination of the GMM classifier and the decision tree achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters from the calculated features. They demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. Their system attains a classification accuracy of 93.0% for five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system in which invalid frames are first detected and discarded. To determine whether a frame is valid, a GMM model is constructed for each music genre; these GMM models are used to sift out the frames that cannot be correctly classified, and the GMM model of a music genre is updated with each correctly classified frame. Moreover, an additional GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genres: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity between the LDB nodes of any two classes and to extract features from the high-dissimilarity LDB nodes. First, a five-level wavelet packet tree decomposition is constructed for each music signal. Then two novel measures, the energy distribution over frequencies (D1) and a nonstationarity index (D2), are used to quantify the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, comprising the energies and variances of the basis-vector coefficients of the first 15 high-dissimilarity nodes. Experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis, the average classification accuracy is 91% at the first level (artificial and natural sounds), 99% at the second level (instrumental and automobile, human and nonhuman), and 95% at the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT obtained by recursively convolving the input signal with a pair of low-pass and high-pass filters; unlike the DWT, which recursively decomposes only the low-pass subband, the DWPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, a conclusion is given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.3. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) and cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have proven very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal; the detailed steps are given below.

Step 1: Pre-emphasis

$\hat{s}[n] = s[n] - a \times s[n-1]$  (1)

where s[n] is the current sample, s[n-1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples); each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

$\tilde{s}_i[n] = \hat{s}_i[n]\,w[n], \qquad 0 \le n < N$  (2)

where the Hamming window function w[n] is defined as

$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n < N$  (3)


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$X_i[k] = \sum_{n=0}^{N-1}\tilde{s}_i[n]\,e^{-j2\pi kn/N}, \qquad 0 \le k < N$  (4)

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \qquad 0 \le b < B,\; 0 \le k \le N/2 - 1$  (5)

where B is the total number of filters (B = 25 in this study), and I_b^l and I_b^h denote the low-frequency index and high-frequency index of the b-th band-pass filter, respectively. A_i[k] is the squared magnitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N}$  (6)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC are obtained by applying the DCT to the logarithm of E(b):

$MFCC_i(l) = \sum_{b=0}^{B-1}\log_{10}\!\left(1 + E_i(b)\right)\cos\!\left(\frac{\pi l (b + 0.5)}{B}\right), \qquad 0 \le l < L$  (7)

where L is the length of the MFCC feature vector (L = 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

xMFCC = [MFCC(0), MFCC(1), ..., MFCC(L-1)]^T  (8)
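Steps 1-6 above can be sketched for a single frame as follows. This is a hedged illustration under the equations (1)-(7) as written: in particular, each band energy in Eq. (5) is a plain sum of squared FFT magnitudes over the band's bins, and the band list passed in is only the first few rows of Table 2.1, not the full 25-filter bank:

```python
import numpy as np

def mfcc_frame(s, fs, bands, L=20, a=0.95):
    s = np.asarray(s, dtype=float)
    N = len(s)
    # Eq. (1): pre-emphasis (first sample kept as-is).
    pre = np.append(s[0], s[1:] - a * s[:-1])
    # Eqs. (2)-(3): Hamming window.
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    # Eq. (4): FFT, then squared magnitude.
    A = np.abs(np.fft.fft(pre * w)) ** 2
    # Eqs. (5)-(6): band energies, band edges converted to bin indices.
    B = len(bands)
    E = np.zeros(B)
    for b, (f_lo, f_hi) in enumerate(bands):
        k_lo, k_hi = int(f_lo / (fs / N)), int(f_hi / (fs / N))
        E[b] = A[k_lo:k_hi + 1].sum()
    # Eq. (7): DCT of log10(1 + E).
    b_idx = np.arange(B)
    return np.array([np.sum(np.log10(1 + E) *
                            np.cos(np.pi * l * (b_idx + 0.5) / B))
                     for l in range(L)])

# Usage with the first three filters of Table 2.1 (illustrative values).
bands = [(0, 200), (100, 300), (200, 400)]
frame = np.sin(2 * np.pi * 150 * np.arange(512) / 8000)
coeffs = mfcc_frame(frame, fs=8000, bands=bands, L=3)
```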

Fig. 2.1 The flowchart for computing MFCC (input signal, pre-emphasis, framing, windowing, FFT, Mel-scale band-pass filtering, DCT)


Table 2.1 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to non-harmonic components or noise in music signals; therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature; the detailed steps are described below.


Step 1: Framing and Spectral Analysis

The input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \qquad 0 \le b < B,\; 0 \le k \le N/2 - 1$  (9)

where B is the number of subbands, and I_b^l and I_b^h denote the low-frequency index and high-frequency index of the b-th band-pass filter, respectively. A_i[k] is the squared magnitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N}$  (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, ..., M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} >= M_{b,2} >= ... >= M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$Peak(b) = \log\!\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)$  (11)

$Valley(b) = \log\!\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)$  (12)

where alpha is a neighborhood factor (alpha = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$SC(b) = Peak(b) - Valley(b)$  (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

xOSC = [Valley(0), ..., Valley(B-1), SC(0), ..., SC(B-1)]^T  (14)
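The peak/valley selection of Step 3 can be sketched for one subband as follows (a hedged illustration of Eqs. (11)-(13); `mag` is an assumed array of FFT magnitudes for that subband, and the function name is my own):

```python
import numpy as np

def osc_peak_valley(mag, alpha=0.2):
    # Sort the subband magnitudes in decreasing order, as in the text.
    mag = np.sort(np.asarray(mag, dtype=float))[::-1]
    Nb = len(mag)
    k = max(1, int(alpha * Nb))          # the alpha*Nb largest / smallest bins
    peak = np.log(1 + mag[:k].mean())    # Eq. (11)
    valley = np.log(1 + mag[-k:].mean()) # Eq. (12)
    return peak, valley, peak - valley   # Eq. (13): spectral contrast

# A subband mixing strong harmonics with low-level noise gives a large
# contrast; a flat subband gives zero contrast.
peak, valley, sc = osc_peak_valley(
    [10.0, 0.1, 0.2, 9.0, 0.3, 8.0, 0.4, 7.0, 0.5, 6.0])
```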

Fig. 2.2 The flowchart for computing OSC (input signal, framing, FFT, octave-scale filtering, peak/valley selection, spectral contrast)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE is defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

The input music signal is divided into a number of successive overlapped frames; each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 <= k <= N, where N is the FFT size. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0,\ N/2 \\ \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases}$  (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2$  (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge"), a range of 8 octaves (see Fig. 2.4). The subband decomposition can be described as follows (see Table 2.3):

$ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k), \qquad 0 \le b < B$  (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$r = 2^j \text{ octaves}, \qquad -4 \le j \le 3$  (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N}$  (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k), \qquad 0 \le b \le B+1$  (20)

Each ASE coefficient is then converted to the decibel scale:

$ASE_{dB}(b) = 10\log_{10}\!\left(ASE(b)\right), \qquad 0 \le b \le B+1$  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$NASE(b) = \frac{ASE_{dB}(b)}{R}, \qquad 0 \le b \le B+1$  (22)

where the RMS-norm gain value R is defined as

$R = \sqrt{\sum_{b=0}^{B+1}\left(ASE_{dB}(b)\right)^2}$  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, one coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B + 3, and the NASE feature vector of an audio frame is represented as follows:

xNASE = [R, NASE(0), NASE(1), ..., NASE(B+1)]^T  (24)
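The normalization of Step 3 (Eqs. (20)-(24)) can be sketched as follows. This is an illustrative fragment under assumed inputs: `ase` stands for the per-band power sums of Eq. (20) already computed (here just four example band powers), and the function name is hypothetical:

```python
import numpy as np

def nase_from_ase(ase):
    ase = np.asarray(ase, dtype=float)
    ase_db = 10 * np.log10(ase)           # Eq. (21): decibel scale
    R = np.sqrt(np.sum(ase_db ** 2))      # Eq. (23): RMS-norm gain value
    nase = ase_db / R                     # Eq. (22): normalization
    return np.concatenate([[R], nase])    # Eq. (24): [R, NASE(0), ...]^T

# Four example band powers -> a 5-dimensional [R, NASE(...)] vector.
vec = nase_from_ase([2.0, 4.0, 8.0, 4.0])
```

By construction, the NASE components (everything after R) form a unit-norm vector, so the spectral shape and the overall gain R are cleanly separated.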


Fig. 2.3 The flowchart for computing NASE (input signal, framing, windowing, FFT, subband decomposition, normalized audio spectral envelope)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 coefficients between loEdge and hiEdge = 16 kHz, and one coefficient above hiEdge)


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on the MFCC, OSC, and NASE trajectories to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let MFCC_i(l), 0 <= l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t\times(W/2)+n}(l)\, e^{-j2\pi mn/W}, \qquad 0 \le m < W,\; 0 \le l < L$  (25)

where M_t(m, l) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W = 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}_{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \qquad 0 \le m < W,\; 0 \le l < L$  (26)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands; in this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

\mathrm{MSP}^{\mathrm{MFCC}}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{MFCC}}(m, l),    (27)

\mathrm{MSV}^{\mathrm{MFCC}}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{MFCC}}(m, l),    (28)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

\mathrm{MSC}^{\mathrm{MFCC}}(j, l) = \mathrm{MSP}^{\mathrm{MFCC}}(j, l) - \mathrm{MSV}^{\mathrm{MFCC}}(j, l).    (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
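The contrast/valley computation of Eqs. (27)-(29) can be sketched as follows (an illustrative example, not the thesis implementation); the subband bin boundaries follow Table 2.4.

```python
import numpy as np

# Modulation-subband boundaries (FFT-bin indices) from Table 2.4:
# J = 8 logarithmically spaced subbands over bins [0, 256).
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16),
            (16, 32), (32, 64), (64, 128), (128, 256)]

def msc_msv(avg_mod_spec):
    """Peak, valley, and contrast per modulation subband, Eqs. (27)-(29).

    avg_mod_spec: (W, L) averaged modulation spectrogram (W >= 256).
    Returns MSC and MSV matrices, each of shape (J, L).
    """
    J, L = len(SUBBANDS), avg_mod_spec.shape[1]
    msp = np.empty((J, L))
    msv = np.empty((J, L))
    for j, (lo, hi) in enumerate(SUBBANDS):
        band = avg_mod_spec[lo:hi]        # bins of the j-th subband
        msp[j] = band.max(axis=0)         # modulation spectral peak
        msv[j] = band.min(axis=0)         # modulation spectral valley
    return msp - msv, msv                 # MSC = MSP - MSV, and MSV
```

With L = 20 MFCCs and J = 8 subbands, the MSC and MSV matrices together give the 2×20×8 = 320 MMFCC values stated above.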

Fig. 2.5: The flowchart for extracting MMFCC.

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC; the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let \mathrm{OSC}_i[d], 0 \le d < D, be the d-th OSC of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left| \sum_{n=0}^{W-1} \mathrm{OSC}_{tW/2+n}[d] \, e^{-j 2\pi m n / W} \right|, \quad 0 \le m < W, \; 0 \le d < D,    (30)

where M_t(m, d) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{\mathrm{OSC}}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \; 0 \le d < D,    (31)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands; in this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

\mathrm{MSP}^{\mathrm{OSC}}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{OSC}}(m, d),    (32)

\mathrm{MSV}^{\mathrm{OSC}}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{OSC}}(m, d),    (33)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

\mathrm{MSC}^{\mathrm{OSC}}(j, d) = \mathrm{MSP}^{\mathrm{OSC}}(j, d) - \mathrm{MSV}^{\mathrm{OSC}}(j, d).    (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6: The flowchart for extracting MOSC.

2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE; the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let \mathrm{NASE}_i[d], 0 \le d < D, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left| \sum_{n=0}^{W-1} \mathrm{NASE}_{tW/2+n}[d] \, e^{-j 2\pi m n / W} \right|, \quad 0 \le m < W, \; 0 \le d < D,    (35)

where M_t(m, d) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{\mathrm{NASE}}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \; 0 \le d < D,    (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4); in this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

\mathrm{MSP}^{\mathrm{NASE}}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{NASE}}(m, d),    (37)

\mathrm{MSV}^{\mathrm{NASE}}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{NASE}}(m, d),    (38)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

\mathrm{MSC}^{\mathrm{NASE}}(j, d) = \mathrm{MSP}^{\mathrm{NASE}}(j, d) - \mathrm{MSV}^{\mathrm{NASE}}(j, d).    (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

[Figure: flowchart — the music signal is framed into s_1[n], ..., s_I[n]; NASE extraction yields NASE_1[d], ..., NASE_I[d]; a DFT over each texture window produces the modulation spectra M_t[m], which are averaged and passed to contrast/valley determination.]

Fig. 2.7: The flowchart for extracting MASE.

Table 2.4: Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2) | [0, 0.33)
1 | [2, 4) | [0.33, 0.66)
2 | [4, 8) | [0.66, 1.32)
3 | [8, 16) | [1.32, 2.64)
4 | [16, 32) | [2.64, 5.28)
5 | [32, 64) | [5.28, 10.56)
6 | [64, 128) | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 \le l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{\mathrm{MSC\text{-}row}}^{\mathrm{MFCC}}(l) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{\mathrm{MFCC}}(j, l),    (40)

\sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{MFCC}}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSC}^{\mathrm{MFCC}}(j, l) - \mu_{\mathrm{MSC\text{-}row}}^{\mathrm{MFCC}}(l) \right)^2 \right)^{1/2},    (41)

\mu_{\mathrm{MSV\text{-}row}}^{\mathrm{MFCC}}(l) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{\mathrm{MFCC}}(j, l),    (42)

\sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{MFCC}}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSV}^{\mathrm{MFCC}}(j, l) - \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{MFCC}}(l) \right)^2 \right)^{1/2}.    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

\mathbf{f}_{\mathrm{row}}^{\mathrm{MFCC}} = [\mu_{\mathrm{MSC\text{-}row}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{MFCC}}(0), \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{MFCC}}(0), \ldots, \mu_{\mathrm{MSC\text{-}row}}^{\mathrm{MFCC}}(L-1), \sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{MFCC}}(L-1), \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{MFCC}}(L-1), \sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{MFCC}}(L-1)]^{\mathrm{T}}.    (44)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{\mathrm{MSC\text{-}col}}^{\mathrm{MFCC}}(j) = \frac{1}{L} \sum_{l=0}^{L-1} \mathrm{MSC}^{\mathrm{MFCC}}(j, l),    (45)

\sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{MFCC}}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( \mathrm{MSC}^{\mathrm{MFCC}}(j, l) - \mu_{\mathrm{MSC\text{-}col}}^{\mathrm{MFCC}}(j) \right)^2 \right)^{1/2},    (46)

\mu_{\mathrm{MSV\text{-}col}}^{\mathrm{MFCC}}(j) = \frac{1}{L} \sum_{l=0}^{L-1} \mathrm{MSV}^{\mathrm{MFCC}}(j, l),    (47)

\sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{MFCC}}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( \mathrm{MSV}^{\mathrm{MFCC}}(j, l) - \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{MFCC}}(j) \right)^2 \right)^{1/2}.    (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

\mathbf{f}_{\mathrm{col}}^{\mathrm{MFCC}} = [\mu_{\mathrm{MSC\text{-}col}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{MFCC}}(0), \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{MFCC}}(0), \ldots, \mu_{\mathrm{MSC\text{-}col}}^{\mathrm{MFCC}}(J-1), \sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{MFCC}}(J-1), \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{MFCC}}(J-1), \sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{MFCC}}(J-1)]^{\mathrm{T}}.    (49)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L + 4J) is obtained:

\mathbf{f}^{\mathrm{MFCC}} = [(\mathbf{f}_{\mathrm{row}}^{\mathrm{MFCC}})^{\mathrm{T}}, (\mathbf{f}_{\mathrm{col}}^{\mathrm{MFCC}})^{\mathrm{T}}]^{\mathrm{T}}.    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L + 4J. That is, the overall feature dimension of SMMFCC is 80 + 32 = 112.
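The row- and column-wise aggregation of Eqs. (40)-(50) can be sketched as below (an illustrative example, not the thesis code). For brevity, the sketch groups the entries by statistic rather than interleaving them per feature value as in Eqs. (44) and (49); the resulting dimension is the same.

```python
import numpy as np

def aggregate(msc, msv):
    """Statistical aggregation of the MSC and MSV matrices.

    msc, msv: (J, L) matrices (J modulation subbands, L feature values).
    Row-based part (mean/std over subbands, per feature value): length 4L.
    Column-based part (mean/std over feature values, per subband): length 4J.
    Returns their concatenation, length 4L + 4J, cf. Eq. (50).
    """
    row = np.concatenate([msc.mean(axis=0), msc.std(axis=0),
                          msv.mean(axis=0), msv.std(axis=0)])
    col = np.concatenate([msc.mean(axis=1), msc.std(axis=1),
                          msv.mean(axis=1), msv.std(axis=1)])
    return np.concatenate([row, col])
```

With J = 8 and L = 20 this yields the 80 + 32 = 112 SMMFCC values stated above.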

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{\mathrm{MSC\text{-}row}}^{\mathrm{OSC}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{\mathrm{OSC}}(j, d),    (51)

\sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{OSC}}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSC}^{\mathrm{OSC}}(j, d) - \mu_{\mathrm{MSC\text{-}row}}^{\mathrm{OSC}}(d) \right)^2 \right)^{1/2},    (52)

\mu_{\mathrm{MSV\text{-}row}}^{\mathrm{OSC}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{\mathrm{OSC}}(j, d),    (53)

\sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{OSC}}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSV}^{\mathrm{OSC}}(j, d) - \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{OSC}}(d) \right)^2 \right)^{1/2}.    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

\mathbf{f}_{\mathrm{row}}^{\mathrm{OSC}} = [\mu_{\mathrm{MSC\text{-}row}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{OSC}}(0), \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{OSC}}(0), \ldots, \mu_{\mathrm{MSC\text{-}row}}^{\mathrm{OSC}}(D-1), \sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{OSC}}(D-1), \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{OSC}}(D-1), \sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{OSC}}(D-1)]^{\mathrm{T}}.    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{\mathrm{MSC\text{-}col}}^{\mathrm{OSC}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSC}^{\mathrm{OSC}}(j, d),    (56)

\sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{OSC}}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSC}^{\mathrm{OSC}}(j, d) - \mu_{\mathrm{MSC\text{-}col}}^{\mathrm{OSC}}(j) \right)^2 \right)^{1/2},    (57)

\mu_{\mathrm{MSV\text{-}col}}^{\mathrm{OSC}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSV}^{\mathrm{OSC}}(j, d),    (58)

\sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{OSC}}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSV}^{\mathrm{OSC}}(j, d) - \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{OSC}}(j) \right)^2 \right)^{1/2}.    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

\mathbf{f}_{\mathrm{col}}^{\mathrm{OSC}} = [\mu_{\mathrm{MSC\text{-}col}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{OSC}}(0), \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{OSC}}(0), \ldots, \mu_{\mathrm{MSC\text{-}col}}^{\mathrm{OSC}}(J-1), \sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{OSC}}(J-1), \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{OSC}}(J-1), \sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{OSC}}(J-1)]^{\mathrm{T}}.    (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) is obtained:

\mathbf{f}^{\mathrm{OSC}} = [(\mathbf{f}_{\mathrm{row}}^{\mathrm{OSC}})^{\mathrm{T}}, (\mathbf{f}_{\mathrm{col}}^{\mathrm{OSC}})^{\mathrm{T}}]^{\mathrm{T}}.    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{\mathrm{MSC\text{-}row}}^{\mathrm{NASE}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{\mathrm{NASE}}(j, d),    (62)

\sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{NASE}}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSC}^{\mathrm{NASE}}(j, d) - \mu_{\mathrm{MSC\text{-}row}}^{\mathrm{NASE}}(d) \right)^2 \right)^{1/2},    (63)

\mu_{\mathrm{MSV\text{-}row}}^{\mathrm{NASE}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{\mathrm{NASE}}(j, d),    (64)

\sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{NASE}}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSV}^{\mathrm{NASE}}(j, d) - \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{NASE}}(d) \right)^2 \right)^{1/2}.    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

\mathbf{f}_{\mathrm{row}}^{\mathrm{NASE}} = [\mu_{\mathrm{MSC\text{-}row}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{NASE}}(0), \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{NASE}}(0), \ldots, \mu_{\mathrm{MSC\text{-}row}}^{\mathrm{NASE}}(D-1), \sigma_{\mathrm{MSC\text{-}row}}^{\mathrm{NASE}}(D-1), \mu_{\mathrm{MSV\text{-}row}}^{\mathrm{NASE}}(D-1), \sigma_{\mathrm{MSV\text{-}row}}^{\mathrm{NASE}}(D-1)]^{\mathrm{T}}.    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{\mathrm{MSC\text{-}col}}^{\mathrm{NASE}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSC}^{\mathrm{NASE}}(j, d),    (67)

\sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{NASE}}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSC}^{\mathrm{NASE}}(j, d) - \mu_{\mathrm{MSC\text{-}col}}^{\mathrm{NASE}}(j) \right)^2 \right)^{1/2},    (68)

\mu_{\mathrm{MSV\text{-}col}}^{\mathrm{NASE}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSV}^{\mathrm{NASE}}(j, d),    (69)

\sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{NASE}}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSV}^{\mathrm{NASE}}(j, d) - \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{NASE}}(j) \right)^2 \right)^{1/2}.    (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

\mathbf{f}_{\mathrm{col}}^{\mathrm{NASE}} = [\mu_{\mathrm{MSC\text{-}col}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{NASE}}(0), \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{NASE}}(0), \ldots, \mu_{\mathrm{MSC\text{-}col}}^{\mathrm{NASE}}(J-1), \sigma_{\mathrm{MSC\text{-}col}}^{\mathrm{NASE}}(J-1), \mu_{\mathrm{MSV\text{-}col}}^{\mathrm{NASE}}(J-1), \sigma_{\mathrm{MSV\text{-}col}}^{\mathrm{NASE}}(J-1)]^{\mathrm{T}}.    (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) is obtained:

\mathbf{f}^{\mathrm{NASE}} = [(\mathbf{f}_{\mathrm{row}}^{\mathrm{NASE}})^{\mathrm{T}}, (\mathbf{f}_{\mathrm{col}}^{\mathrm{NASE}})^{\mathrm{T}}]^{\mathrm{T}}.    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMASE is 76 + 32 = 108.

[Figure: the MSC/MSV matrix with feature dimension as rows and modulation frequency (texture window) as columns; the mean μ_row,d and standard deviation σ_row,d are computed along each row, d = 1, ..., D.]

Fig. 2.8: The row-based aggregation of the modulation spectral feature values.

[Figure: the same MSC/MSV matrix viewed per column; the mean μ_col,j and standard deviation σ_col,j are computed along each column (modulation subband), j = 1, ..., J.]

Fig. 2.9: The column-based aggregation of the modulation spectral feature values.

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{\mathbf{f}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{f}_{c,n},    (73)

where \mathbf{f}_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{\mathbf{f}}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{\mathbf{f}}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C,    (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{\max}(m) = \max_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m),    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
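The min-max mapping of Eqs. (74)-(75) can be sketched as follows (an illustrative example; the function name and the guard for constant features are assumptions, not from the thesis).

```python
import numpy as np

def minmax_normalize(train_vectors):
    """Linear (min-max) normalization of feature values, Eqs. (74)-(75).

    train_vectors: (N, M) matrix stacking all training feature vectors.
    Returns the normalized matrix and the per-feature (min, max) pair
    needed to apply the same mapping to test vectors.
    """
    f_min = train_vectors.min(axis=0)
    f_max = train_vectors.max(axis=0)
    # Guard against division by zero for constant features (an added
    # safeguard, not part of Eq. (74)).
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (train_vectors - f_min) / span, (f_min, f_max)
```

Test vectors must be normalized with the training-set minima and maxima, not their own, so that train and test features share one scale.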

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h \le H) has to be found in order to provide higher discriminability among the music classes.

Let \mathbf{S}_W and \mathbf{S}_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

\mathbf{S}_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^{\mathrm{T}},    (76)

where \mathbf{x}_{c,n} is the n-th feature vector labeled as class c, \bar{\mathbf{x}}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^{\mathrm{T}},    (77)

where \bar{\mathbf{x}} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(\mathbf{A}) = \mathrm{tr}\left( (\mathbf{A}^{\mathrm{T}} \mathbf{S}_W \mathbf{A})^{-1} (\mathbf{A}^{\mathrm{T}} \mathbf{S}_B \mathbf{A}) \right).    (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of \mathbf{S}_W are calculated. Let \mathbf{\Phi} denote the matrix whose columns are the orthonormal eigenvectors of \mathbf{S}_W and \mathbf{\Lambda} the diagonal matrix formed by the corresponding eigenvalues; thus \mathbf{S}_W \mathbf{\Phi} = \mathbf{\Phi} \mathbf{\Lambda}. Each training vector \mathbf{x} is then whitening transformed by \mathbf{\Phi} \mathbf{\Lambda}^{-1/2}:

\mathbf{x}_w = (\mathbf{\Phi} \mathbf{\Lambda}^{-1/2})^{\mathrm{T}} \mathbf{x}.    (79)

It can be shown that the whitened within-class scatter matrix \mathbf{S}_W^w = (\mathbf{\Phi} \mathbf{\Lambda}^{-1/2})^{\mathrm{T}} \mathbf{S}_W (\mathbf{\Phi} \mathbf{\Lambda}^{-1/2}) derived from all the whitened training vectors becomes an identity matrix \mathbf{I}. Thus the whitened between-class scatter matrix \mathbf{S}_B^w = (\mathbf{\Phi} \mathbf{\Lambda}^{-1/2})^{\mathrm{T}} \mathbf{S}_B (\mathbf{\Phi} \mathbf{\Lambda}^{-1/2}) contains all the discriminative information. A transformation matrix \mathbf{\Psi} can be determined by finding the eigenvectors of \mathbf{S}_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues form the column vectors of the transformation matrix \mathbf{\Psi}. Finally, the optimal whitened LDA transformation matrix \mathbf{A}_{\mathrm{WLDA}} is defined as

\mathbf{A}_{\mathrm{WLDA}} = \mathbf{\Phi} \mathbf{\Lambda}^{-1/2} \mathbf{\Psi}.    (80)

\mathbf{A}_{\mathrm{WLDA}} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let \mathbf{x} denote an H-dimensional feature vector; the reduced h-dimensional feature vector is computed as

\mathbf{y} = \mathbf{A}_{\mathrm{WLDA}}^{\mathrm{T}} \mathbf{x}.    (81)
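The whitening-plus-LDA procedure of Eqs. (76)-(80) can be sketched in NumPy as follows. This is an illustrative example, not the thesis implementation; it assumes the within-class scatter matrix is non-singular (full rank), as the eigenvalue inverse square root otherwise fails.

```python
import numpy as np

def whitened_lda(X, y, n_components):
    """Whitened LDA transformation matrix, Eqs. (76)-(80).

    X: (N, H) training vectors; y: (N,) integer class labels.
    Returns A of shape (H, n_components), n_components <= C - 1.
    """
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        Sw += d.T @ d                              # within-class scatter, Eq. (76)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)            # between-class scatter, Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                  # Sw = Phi Lambda Phi^T
    Wh = Phi @ np.diag(1.0 / np.sqrt(lam))         # whitening map Phi Lambda^{-1/2}
    Sb_w = Wh.T @ Sb @ Wh                          # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(lam_b)[::-1][:n_components] # largest eigenvalues first
    return Wh @ Psi[:, order]                      # A_WLDA, Eq. (80)
```

Each feature vector is then reduced via y = A.T @ x as in Eq. (81).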

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix \mathbf{A}_{\mathrm{WLDA}}. Let \mathbf{y} denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 \le c \le C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{\mathbf{y}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{y}_{c,n},    (82)

where \mathbf{y}_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{\mathbf{y}}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to \mathbf{y}:

s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c).    (83)
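The nearest-centroid decision of Eqs. (82)-(83) is a few lines of NumPy; the sketch below is illustrative (function names are assumptions), with one centroid per genre and a Euclidean-distance argmin.

```python
import numpy as np

def fit_centroids(Y, labels):
    """Per-genre centroids of the transformed feature vectors, Eq. (82)."""
    classes = np.unique(labels)
    centroids = np.vstack([Y[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify(y, classes, centroids):
    """Nearest-centroid decision under Euclidean distance, Eq. (83)."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return classes[np.argmin(dists)]
```

Because the vectors have already been whitened, plain Euclidean distance here plays the role a Mahalanobis distance would play in the original space.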

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. The music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c,    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
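Eq. (84) is simply a class-prior-weighted average of the per-class accuracies, as the short illustrative helper below shows (names are assumptions for this example).

```python
def overall_accuracy(per_class_acc, class_counts):
    """Overall accuracy as the class-prior-weighted mean, Eq. (84).

    per_class_acc: per-genre accuracies CA_c (fractions in [0, 1]).
    class_counts:  number of test tracks per genre; the priors P_c are
                   the counts divided by the total number of tracks.
    """
    total = sum(class_counts)
    return sum((n / total) * a for a, n in zip(per_class_acc, class_counts))

# Example: two classes with 75 and 25 tracks, accuracies 100% and 50%.
acc = overall_accuracy([1.0, 0.5], [75, 25])  # 0.75*1.0 + 0.25*0.5 = 0.875
```

This weighting means the large Classical class (320 of 729 test tracks) dominates the reported CA far more than the small Jazz/Blues class.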

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and that the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1: Averaged classification accuracy for the row-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64

Table 3.2: Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix gives track counts and the second the per-class percentages; rows are predicted genres, columns are actual genres (Classic, Electronic, Jazz, Metal/Punk, Pop/Rock, World).

(a) SMMFCC1 (counts)
Classic    275   0   2   0   1  19
Electronic   0  91   0   1   7   6
Jazz         6   0  18   0   0   4
Metal/Punk   2   3   0  36  20   4
Pop/Rock     4  12   5   8  70  14
World       33   8   1   0   4  75
Total      320 114  26  45 102 122

(a) SMMFCC1 (%)
Classic    85.94  0.00  7.69  0.00  0.98 15.57
Electronic  0.00 79.82  0.00  2.22  6.86  4.92
Jazz        1.88  0.00 69.23  0.00  0.00  3.28
Metal/Punk  0.63  2.63  0.00 80.00 19.61  3.28
Pop/Rock    1.25 10.53 19.23 17.78 68.63 11.48
World      10.31  7.02  3.85  0.00  3.92 61.48

(b) SMOSC1 (counts)
Classic    292   1   1   0   2  10
Electronic   1  89   1   2  11  11
Jazz         4   0  19   1   1   6
Metal/Punk   0   5   0  32  21   3
Pop/Rock     0  13   3  10  61   8
World       23   6   2   0   6  84
Total      320 114  26  45 102 122

(b) SMOSC1 (%)
Classic    91.25  0.88  3.85  0.00  1.96  8.20
Electronic  0.31 78.07  3.85  4.44 10.78  9.02
Jazz        1.25  0.00 73.08  2.22  0.98  4.92
Metal/Punk  0.00  4.39  0.00 71.11 20.59  2.46
Pop/Rock    0.00 11.40 11.54 22.22 59.80  6.56
World       7.19  5.26  7.69  0.00  5.88 68.85

(c) SMASE1 (counts)
Classic    286   3   1   0   3  18
Electronic   0  87   1   1   9   5
Jazz         5   4  17   0   0   9
Metal/Punk   0   4   1  36  18   4
Pop/Rock     1  10   3   7  68  13
World       28   6   3   1   4  73
Total      320 114  26  45 102 122

(c) SMASE1 (%)
Classic    89.38  2.63  3.85  0.00  2.94 14.75
Electronic  0.00 76.32  3.85  2.22  8.82  4.10
Jazz        1.56  3.51 65.38  0.00  0.00  7.38
Metal/Punk  0.00  3.51  3.85 80.00 17.65  3.28
Pop/Rock    0.31  8.77 11.54 15.56 66.67 10.66
World       8.75  5.26 11.54  2.22  3.92 59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
Classic    300   0   1   0   0   9
Electronic   0  96   1   1   9   9
Jazz         2   1  21   0   0   1
Metal/Punk   0   1   0  34   8   1
Pop/Rock     1   9   2   9  80  16
World       17   7   1   1   5  86
Total      320 114  26  45 102 122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
Classic    93.75  0.00  3.85  0.00  0.00  7.38
Electronic  0.00 84.21  3.85  2.22  8.82  7.38
Jazz        0.63  0.88 80.77  0.00  0.00  0.82
Metal/Punk  0.00  0.88  0.00 75.56  7.84  0.82
Pop/Rock    0.31  7.89  7.69 20.00 78.43 13.11
World       5.31  6.14  3.85  2.22  4.90 70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3: Averaged classification accuracy for the column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60

Table 3.4: Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix gives track counts and the second the per-class percentages; rows are predicted genres, columns are actual genres (Classic, Electronic, Jazz, Metal/Punk, Pop/Rock, World).

(a) SMMFCC2 (counts)
Classic    272   1   1   0   6  22
Electronic   0  84   0   2   8   4
Jazz        13   1  19   1   2  19
Metal/Punk   2   7   0  39  30   4
Pop/Rock     0  11   3   3  47  19
World       33  10   3   0   9  54
Total      320 114  26  45 102 122

(a) SMMFCC2 (%)
Classic    85.00  0.88  3.85  0.00  5.88 18.03
Electronic  0.00 73.68  0.00  4.44  7.84  3.28
Jazz        4.06  0.88 73.08  2.22  1.96 15.57
Metal/Punk  0.63  6.14  0.00 86.67 29.41  3.28
Pop/Rock    0.00  9.65 11.54  6.67 46.08 15.57
World      10.31  8.77 11.54  0.00  8.82 44.26

(b) SMOSC2 (counts)
Classic    262   2   0   0   3  33
Electronic   0  83   0   1   9   6
Jazz        17   1  20   0   6  20
Metal/Punk   1   5   0  33  21   2
Pop/Rock     0  17   4  10  51  10
World       40   6   2   1  12  51
Total      320 114  26  45 102 122

(b) SMOSC2 (%)
Classic    81.88  1.75  0.00  0.00  2.94 27.05
Electronic  0.00 72.81  0.00  2.22  8.82  4.92
Jazz        5.31  0.88 76.92  0.00  5.88 16.39
Metal/Punk  0.31  4.39  0.00 73.33 20.59  1.64
Pop/Rock    0.00 14.91 15.38 22.22 50.00  8.20
World      12.50  5.26  7.69  2.22 11.76 41.80

(c) SMASE2 (counts)
Classic    277   0   0   0   2  29
Electronic   0  83   0   1   5   2
Jazz         9   3  17   1   2  15
Metal/Punk   1   5   1  35  24   7
Pop/Rock     2  13   1   8  57  15
World       31  10   7   0  12  54
Total      320 114  26  45 102 122

(c) SMASE2 (%)
Classic    86.56  0.00  0.00  0.00  1.96 23.77
Electronic  0.00 72.81  0.00  2.22  4.90  1.64
Jazz        2.81  2.63 65.38  2.22  1.96 12.30
Metal/Punk  0.31  4.39  3.85 77.78 23.53  5.74
Pop/Rock    0.63 11.40  3.85 17.78 55.88 12.30
World       9.69  8.77 26.92  0.00 11.76 44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
Classic    289   5   0   0   3  18
Electronic   0  89   0   2   4   4
Jazz         2   3  19   0   1  10
Metal/Punk   2   2   0  38  21   2
Pop/Rock     0  12   5   4  61  11
World       27   3   2   1  12  77
Total      320 114  26  45 102 122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
Classic    90.31  4.39  0.00  0.00  2.94 14.75
Electronic  0.00 78.07  0.00  4.44  3.92  3.28
Jazz        0.63  2.63 73.08  0.00  0.98  8.20
Metal/Punk  0.63  1.75  0.00 84.44 20.59  1.64
Pop/Rock    0.00 10.53 19.23  8.89 59.80  9.02
World       8.44  2.63  7.69  2.22 11.76 63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5: Averaged classification accuracy for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32

Table 3.6: Confusion matrices of the combined row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix gives track counts and the second the per-class percentages; rows are predicted genres, columns are actual genres (Classic, Electronic, Jazz, Metal/Punk, Pop/Rock, World).

(a) SMMFCC3 (counts)
Classic    300   2   1   0   3  19
Electronic   0  86   0   1   7   5
Jazz         2   0  18   0   0   3
Metal/Punk   1   4   0  35  18   2
Pop/Rock     1  16   4   8  67  13
World       16   6   3   1   7  80
Total      320 114  26  45 102 122

(a) SMMFCC3 (%)
Classic    93.75  1.75  3.85  0.00  2.94 15.57
Electronic  0.00 75.44  0.00  2.22  6.86  4.10
Jazz        0.63  0.00 69.23  0.00  0.00  2.46
Metal/Punk  0.31  3.51  0.00 77.78 17.65  1.64
Pop/Rock    0.31 14.04 15.38 17.78 65.69 10.66
World       5.00  5.26 11.54  2.22  6.86 65.57

(b) SMOSC3 (counts)
Classic    300   0   0   0   1  13
Electronic   0  90   1   2   9   6
Jazz         0   0  21   0   0   4
Metal/Punk   0   2   0  31  21   2
Pop/Rock     0  11   3  10  64  10
World       20  11   1   2   7  87
Total      320 114  26  45 102 122

(b) SMOSC3 (%)
Classic    93.75  0.00  0.00  0.00  0.98 10.66
Electronic  0.00 78.95  3.85  4.44  8.82  4.92
Jazz        0.00  0.00 80.77  0.00  0.00  3.28
Metal/Punk  0.00  1.75  0.00 68.89 20.59  1.64
Pop/Rock    0.00  9.65 11.54 22.22 62.75  8.20
World       6.25  9.65  3.85  4.44  6.86 71.31

(c) SMASE3 (counts)
Classic    296   2   1   0   0  17
Electronic   1  91   0   1   4   3
Jazz         0   2  19   0   0   5
Metal/Punk   0   2   1  34  20   8
Pop/Rock     2  13   4   8  71   8
World       21   4   1   2   7  81
Total      320 114  26  45 102 122

(c) SMASE3 (%)
Classic    92.50  1.75  3.85  0.00  0.00 13.93
Electronic  0.31 79.82  0.00  2.22  3.92  2.46
Jazz        0.00  1.75 73.08  0.00  0.00  4.10
Metal/Punk  0.00  1.75  3.85 75.56 19.61  6.56
Pop/Rock    0.63 11.40 15.38 17.78 69.61  6.56
World       6.56  3.51  3.85  4.44  6.86 66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
Classic    300   2   0   0   0   8
Electronic   2  95   0   2   7   9
Jazz         1   1  20   0   0   0
Metal/Punk   0   0   0  35  10   1
Pop/Rock     1  10   3   7  79  11
World       16   6   3   1   6  93
Total      320 114  26  45 102 122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
Classic    93.75  1.75  0.00  0.00  0.00  6.56
Electronic  0.63 83.33  0.00  4.44  6.86  7.38
Jazz        0.31  0.88 76.92  0.00  0.00  0.00
Metal/Punk  0.00  0.00  0.00 77.78  9.80  0.82
Pop/Rock    0.31  8.77 11.54 15.56 77.45  9.02
World       5.00  5.26 11.54  2.22  5.88 76.23

Conventional methods use the energy of each modulation subband as the feature value, whereas here the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband are used instead. Table 3.7 compares the classification results of the two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC, and similarly for SMOSC and SMASE.

Table 3.7: Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set | MSCs & MSVs | MSE
SMMFCC1 | 77.50 | 72.02
SMMFCC2 | 70.64 | 69.82
SMMFCC3 | 80.38 | 79.15
SMOSC1 | 79.15 | 77.50
SMOSC2 | 68.59 | 70.51
SMOSC3 | 81.34 | 80.11
SMASE1 | 77.78 | 76.41
SMASE2 | 71.74 | 71.06
SMASE3 | 81.21 | 79.15
SMMFCC1+SMOSC1+SMASE1 | 84.64 | 85.08
SMMFCC2+SMOSC2+SMASE2 | 78.60 | 79.01
SMMFCC3+SMOSC3+SMASE3 | 85.32 | 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. of the ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, March 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. of the 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. of the Workshop on Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.



Acknowledgments

During my graduate studies I gained a modest understanding of the field of audio signal processing. More importantly, I learned the attitude and method of research: perseverance, determination, and a truth-seeking spirit. The one who led me to appreciate this deeper meaning is my advisor, Prof. 李建興. From him I learned the process of research and how to look at problems, and I experienced the joy of the moment when a bottleneck is finally broken through. During the writing of this thesis, I sincerely thank my advisor for reading and revising it many times so that it could be completed smoothly. I am also deeply grateful for his help in revising and proofreading my journal submission, often until late at night; his efforts are engraved in my heart. I would also like to thank Prof. 連振昌, Prof. 韓欽銓, Prof. 石昭玲, and Prof. 周智勳 for their guidance in my coursework and reports; I offer them my deepest thanks.

I would especially like to thank Prof. 吳翠霞 and Prof. Lilian; without their teaching and help, I would not be who I am today. I also thank my seniors 忠茂, 炳佑, 清乾, 正達, 昭偉, 建程, 家銘, and 靈逸 for their guidance; my classmates 銘輝, 岳岷, 佐民, 雅麟, and 佑維 for their mutual support and help; and my juniors 勝斌, 正崙, 偉欣, 明修, 信吉, 琮瑋, 仁政, 蘇峻, 雅婷, 佩蓉, 永坤, 致娟, and 堯文 for their company. Whether in coursework or in daily life, they gave me unforgettable memories and a wonderful research life.

Finally, I thank the mentors of my life: my father 林文鈴, who silently supported me; my mother 韓樹珍, who cared for me attentively; and my younger brother 懷志, who grew up with me. Under the pressure of research, your encouragement and support kept me going. With this thesis I offer you my deepest gratitude.


CONTENTS

ABSTRACT ............................................................... II
CONTENTS ............................................................... IV
CHAPTER 1  INTRODUCTION ................................................. 1
  1.1 Motivation ........................................................ 1
  1.2 Review of music genre classification systems ...................... 2
    1.2.1 Feature Extraction ............................................ 5
      1.2.1.1 Short-term features ....................................... 5
        1.2.1.1.1 Timbral features ...................................... 5
        1.2.1.1.2 Rhythmic features ..................................... 7
        1.2.1.1.3 Pitch features ........................................ 7
      1.2.1.2 Long-term features ........................................ 8
        1.2.1.2.1 Mean and standard deviation ........................... 8
        1.2.1.2.2 Autoregressive model .................................. 9
        1.2.1.2.3 Modulation spectrum analysis .......................... 9
    1.2.2 Linear discriminant analysis (LDA) ........................... 10
    1.2.3 Feature Classifier ........................................... 10
  1.3 Outline of Thesis ................................................ 13
CHAPTER 2  THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM .............. 13
  2.1 Feature Extraction ............................................... 14
    2.1.1 Mel-Frequency Cepstral Coefficients (MFCC) ................... 14
    2.1.2 Octave-based Spectral Contrast (OSC) ......................... 17
    2.1.3 Normalized Audio Spectral Envelope (NASE) .................... 19
    2.1.4 Modulation Spectral Analysis ................................. 23
      2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC) ............. 23
      2.1.4.2 Modulation Spectral Contrast of OSC (MOSC) ............... 25
      2.1.4.3 Modulation Spectral Contrast of NASE (MASE) .............. 27
    2.1.5 Statistical Aggregation of Modulation Spectral Feature Values 30
      2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC) ................ 30
      2.1.5.2 Statistical Aggregation of MOSC (SMOSC) .................. 32
      2.1.5.3 Statistical Aggregation of MASE (SMASE) .................. 33
    2.1.6 Feature vector normalization ................................. 35
  2.2 Linear discriminant analysis ..................................... 36
  2.3 Music Genre Classification phase ................................. 38
CHAPTER 3  EXPERIMENT RESULTS .......................................... 39
  3.1 Comparison of row-based modulation spectral feature vectors ...... 40
  3.2 Comparison of column-based modulation spectral feature vectors ... 42
  3.3 Combination of row-based and column-based modulation spectral
      feature vectors .................................................. 45
CHAPTER 4  CONCLUSION .................................................. 48
REFERENCES ............................................................. 49


Chapter 1

Introduction

1.1 Motivation

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, it is helpful for managing a vast number of music tracks when they are properly categorized in advance. In general, retail or online music stores organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers, but determining the music genre of a music track in this way is laborious and time-consuming work. Therefore, a number of supervised classification techniques have been developed for the automatic classification of unlabeled music tracks [1-11]. Thus, in this study we focus on the music genre classification problem, which is defined as the genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies try to examine a set of classifiers to improve the classification performance; however, the resulting improvement is limited. In fact, employing effective feature sets has a much greater effect on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to clearly define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) people often prefer to search for music by browsing hierarchical catalogs; (2) taxonomy structures identify the relationships or dependence between the music genres, so hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves the classification efficiency and accuracy; (3) the classification errors become more acceptable by using a taxonomy than by direct music genre classification, since the coarse-to-fine approach concentrates the classification errors at a given level of the hierarchy.
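The coarse-to-fine idea above can be sketched as a walk down a taxonomy tree, making one decision per branch point. The taxonomy, genre names, and the lambda "classifiers" below are hypothetical stand-ins for trained models at each branch.

```python
# Hypothetical taxonomy: each internal node maps to its child genres.
taxonomy = {
    "music": ["classical", "non-classical"],
    "non-classical": ["rock", "electronic"],
}

def classify_hierarchical(features, taxonomy, node_classifiers, root="music"):
    """Walk the taxonomy from the root, making one successive decision
    at each branch point until a leaf genre is reached."""
    node = root
    while node in taxonomy:                      # internal node: decide again
        node = node_classifiers[node](features)  # pick one child genre
    return node                                  # leaf genre

# Stub per-node classifiers; each branch point may use different features.
classifiers = {
    "music": lambda f: "classical" if f["spectral_flux"] < 0.5 else "non-classical",
    "non-classical": lambda f: "rock" if f["zcr"] > 0.3 else "electronic",
}
```

Because each node has its own classifier, appropriate and distinct features can indeed be employed at each branch point, as noted above.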

Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and distinct features can be employed at each branch point of the taxonomy. The hierarchical classification approach therefore allows the managers to trace the level at which classification errors occur frequently. Barbedo and Lopes [14] also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed bottom-up instead of top-down, because it is easier to merge leaf classes into the same parent class in a bottom-up structure, so the upper layers can be constructed easily. In their experimental results, the hierarchical bottom-up approach outperforms the top-down approach by about 3% to 5% in classification accuracy.

Li and Ogihara [15] investigated the effect of two different taxonomy structures on music genre classification. They also proposed an approach to the automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach can reduce the time-consuming and expensive task of manually constructing taxonomies. It also helps when dealing with music collections for which there are no natural taxonomies [16]. Given a genre taxonomy, many different approaches have been proposed to classify the music genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy

Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system

1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, covering timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentations or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows:

(1) Low-Energy Feature: it is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

$$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\,\right|$$

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
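The ZCR definition above translates directly into a few lines of code; this is a minimal sketch operating on one frame of samples.

```python
def sign(x):
    # Per the definition above: 1 for positive input, 0 for negative input.
    return 1 if x >= 0 else 0

def zero_crossing_rate(frame):
    """ZCR_t = (1/2) * sum over n of |sign(x[n]) - sign(x[n-1])|."""
    return 0.5 * sum(abs(sign(frame[n]) - sign(frame[n - 1]))
                     for n in range(1, len(frame)))
```

A rapidly alternating (noisy) frame yields a high ZCR, while a slowly varying frame yields a low one.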

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

$$C_t = \frac{\sum_{n=1}^{N} n \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth measures the frequency bandwidth of the signal:

$$SB_t = \frac{\sum_{n=1}^{N} \left(n - C_t\right)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency bin R_t below which 85% of the magnitude distribution is concentrated:

$$\sum_{k=0}^{R_t} S[k] = 0.85 \times \sum_{k=0}^{N-1} S[k]$$

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

$$SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2$$

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
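The four spectral-shape features above (centroid, bandwidth, roll-off, flux) can be sketched as follows, taking a list of magnitude values per frame; bins are indexed from 1 as in the centroid formula.

```python
def spectral_centroid(M):
    # C_t = sum(n * M[n]) / sum(M[n]), with bins indexed from 1
    return sum((n + 1) * m for n, m in enumerate(M)) / sum(M)

def spectral_bandwidth(M, C):
    # Magnitude-weighted squared deviation of bin index from the centroid
    return sum(((n + 1) - C) ** 2 * m for n, m in enumerate(M)) / sum(M)

def spectral_rolloff(S, ratio=0.85):
    # Smallest bin R with cumulative magnitude >= ratio of the total
    target = ratio * sum(S)
    acc = 0.0
    for k, s in enumerate(S):
        acc += s
        if acc >= target:
            return k

def spectral_flux(N_t, N_prev):
    # Squared difference between successive normalized magnitude spectra
    return sum((a - b) ** 2 for a, b in zip(N_t, N_prev))
```

For example, a spectrum with all its energy in one bin has a centroid at that bin and zero bandwidth.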

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone; the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.

(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each sub-band separately, and it can roughly reflect the distribution of harmonic and non-harmonic components.
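A minimal sketch of the per-subband peak/valley idea behind OSC is shown below. The choice of log-averaging the strongest and weakest `alpha` fraction of bins is an assumption for illustration; the exact estimator in [3] may differ.

```python
import math

def osc_band(magnitudes, alpha=0.2):
    """Spectral peak and valley of one subband: log-average of the strongest
    and weakest alpha fraction of FFT magnitudes; the contrast is their
    difference. Returns (contrast, valley)."""
    srt = sorted(magnitudes, reverse=True)
    k = max(1, int(alpha * len(srt)))
    peak = math.log10(sum(srt[:k]) / k + 1e-12)    # strongest bins
    valley = math.log10(sum(srt[-k:]) / k + 1e-12)  # weakest bins
    return peak - valley, valley
```

A subband dominated by strong harmonic peaks over a quiet noise floor yields a high contrast, which is the behavior the prose above describes.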

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the root mean square (RMS) energy, yielding a normalized version of the ASE called NASE.
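The ASE-to-NASE normalization just described can be sketched as follows. The exact order of the log and summation follows one plausible reading of the description above (band power summed, then taken in log scale); `subbands` is a hypothetical list of bin-index ranges for the logarithmic bands.

```python
import math

def nase(power_spectrum, subbands):
    """ASE: log-scale energy of each logarithmic subband.
    NASE: each ASE coefficient divided by the RMS energy of the ASE vector.
    Returns (nase_vector, rms). A sketch of the MPEG-7 style computation."""
    ase = [math.log10(sum(power_spectrum[lo:hi]) + 1e-12) for lo, hi in subbands]
    rms = math.sqrt(sum(a * a for a in ase))
    return [a / rms for a in ase], rms
```

By construction the NASE vector has unit norm, so the RMS value is usually kept alongside it as an extra coefficient.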

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the periods of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers; the main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find the representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, autoregressive models [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most commonly used method to integrate short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \qquad 0 \le d \le D-1$$

$$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d] - \mu[d]\right)^2\right]^{1/2}, \qquad 0 \le d \le D-1$$

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationships between features or about the time-varying behavior of music signals.
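The two aggregation equations above amount to a per-dimension mean and standard deviation across all frames:

```python
import math

def aggregate(frames):
    """Mean and standard deviation of each of the D feature dimensions
    across the T frames, as in the equations above."""
    T, D = len(frames), len(frames[0])
    mu = [sum(f[d] for f in frames) / T for d in range(D)]
    sigma = [math.sqrt(sum((f[d] - mu[d]) ** 2 for f in frames) / T)
             for d in range(D)]
    return mu, sigma
```

Note that any temporal ordering of the frames is discarded, which is exactly the limitation the text points out.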

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used AR models to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model; the extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by a single MAR model. The difference between the MAR model and the AR model is that MAR considers the relationships between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. Note that for a p-order MAR model, the coefficient dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
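As a minimal illustration of the DAR idea (each feature dimension modeled independently), the sketch below fits an order-1 AR coefficient to one zero-mean feature trajectory by least squares; Meng et al.'s actual models are higher-order and include means and variances as well.

```python
def dar_order1(trajectory):
    """Least-squares order-1 AR coefficient a of one zero-mean feature
    trajectory x, minimizing sum (x[t] - a*x[t-1])^2."""
    num = sum(trajectory[t] * trajectory[t - 1] for t in range(1, len(trajectory)))
    den = sum(trajectory[t - 1] ** 2 for t in range(1, len(trajectory)))
    return num / den
```

A geometrically decaying trajectory x[t] = a^t is recovered exactly, since every sample is a times its predecessor.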

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the variability of signal features along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition; it has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification and showed that modulation-scale features, along with subband normalization, are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.
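In essence, a modulation spectrum is the spectrum of a feature-value trajectory taken across frames rather than across samples. The sketch below computes it with a naive DFT; mapping bin q to a modulation frequency in Hz as q × frame_rate / T is an assumption stated for orientation.

```python
import math

def modulation_spectrum(trajectory):
    """Magnitude DFT of one feature trajectory across T frames.
    Bin q corresponds to modulation frequency q * frame_rate / T (Hz)."""
    T = len(trajectory)
    mags = []
    for q in range(T // 2 + 1):  # real input: keep non-negative frequencies
        re = sum(x * math.cos(2 * math.pi * q * t / T)
                 for t, x in enumerate(trajectory))
        im = -sum(x * math.sin(2 * math.pi * q * t / T)
                  for t, x in enumerate(trajectory))
        mags.append(math.hypot(re, im))
    return mags
```

A feature value that oscillates periodically from frame to frame (e.g. energy pulsing at the beat rate) produces a sharp peak at the corresponding modulation bin.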

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data; the means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the individual classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes, and the optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all the classes, which does not take class-wise differences into account.
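To make the within-class/between-class trade-off concrete, the sketch below computes the classic two-class Fisher discriminant direction w = S_w^{-1}(m0 - m1) for 2-D features; the thesis uses the multi-class formulation, so this is only a minimal illustration of the same criterion.

```python
def fisher_lda_2class(X0, X1):
    """Two-class Fisher LDA in 2-D: w = Sw^{-1}(m0 - m1), where Sw is the
    pooled within-class scatter and m0, m1 are the class means."""
    def mean(X):
        n = len(X)
        return [sum(x[0] for x in X) / n, sum(x[1] for x in X) / n]

    def scatter(X, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in X:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    m0, m1 = mean(X0), mean(X1)
    s0, s1 = scatter(X0, m0), scatter(X1, m1)
    Sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    # Invert the 2x2 within-class scatter matrix directly
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    inv = [[Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det, Sw[0][0] / det]]
    dm = [m0[0] - m1[0], m0[1] - m1[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]
```

When the two classes differ only along one axis, w points along that axis: the projection separates the means while ignoring the direction of pure within-class spread.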

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. Under Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet; under Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree classifier, of a Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and decision tree classifier achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bağci and Erzin [8] constructed a novel frame-based music genre classification system in which invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which cannot be correctly classified, and the GMM model of each music genre is updated with each correctly classified frame. Moreover, a separate GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from the high-dissimilarity LDB nodes. First, they use wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT obtained by recursively convolving the input signal with a pair of low-pass and high-pass filters; unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, a conclusion is given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.3. A detailed description of each module is given below.

2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) and cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form, and they have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1 Pre-emphasis

ŝ[n] = s[n] − a·s[n−1] (1)

where s[n] is the current sample and s[n−1] is the previous sample. A typical value for a is 0.95.

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3 Windowing

Each frame is multiplied by a Hamming window

s̃ᵢ[n] = ŝᵢ[n]·w[n],  0 ≤ n ≤ N−1 (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 − 0.46·cos(2πn/(N−1)),  0 ≤ n ≤ N−1 (3)

Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using FFT

Xᵢ[k] = Σ_{n=0}^{N−1} s̃ᵢ[n]·e^{−j2πnk/N},  0 ≤ k ≤ N−1 (4)

where k is the frequency index.

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set

of Mel-scale band-pass filters

Eᵢ(b) = Σ_{k=I_bl}^{I_bh} Aᵢ[k],  0 ≤ b < B, 0 ≤ k ≤ N−1 (5)

where B is the total number of filters (B is 25 in this study), and I_bl and I_bh denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. Aᵢ[k] is the squared amplitude of Xᵢ[k], that is, Aᵢ[k] = |Xᵢ[k]|².

I_bl and I_bh are given as

I_bl = f_bl / (f_s/N),  I_bh = f_bh / (f_s/N) (6)

where f_s is the sampling frequency, and f_bl and f_bh are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6 Discrete cosine transform (DCT)

MFCC can be obtained by applying DCT on the logarithm of E(b)

MFCCᵢ(l) = Σ_{b=0}^{B−1} log₁₀(1 + Eᵢ(b))·cos((b + 0.5)lπ/B),  0 ≤ l < L (7)

where L is the length of the MFCC feature vector (L is 20 in this study).


Therefore the MFCC feature vector can be represented as follows

xMFCC = [MFCC(0), MFCC(1), …, MFCC(L−1)]T (8)
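To make Steps 1-6 concrete, the per-frame computation can be sketched in NumPy as below. This is an illustrative sketch, not the thesis implementation: the equal-width band edges are a simplifying assumption standing in for the triangular mel-scale filters of Table 2.1.

```python
import numpy as np

def mfcc_frame(frame, n_filters=25, n_ceps=20, a=0.95):
    """Illustrative per-frame MFCC: Steps 1 and 3-6 of Section 2.1.1.
    Band edges are simplified (equal width), not the Table 2.1 mel bands."""
    N = len(frame)
    # Step 1: pre-emphasis, Eq. (1): s^[n] = s[n] - a*s[n-1]
    emph = np.append(frame[0], frame[1:] - a * frame[:-1])
    # Step 3: Hamming window, Eq. (3)
    windowed = emph * np.hamming(N)
    # Step 4: squared magnitude spectrum A[k] = |X[k]|^2, Eq. (4)
    A = np.abs(np.fft.fft(windowed)) ** 2
    # Step 5: band energies E(b), Eq. (5), over placeholder band edges
    edges = np.linspace(0, N // 2, n_filters + 1, dtype=int)
    E = np.array([A[edges[b]:edges[b + 1] + 1].sum() for b in range(n_filters)])
    # Step 6: DCT of log energies, Eq. (7)
    b = np.arange(n_filters)
    return np.array([np.sum(np.log10(1.0 + E) * np.cos((b + 0.5) * l * np.pi / n_filters))
                     for l in range(n_ceps)])
```

With N = 512 and the defaults above, the function returns the L = 20 coefficients of Eq. (8) for one frame.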

Fig. 2.1 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)

Table 2.1 The range of each triangular band-pass filter

Filter number: Frequency interval (Hz)
0: (0, 200]
1: (100, 300]
2: (200, 400]
3: (300, 500]
4: (400, 600]
5: (500, 700]
6: (600, 800]
7: (700, 900]
8: (800, 1000]
9: (900, 1149]
10: (1000, 1320]
11: (1149, 1516]
12: (1320, 1741]
13: (1516, 2000]
14: (1741, 2297]
15: (2000, 2639]
16: (2297, 3031]
17: (2639, 3482]
18: (3031, 4000]
19: (3482, 4595]
20: (4000, 5278]
21: (4595, 6063]
22: (5278, 6964]
23: (6063, 8000]
24: (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2 Octave Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows

Eᵢ(b) = Σ_{k=I_bl}^{I_bh} Aᵢ[k],  0 ≤ b < B, 0 ≤ k ≤ N−1 (9)

where B is the number of subbands, and I_bl and I_bh denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. Aᵢ[k] is the squared amplitude of Xᵢ[k], that is, Aᵢ[k] = |Xᵢ[k]|².

I_bl and I_bh are given as

I_bl = f_bl / (f_s/N),  I_bh = f_bh / (f_s/N) (10)

where f_s is the sampling frequency, and f_bl and f_bh are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows

Peak(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,i} ) (11)

Valley(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b−i+1} ) (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley

SC(b) = Peak(b) − Valley(b) (13)

The feature vector of an audio frame consists of the spectral contrasts and the

spectral valleys of all subbands Thus the OSC feature vector of an audio frame can

be represented as follows

xOSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]T (14)
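The per-subband computation of Eqs. (11)-(13) can be sketched as below. This is a rough illustration under the assumption that the subband magnitudes are strictly positive (so the logarithm is defined); it is not the thesis implementation.

```python
import numpy as np

def osc_subband(mag, alpha=0.2):
    """Illustrative spectral peak, valley and contrast (Eqs. 11-13)
    for the magnitude bins `mag` of one subband."""
    m = np.sort(mag)[::-1]              # M_{b,1} >= M_{b,2} >= ... >= M_{b,Nb}
    k = max(1, int(round(alpha * len(m))))  # neighborhood size alpha * Nb
    peak = np.log(m[:k].mean())         # Eq. (11): log of mean of largest bins
    valley = np.log(m[-k:].mean())      # Eq. (12): log of mean of smallest bins
    return peak, valley, peak - valley  # Eq. (13): spectral contrast
```

Averaging over a small neighborhood rather than taking the single extreme bin makes the peak/valley estimates robust to isolated noisy bins.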

Fig. 2.2 The flowchart for computing OSC (Input Signal → Framing → FFT → Octave-scale filtering → Peak/Valley selection → Spectral contrast → OSC)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number: Frequency interval (Hz)
0: [0, 0]
1: (0, 100]
2: (100, 200]
3: (200, 400]
4: (400, 800]
5: (800, 1600]
6: (1600, 3200]
7: (3200, 6400]
8: (6400, 12800]
9: (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification The NASE descriptor

provides a representation of the power spectrum of each audio frame Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband Fig 23 shows the block diagram for extracting the

NASE feature For a given music piece the main steps for computing NASE are

described as follows

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames and each audio frame is multiplied by a Hamming window function

and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N,

where N is the size of FFT The power spectrum is defined as the normalized

squared magnitude of the DFT spectrum X(k)

P(k) = (1/(N·E_w))·|X(k)|²,  k = 0 or k = N/2
P(k) = (2/(N·E_w))·|X(k)|²,  0 < k < N/2 (15)

where Ew is the energy of the Hamming window function w(n) of size Nw

E_w = Σ_{n=0}^{N_w−1} |w(n)|² (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE scale filtering operation can be described as follows (see Table 2.3)

ASEᵢ(b) = Σ_{k=I_bl}^{I_bh} P(k),  0 ≤ b < B, 0 ≤ k ≤ N−1 (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study)

r = 2^j octaves,  −4 ≤ j ≤ 3 (18)

I_bl and I_bh are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_bl = f_bl / (f_s/N),  I_bh = f_bh / (f_s/N) (19)

where f_s is the sampling frequency, and f_bl and f_bh are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband

ASE(b) = Σ_{k=I_bl}^{I_bh} P(k),  0 ≤ b ≤ B+1 (20)

Each ASE coefficient is then converted to the decibel scale

ASE_dB(b) = 10·log₁₀(ASE(b)),  0 ≤ b ≤ B+1 (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R

NASE(b) = ASE_dB(b) / R,  0 ≤ b ≤ B+1 (22)

where the RMS-norm gain value R is defined as

R = sqrt( Σ_{b=0}^{B+1} (ASE_dB(b))² ) (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension of NASE is B+3. Thus the NASE feature vector of an audio frame can be represented as follows

xNASE = [R, NASE(0), NASE(1), …, NASE(B+1)]T (24)
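Given the per-subband power sums of Eq. (20), the decibel conversion and RMS normalization of Eqs. (21)-(24) can be sketched as below; this is an illustration of the normalization step only, with the subband power sums assumed to be already computed and strictly positive.

```python
import numpy as np

def nase(power_bands):
    """Illustrative Eqs. (21)-(24): decibel-scale ASE coefficients
    normalized by the RMS gain R; `power_bands` holds the B+2 subband sums."""
    ase_db = 10.0 * np.log10(power_bands)   # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))        # Eq. (23): RMS-norm gain value
    nase_vec = ase_db / R                   # Eq. (22)
    return np.concatenate(([R], nase_vec))  # Eq. (24): [R, NASE(0..B+1)]
```

By construction the normalized part of the vector has unit energy, so R alone carries the overall level of the frame.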

Fig. 2.3 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband decomposition → Normalized Audio Spectral Envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; 1 coefficient below loEdge, 16 band coefficients, 1 coefficient above hiEdge)

Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number: Frequency interval (Hz)
0: (0, 62]
1: (62, 88]
2: (88, 125]
3: (125, 176]
4: (176, 250]
5: (250, 353]
6: (353, 500]
7: (500, 707]
8: (707, 1000]
9: (1000, 1414]
10: (1414, 2000]
11: (2000, 2828]
12: (2828, 4000]
13: (4000, 5656]
14: (5656, 8000]
15: (8000, 11313]
16: (11313, 16000]
17: (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on the MFCC, OSC and NASE trajectories to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let MFCCᵢ[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W

M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{t×(W/2)+n}[l]·e^{−j2πnm/W},  0 ≤ m < W, 0 ≤ l < L (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows

M̄^MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|,  0 ≤ m < W, 0 ≤ l < L (26)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated

MSP^MFCC(j, l) = max_{Φ_jl ≤ m < Φ_jh} M̄^MFCC(m, l) (27)

MSV^MFCC(j, l) = min_{Φ_jl ≤ m < Φ_jh} M̄^MFCC(m, l) (28)

where Φ_jl and Φ_jh are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^MFCC(j, l) = MSP^MFCC(j, l) − MSV^MFCC(j, l) (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below

27

Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSCᵢ[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W

M_t(m, d) = Σ_{n=0}^{W−1} OSC_{t×(W/2)+n}[d]·e^{−j2πnm/W},  0 ≤ m < W, 0 ≤ d < D (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows

M̄^OSC(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W, 0 ≤ d < D (31)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated

MSP^OSC(j, d) = max_{Φ_jl ≤ m < Φ_jh} M̄^OSC(m, d) (32)

MSV^OSC(j, d) = min_{Φ_jl ≤ m < Φ_jh} M̄^OSC(m, d) (33)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d) (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASEᵢ[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W

M_t(m, d) = Σ_{n=0}^{W−1} NASE_{t×(W/2)+n}[d]·e^{−j2πnm/W},  0 ≤ m < W, 0 ≤ d < D (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows

M̄^NASE(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W, 0 ≤ d < D (36)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated

MSP^NASE(j, d) = max_{Φ_jl ≤ m < Φ_jh} M̄^NASE(m, d) (37)

MSV^NASE(j, d) = min_{Φ_jl ≤ m < Φ_jh} M̄^NASE(m, d) (38)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d) (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.

Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowing/average of the modulation spectrum → contrast/valley determination → MASE)

Table 2.4 Frequency interval of each modulation subband

Filter number: Modulation frequency index range / Modulation frequency interval (Hz)
0: [0, 2) / [0, 0.33)
1: [2, 4) / [0.33, 0.66)
2: [4, 8) / [0.66, 1.32)
3: [8, 16) / [1.32, 2.64)
4: [16, 32) / [2.64, 5.28)
5: [32, 64) / [5.28, 10.56)
6: [64, 128) / [10.56, 21.12)
7: [128, 256) / [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband of different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows

μ_MSC,row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSC^MFCC(j, l) (40)

σ_MSC,row^MFCC(l) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC^MFCC(j, l) − μ_MSC,row^MFCC(l))² ) (41)

μ_MSV,row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSV^MFCC(j, l) (42)

σ_MSV,row^MFCC(l) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV^MFCC(j, l) − μ_MSV,row^MFCC(l))² ) (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_row^MFCC = [μ_MSC,row^MFCC(0), σ_MSC,row^MFCC(0), μ_MSV,row^MFCC(0), σ_MSV,row^MFCC(0), …, μ_MSC,row^MFCC(L−1), σ_MSC,row^MFCC(L−1), μ_MSV,row^MFCC(L−1), σ_MSV,row^MFCC(L−1)]T (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows

μ_MSC,col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSC^MFCC(j, l) (45)

σ_MSC,col^MFCC(j) = sqrt( (1/L) Σ_{l=0}^{L−1} (MSC^MFCC(j, l) − μ_MSC,col^MFCC(j))² ) (46)

μ_MSV,col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSV^MFCC(j, l) (47)

σ_MSV,col^MFCC(j) = sqrt( (1/L) Σ_{l=0}^{L−1} (MSV^MFCC(j, l) − μ_MSV,col^MFCC(j))² ) (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^MFCC = [μ_MSC,col^MFCC(0), σ_MSC,col^MFCC(0), μ_MSV,col^MFCC(0), σ_MSV,col^MFCC(0), …, μ_MSC,col^MFCC(J−1), σ_MSC,col^MFCC(J−1), μ_MSV,col^MFCC(J−1), σ_MSV,col^MFCC(J−1)]T (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained

f^MFCC = [(f_row^MFCC)T, (f_col^MFCC)T]T (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
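The row- and column-wise aggregation of Eqs. (40)-(50) can be sketched as below. This is an illustrative sketch, assuming the MSC and MSV matrices are stored with the feature dimension along the rows (an L-by-J layout); the exact ordering of entries in the thesis' feature vector may differ.

```python
import numpy as np

def aggregate(msc, msv):
    """Illustrative Eqs. (40)-(50): mean and standard deviation along each
    row and each column of the L-by-J MSC and MSV matrices, concatenated."""
    row = np.concatenate([m.mean(axis=1) for m in (msc, msv)] +
                         [m.std(axis=1) for m in (msc, msv)])   # size 4L
    col = np.concatenate([m.mean(axis=0) for m in (msc, msv)] +
                         [m.std(axis=0) for m in (msc, msv)])   # size 4J
    return np.concatenate([row, col])                           # size 4L + 4J
```

With L = 20 and J = 8, this yields the 112-dimensional SMMFCC vector described above.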

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows

μ_MSC,row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSC^OSC(j, d) (51)

σ_MSC,row^OSC(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC^OSC(j, d) − μ_MSC,row^OSC(d))² ) (52)

μ_MSV,row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSV^OSC(j, d) (53)

σ_MSV,row^OSC(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV^OSC(j, d) − μ_MSV,row^OSC(d))² ) (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_row^OSC = [μ_MSC,row^OSC(0), σ_MSC,row^OSC(0), μ_MSV,row^OSC(0), σ_MSV,row^OSC(0), …, μ_MSC,row^OSC(D−1), σ_MSC,row^OSC(D−1), μ_MSV,row^OSC(D−1), σ_MSV,row^OSC(D−1)]T (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows

μ_MSC,col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSC^OSC(j, d) (56)

σ_MSC,col^OSC(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSC^OSC(j, d) − μ_MSC,col^OSC(j))² ) (57)

μ_MSV,col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSV^OSC(j, d) (58)

σ_MSV,col^OSC(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSV^OSC(j, d) − μ_MSV,col^OSC(j))² ) (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^OSC = [μ_MSC,col^OSC(0), σ_MSC,col^OSC(0), μ_MSV,col^OSC(0), σ_MSV,col^OSC(0), …, μ_MSC,col^OSC(J−1), σ_MSC,col^OSC(J−1), μ_MSV,col^OSC(J−1), σ_MSV,col^OSC(J−1)]T (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained

f^OSC = [(f_row^OSC)T, (f_col^OSC)T]T (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows

μ_MSC,row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSC^NASE(j, d) (62)

σ_MSC,row^NASE(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC^NASE(j, d) − μ_MSC,row^NASE(d))² ) (63)

μ_MSV,row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSV^NASE(j, d) (64)

σ_MSV,row^NASE(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV^NASE(j, d) − μ_MSV,row^NASE(d))² ) (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_row^NASE = [μ_MSC,row^NASE(0), σ_MSC,row^NASE(0), μ_MSV,row^NASE(0), σ_MSV,row^NASE(0), …, μ_MSC,row^NASE(D−1), σ_MSC,row^NASE(D−1), μ_MSV,row^NASE(D−1), σ_MSV,row^NASE(D−1)]T (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows

μ_MSC,col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSC^NASE(j, d) (67)

σ_MSC,col^NASE(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSC^NASE(j, d) − μ_MSC,col^NASE(j))² ) (68)

μ_MSV,col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSV^NASE(j, d) (69)

σ_MSV,col^NASE(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSV^NASE(j, d) − μ_MSV,col^NASE(j))² ) (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^NASE = [μ_MSC,col^NASE(0), σ_MSC,col^NASE(0), μ_MSV,col^NASE(0), σ_MSV,col^NASE(0), …, μ_MSC,col^NASE(J−1), σ_MSC,col^NASE(J−1), μ_MSV,col^NASE(J−1), σ_MSV,col^NASE(J−1)]T (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained

f^NASE = [(f_row^NASE)T, (f_col^NASE)T]T (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 2.8 The row-based modulation spectral feature values: the mean and standard deviation are computed along each row of the MSC and MSV matrices (fixed feature dimension, across modulation frequencies)

Fig. 2.9 The column-based modulation spectral feature values: the mean and standard deviation are computed along each column of the MSC and MSV matrices (fixed modulation subband, across feature dimensions)

2.1.6 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n} (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector f̂_c

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)),  1 ≤ c ≤ C (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m) (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)T (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors

labeled as class c The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)T (77)

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) ) (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space. In this study, a whitening procedure is integrated with the LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the

corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{−1/2}

x_w = (ΦΛ^{−1/2})T x (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{−1/2})T S_W (ΦΛ^{−1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{−1/2})T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ (80)

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x (81)
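The whitened LDA procedure of Eqs. (76)-(81) can be sketched with NumPy's symmetric eigendecomposition as below. This is an illustrative sketch only: it assumes S_W is well conditioned (no regularization of near-zero eigenvalues is shown).

```python
import numpy as np

def whitened_lda(X, y, C):
    """Illustrative Eqs. (76)-(81): scatter matrices, whitening of Sw,
    and projection onto the top C-1 discriminant directions."""
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H)); Sb = np.zeros((H, H))
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                            # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)   # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)            # Sw = Phi Lam Phi^T
    W = Phi @ np.diag(lam ** -0.5)           # whitening matrix Phi Lam^(-1/2)
    Sb_w = W.T @ Sb @ W                      # whitened between-class scatter
    ev, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(ev)[::-1][:C - 1]]  # top C-1 eigenvectors
    A = W @ Psi                              # Eq. (80): A_WLDA
    return X @ A                             # Eq. (81) applied to every row
```

With C = 6 genres, each 112- or 108-dimensional feature vector is thus reduced to a 5-dimensional discriminant vector.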

2.3 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for

music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n} (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to y

s = argmin_{1≤c≤C} d(y, ȳ_c) (83)

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 In this

study each MP3 audio file is first converted into raw digital audio before

classification These music tracks are classified into six classes (that is C = 6)

Classical Electronic JazzBlue MetalPunk RockPop and World In summary the


music tracks used for training/testing include 320/320 tracks of Classical, 115/114

tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102

tracks of Rock/Pop, and 122/122 tracks of the World music genre

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

CA = Σ_{1≤c≤C} P_c · CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the

classification accuracy for the c-th music genre
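Equation (84) can be computed directly from a confusion matrix laid out like the tables in this chapter (columns are true classes). The helper name and the toy matrix below are illustrative; note that when P_c is estimated from the same test-set counts, the weighted sum equals the plain fraction of correctly classified tracks:

```python
import numpy as np

def overall_accuracy(conf):
    """Overall accuracy of Eq. (84): per-class accuracies CA_c weighted
    by class appearance probabilities P_c taken from test-set counts.

    conf[i, j] = number of tracks of true class j classified as class i."""
    totals = conf.sum(axis=0)            # tracks per true class
    priors = totals / totals.sum()       # P_c
    per_class = np.diag(conf) / totals   # CA_c
    return float(np.sum(priors * per_class))

# Toy 2-class confusion matrix: 90/100 and 40/50 tracks correct.
conf = np.array([[90, 10],
                 [10, 40]])
print(round(overall_accuracy(conf), 4))  # 0.8667
```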

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based

modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1

denote respectively the row-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,

and the combined feature vector performs the best. Table 3.2 shows the corresponding

confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64


Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature vector, the first matrix gives track counts and the second gives percentages of each true genre (columns: true genre; rows: classified genre).

(a) SMMFCC1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Pop/Rock          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Pop/Rock          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Pop/Rock          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Pop/Rock          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based

modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2

denote respectively the column-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see

that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2,

which differs from the row-based case. As with the row-based features, the combined

feature vector again achieves the best performance. Table 3.4 shows the corresponding

confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for each column-based modulation spectral feature vector

Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature vector, the first matrix gives track counts and the second gives percentages of each true genre (columns: true genre; rows: classified genre).

(a) SMMFCC2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1     1           0         6     22
Electronic        0          84     0           2         8      4
Jazz             13           1    19           1         2     19
Metal/Punk        2           7     0          39        30      4
Pop/Rock          0          11     3           3        47     19
World            33          10     3           0         9     54
Total           320         114    26          45       102    122

(a) SMMFCC2, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2     0           0         3     33
Electronic        0          83     0           1         9      6
Jazz             17           1    20           0         6     20
Metal/Punk        1           5     0          33        21      2
Pop/Rock          0          17     4          10        51     10
World            40           6     2           1        12     51
Total           320         114    26          45       102    122

(b) SMOSC2, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0     0           0         2     29
Electronic        0          83     0           1         5      2
Jazz              9           3    17           1         2     15
Metal/Punk        1           5     1          35        24      7
Pop/Rock          2          13     1           8        57     15
World            31          10     7           0        12     54
Total           320         114    26          45       102    122

(c) SMASE2, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5     0           0         3     18
Electronic        0          89     0           2         4      4
Jazz              2           3    19           0         1     10
Metal/Punk        2           2     0          38        21      2
Pop/Rock          0          12     5           4        61     11
World            27           3     2           1        12     77
Total           320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of

row-based and column-based modulation spectral feature vectors. SMMFCC3,

SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC,

OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that

each combined feature vector achieves better classification performance than its

individual row-based or column-based counterpart. In particular, the proposed

method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of

85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature vector, the first matrix gives track counts and the second gives percentages of each true genre (columns: true genre; rows: classified genre).

(a) SMMFCC3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     1           0         3     19
Electronic        0          86     0           1         7      5
Jazz              2           0    18           0         0      3
Metal/Punk        1           4     0          35        18      2
Pop/Rock          1          16     4           8        67     13
World            16           6     3           1         7     80
Total           320         114    26          45       102    122

(a) SMMFCC3, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     0           0         1     13
Electronic        0          90     1           2         9      6
Jazz              0           0    21           0         0      4
Metal/Punk        0           2     0          31        21      2
Pop/Rock          0          11     3          10        64     10
World            20          11     1           2         7     87
Total           320         114    26          45       102    122

(b) SMOSC3, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2     1           0         0     17
Electronic        1          91     0           1         4      3
Jazz              0           2    19           0         0      5
Metal/Punk        0           2     1          34        20      8
Pop/Rock          2          13     4           8        71      8
World            21           4     1           2         7     81
Total           320         114    26          45       102    122

(c) SMASE3, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3, counts
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     0           0         0      8
Electronic        2          95     0           2         7      9
Jazz              1           1    20           0         0      0
Metal/Punk        0           0     0          35        10      1
Pop/Rock          1          10     3           7        79     11
World            16           6     3           1         6     93
Total           320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3, percentages
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the

feature value. However, we use the modulation spectral contrasts (MSCs) and

modulation spectral valleys (MSVs) computed from each modulation subband as

the feature values. Table 3.7 shows the classification results of these two

approaches. From Table 3.7 we can see that using MSCs and MSVs gives

better performance than the conventional method when the row-based and

column-based modulation spectral feature vectors are combined. In this table,

SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based,

column-based, and combined feature vectors derived from modulation spectral

analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as feature values

Feature Set                     MSCs & MSVs    MSE
SMMFCC1                               77.50  72.02
SMMFCC2                               70.64  69.82
SMMFCC3                               80.38  79.15
SMOSC1                                79.15  77.50
SMOSC2                                68.59  70.51
SMOSC3                                81.34  80.11
SMASE1                                77.78  76.41
SMASE2                                71.74  71.06
SMASE3                                81.21  79.15
SMMFCC1+SMOSC1+SMASE1                 84.64  85.08
SMMFCC2+SMOSC2+SMASE2                 78.60  79.01
SMMFCC3+SMOSC3+SMASE3                 85.32  85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectral/cepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest, where all music tracks are classified

into six classes, was used for performance comparison. When the modulation spectral

features of MFCC, OSC, and NASE are combined together, the classification

accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004

Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proc. of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proc. of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proc. of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proc. of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proc. of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proc. of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.

[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Morris, J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Trans. on Speech and Audio Processing 13 (2) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histograms in audio and symbolic music information retrieval," Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," Proc. 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," Proc. 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proc. of the Workshop on Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.



CONTENTS

ABSTRACT
CONTENTS
CHAPTER 1  INTRODUCTION
    1.1 Motivation
    1.2 Review of music genre classification system
        1.2.1 Feature Extraction
            1.2.1.1 Short-term features
                1.2.1.1.1 Timbral features
                1.2.1.1.2 Rhythmic features
                1.2.1.1.3 Pitch features
            1.2.1.2 Long-term features
                1.2.1.2.1 Mean and standard deviation
                1.2.1.2.2 Autoregressive model
                1.2.1.2.3 Modulation spectrum analysis
        1.2.2 Linear discriminant analysis (LDA)
        1.2.3 Feature Classifier
    1.3 Outline of Thesis
CHAPTER 2  THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM
    2.1 Feature Extraction
        2.1.1 Mel-Frequency Cepstral Coefficient (MFCC)
        2.1.2 Octave-based Spectral Contrast (OSC)
        2.1.3 Normalized Audio Spectral Envelope (NASE)
        2.1.4 Modulation Spectral Analysis
            2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)
            2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)
            2.1.4.3 Modulation Spectral Contrast of NASE (MASE)
        2.1.5 Statistical Aggregation of Modulation Spectral Feature Values
            2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)
            2.1.5.2 Statistical Aggregation of MOSC (SMOSC)
            2.1.5.3 Statistical Aggregation of MASE (SMASE)
        2.1.6 Feature vector normalization
    2.2 Linear discriminant analysis
    2.3 Music Genre Classification phase
CHAPTER 3  EXPERIMENT RESULTS
    3.1 Comparison of row-based modulation spectral feature vector
    3.2 Comparison of column-based modulation spectral feature vector
    3.3 Combination of row-based and column-based modulation spectral feature vectors
CHAPTER 4  CONCLUSION
REFERENCES


Chapter 1

Introduction

1.1 Motivation

With the development of computer networks it becomes more and more popular

to purchase and download digital music from the Internet However a general music

database often contains millions of music tracks Hence it is very difficult to manage

such a large digital music database For this reason, it is helpful to manage a vast

amount of music tracks when they are properly categorized in advance In general the

retail or online music stores often organize their collections of music tracks by

categories such as genre artist and album Usually the category information of a

music track is manually labeled by experienced managers However, determining the

music genre of a music track manually is a laborious and

time-consuming task Therefore a number of supervised classification techniques

have been developed for automatic classification of unlabeled music tracks

[1-11] Thus, this study focuses on the music genre classification problem, which is

defined as labeling music tracks with their genres Automatic music genre

classification therefore plays an important and preliminary role in music information

retrieval systems: a new album or music track can be assigned to a proper genre in

order to place it in the appropriate section of an online music store or music database

To classify the music genre of a given music track some discriminating audio

features have to be extracted through content-based analysis of the music signal In

addition many studies try to examine a set of classifiers to improve the classification

performance However, the resulting improvement is often limited In fact, employing

an effective feature set has a much greater effect on the classification accuracy than


selecting a specific classifier [12] In the study a novel feature set derived from the

row-based and the column-based modulation spectrum analysis will be proposed for

automatic music genre classification

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine

the structure of the taxonomy that music pieces will be classified into However it is

hard to clearly define a universally agreed structure In general exploiting

hierarchical taxonomy structure for music genre classification has some merits (1)

People often prefer to search music by browsing the hierarchical catalogs (2)

Taxonomy structures identify the relationships or dependence between the music

genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification

approach to improve the classification efficiency and accuracy (3) The classification

errors become more acceptable by using taxonomy than direct music genre

classification The coarse-to-fine approach can make the classification errors

concentrate on a given level of the hierarchy

Burred and Lerch [13] have developed a hierarchical taxonomy for music genre

classification as shown in Fig. 1.1 Rather than making a single decision to classify a

given music into one of all music genres (direct approach) the hierarchical approach

makes successive decisions at each branch point of the taxonomy hierarchy

Additionally appropriate and variant features can be employed at each branch point

of the taxonomy Therefore the hierarchical classification approach allows the

managers to trace at which level the classification errors occur frequently Barbedo

and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig. 1.2 The

hierarchical structure was constructed in a bottom-up manner instead of the

top-down manner This is because it is easy to merge leaf classes into the same

parent class in the bottom-up structure, so the upper layers can be easily

constructed In their experimental results, the classification accuracy of the

hierarchical bottom-up approach outperforms the top-down approach by about

3% to 5%

Li and Ogihara [15] investigated the effect of two different taxonomy structures

for music genre classification They also proposed an approach to automatic

generation of music genre taxonomies based on the confusion matrix computed by

linear discriminant projection This approach can reduce the time-consuming and

expensive task of manually constructing taxonomies It is also helpful for music

collections in which there are no natural taxonomies [16] According to a given genre

taxonomy many different approaches have been proposed to classify the music genre

for raw music tracks In general a music genre classification system consists of three

major aspects: feature extraction, feature selection, and feature classification Fig. 1.3

shows the block diagram of a music genre classification system

Fig. 1.1 A hierarchical audio taxonomy


Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system


1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which

features are relevant and how to extract them Tzanetakis and Cook [1] employed

three feature sets including timbral texture rhythmic content and pitch content to

classify audio collections by their musical genres

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to

instrumentations or sound sources, such as music, speech, or environmental signals The

features used to represent timbral texture are described as follows:

(1) Low-energy feature: it is defined as the percentage of analysis windows that

have RMS energy less than the average RMS energy across the texture window The

size of texture window should correspond to the minimum amount of time required to

identify a particular music texture

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal It

is defined as

ZCR_t = (1/2) Σ_{n=1}^{N−1} | sign(x_t[n]) − sign(x_t[n−1]) |

where the sign function returns 1 for positive input and 0 for negative input, and

x_t[n] is the time-domain signal of frame t
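A minimal sketch of the per-frame ZCR computation, assuming the binary sign convention stated above (1 for non-negative samples, 0 otherwise):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Per-frame ZCR: half the number of sign changes between
    consecutive samples, with sign(x) = 1 if x >= 0 else 0."""
    signs = np.where(frame >= 0, 1, 0)
    return 0.5 * np.sum(np.abs(np.diff(signs)))

# A frame that alternates in sign changes sign at every step.
frame = np.array([1.0, -1.0, 1.0, -1.0])
print(zero_crossing_rate(frame))  # 1.5 (three sign changes, halved)
```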

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the

magnitude spectrum

C_t = ( Σ_{n=1}^{N} n · M_t[n] ) / ( Σ_{n=1}^{N} M_t[n] )

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the

magnitude of the n-th frequency bin of the t-th frame

(4) Spectral Bandwidth: the spectral bandwidth determines the frequency bandwidth of

the signal

SB_t = [ ( Σ_{n=1}^{N} (n − C_t)^2 · M_t[n] ) / ( Σ_{n=1}^{N} M_t[n] ) ]^{1/2}
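The centroid and bandwidth definitions above can be sketched together; the magnitude spectrum below is a toy vector, and bins are indexed 1..N as in the formulas:

```python
import numpy as np

def spectral_centroid_bandwidth(mag):
    """Centroid C_t and bandwidth SB_t of one magnitude spectrum M_t[n],
    with bins indexed n = 1..N as in the formulas above."""
    n = np.arange(1, len(mag) + 1)
    centroid = np.sum(n * mag) / np.sum(mag)
    bandwidth = np.sqrt(np.sum((n - centroid) ** 2 * mag) / np.sum(mag))
    return centroid, bandwidth

# All energy in bin 3: the centroid sits on that bin and the bandwidth is zero.
c, bw = spectral_centroid_bandwidth(np.array([0.0, 0.0, 1.0, 0.0]))
print(c, bw)  # 3.0 0.0
```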

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape It is defined as

the frequency R_t below which 85% of the magnitude distribution is concentrated:

Σ_{k=0}^{R_t} S[k] ≤ 0.85 × Σ_{k=0}^{N−1} S[k]
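A sketch of the roll-off computation, taking R_t as the largest bin index whose cumulative magnitude stays within 85% of the total (one common reading of the definition above; implementations differ on the boundary convention):

```python
import numpy as np

def spectral_rolloff(mag, ratio=0.85):
    """Largest bin index R such that sum(mag[0..R]) <= ratio * sum(mag)."""
    cumulative = np.cumsum(mag)
    below = np.nonzero(cumulative <= ratio * cumulative[-1])[0]
    return int(below[-1]) if below.size else 0

mag = np.array([1.0, 1.0, 1.0, 1.0])  # flat toy spectrum: 85% of the total is 3.4
print(spectral_rolloff(mag))  # 2 -> bins 0..2 hold 3.0 <= 3.4, bins 0..3 hold 4.0
```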

(6) Spectral Flux: the spectral flux measures the amount of local spectral change It

is defined as the squared difference between the normalized magnitudes of successive

spectral distributions

SF_t = Σ_{k=0}^{N−1} ( N_t[k] − N_{t−1}[k] )^2

where N_t[k] and N_{t−1}[k] are the normalized magnitude spectra of the t-th frame and

the (t−1)-th frame respectively
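A sketch of the flux computation; since the normalization of the magnitude spectra is not fixed above, sum-to-one normalization is assumed here:

```python
import numpy as np

def spectral_flux(curr, prev):
    """Squared difference between successive normalized magnitude spectra
    (sum-to-one normalization assumed for illustration)."""
    n_curr = curr / np.sum(curr)
    n_prev = prev / np.sum(prev)
    return float(np.sum((n_curr - n_prev) ** 2))

# The whole distribution moves from bin 0 to bin 1 -> maximal local change.
print(spectral_flux(np.array([0.0, 1.0]), np.array([1.0, 0.0])))  # 2.0
```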

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech

recognition due to their ability to represent the speech spectrum in a compact form In

human auditory system the perceived pitch is not linear with respect to the physical

frequency of the corresponding tone The mapping between the physical frequency

scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz

and logarithmic at higher frequencies In fact MFCC have been proven to be very

effective in automatic speech recognition and in modeling the subjective frequency

content of audio signals


(8) Octave-based spectral contrast (OSC): OSC was developed to represent the

spectral characteristics of a music piece [3] This feature describes the strength of

spectral peaks and spectral valleys in each sub-band separately It can roughly reflect

the distribution of harmonic and non-harmonic components

(9) Normalized audio spectral envelope (NASE): NASE is defined in the MPEG-7

standard [17] First, the audio spectral envelope (ASE) is obtained from the sum of the

log power spectrum in each logarithmic subband Then each ASE coefficient is

normalized with the Root Mean Square (RMS) energy, yielding a normalized version

of the ASE called NASE
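The normalization step can be sketched as below. This shows only the RMS normalization; the subband log-power analysis that yields the ASE vector is assumed to be done upstream, and normalizing by the RMS of the ASE vector itself is a simplification of the full MPEG-7 definition:

```python
import numpy as np

def nase(ase):
    """Normalize an ASE vector by its RMS value (simplified sketch; the
    subband log-power analysis producing ASE is assumed done upstream)."""
    rms = np.sqrt(np.mean(ase ** 2))
    return ase / rms

ase = np.array([3.0, 4.0])  # toy ASE vector, illustrative values only
print(nase(ase))            # rescaled so that the vector's RMS equals 1
```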

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly

derived from the beat histogram, including the overall beat strength, the main beat and

its strength, the periods of the main beat and subbeats, and the relative strength of subbeats

to the main beat Many beat-tracking algorithms [18, 19] providing an estimate of the

main beat and the corresponding strength have been proposed

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a

music piece. The extracted pitch features contain frequency, pitch strength, and pitch

interval. The pitch histogram can be estimated by multiple-pitch detection techniques

[21, 22]. Melody and harmony have been widely used by musicologists to study

musical structures. Scaringella et al. [23] proposed a method to extract melody and

harmony features by characterizing the pitch distribution of a short segment, like most

melody/harmony analyzers. The main difference is that no fundamental frequency,

chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find the representative feature vector of a whole music piece, the methods

employed to integrate the short-term features into a long-term feature include the mean

and standard deviation, the autoregressive model [9], modulation spectrum analysis [24,

25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most commonly used method to integrate

the short-term features. Let x_i = [x_i[0], x_i[1], …, x_i[D−1]]^T denote the representative

D-dimensional feature vector of the i-th frame. The mean and standard deviation are

calculated as follows:

μ[d] = (1/T) Σ_{i=0}^{T−1} x_i[d],  0 ≤ d < D

σ[d] = [ (1/T) Σ_{i=0}^{T−1} ( x_i[d] − μ[d] )² ]^{1/2},  0 ≤ d < D

where T is the number of frames of the input signal. This statistical method exhibits

no information about the relationship between features, nor about the time-varying

behavior of music signals.
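As a minimal illustration (a sketch assuming the per-frame features are stacked in a NumPy array of shape (T, D)), this aggregation can be written as:

```python
import numpy as np

def mean_std_aggregate(frames):
    """Integrate a (T, D) matrix of short-term feature vectors into a
    single 2D-dimensional long-term vector [mu; sigma]."""
    mu = frames.mean(axis=0)     # mu[d] = (1/T) * sum_i x_i[d]
    sigma = frames.std(axis=0)   # population standard deviation (1/T)
    return np.concatenate([mu, sigma])

# toy example: 100 frames of 13-dimensional features
long_term = mean_std_aggregate(np.random.rand(100, 13))
```

Note that, as the text points out, this summary discards the temporal ordering of the frames entirely.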

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used an AR model to analyze the time-varying texture of music

signals. They proposed diagonal autoregressive (DAR) and multivariate

autoregressive (MAR) analysis to integrate the short-term features. In DAR, each

short-term feature is independently modeled by an AR model. The extracted feature

vector includes the mean and variance of all short-term feature vectors as well as the

coefficients of each AR model. In MAR, all short-term features are modeled by a

MAR model. The difference between the MAR model and the AR model is that MAR

considers the relationship between features. The features used in MAR include the

mean vector, the covariance matrix of all short-term feature vectors, and the

coefficients of the MAR model. In addition, for a p-order MAR model, the feature

dimension is p × D × D, where D is the dimension of a short-term feature

vector.

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of

signals along the time axis. Kingsbury et al. [24] first employed the modulation

spectrogram for speech recognition. It has been shown that the modulation frequency

to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used

modulation spectrum analysis for music content identification. They showed that

modulation-scale features along with subband normalization are insensitive to

convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the

long-term characteristics of music signals in order to extract the tempo feature for

music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Non-linear analysis of time series offers an alternative way to describe temporal

structure, complementary to the analysis of linear correlation and spectral

properties. Mierswa and Morik [27] used the reconstructed phase space to extract

features directly from the audio data. The means and standard deviations of the

distances and angles in the phase space, with an embedding dimension of two and unit

time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional

feature vector space. LDA deals with discrimination between classes rather than

representations of the various classes. The goal of LDA is to minimize the within-class

distance while maximizing the between-class distance. In LDA, an optimal

transformation matrix from an n-dimensional feature space to a d-dimensional space is

determined, where d ≤ n. The transformation should enhance the separability among

different classes. The optimal transformation matrix can be exploited to map each

n-dimensional feature vector into a d-dimensional vector. The detailed steps will be

described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a

music signal is too complex to be modeled by a single Gaussian distribution. In

addition, the same transformation matrix of LDA is used for all the classes, which

does not consider class-wise differences.
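As a rough sketch of the idea (not the exact procedure of Chapter 2), a Fisher-style LDA projection can be computed from the within-class and between-class scatter matrices; the function name and toy data below are illustrative only:

```python
import numpy as np

def lda_projection(X, y, d):
    """Fisher LDA sketch: return an (n, d) matrix W whose columns are
    the top-d eigenvectors of pinv(Sw) @ Sb, maximizing between-class
    scatter relative to within-class scatter."""
    n = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((n, n))   # within-class scatter
    Sb = np.zeros((n, n))   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:d]
    return eigvecs[:, order].real
```

Projecting with `X @ W` then maps each n-dimensional feature vector into the d-dimensional discriminant space, as described above.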

1.2.3 Feature Classifiers

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch

features with a GMM classifier in their music genre classification system. The

hierarchical genres adopted in their system are Classical, Country,

Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the

sub-genres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the

sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. The

experimental results show that a GMM with three components achieves the best

classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre

classification system in which a majority vote is taken to decide

the final classification. The genres adopted in their system are

Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC

and OSC as features and compare the performance with/without a decision tree

classifier for a Gaussian classifier, a GMM with three components, and LDA. In their

experiments, the feature vector with the GMM classifier and decision tree classifier

achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music.

The SVM learning algorithm is applied to obtain the classification parameters

according to the calculated features. It is demonstrated that SVM achieves better

performance than traditional Euclidean distance methods and hidden Markov model

(HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid,

bandwidth, etc.) and LDA for music genre classification. In their system, the

classification accuracy is 93.0% for the classification of five music genres: Rock,

Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification

system. In their system, some invalid frames are first detected and

discarded for classification purposes. To determine whether a frame is valid or not, a

GMM model is constructed for each music genre. These GMM models are then used

to sift out the frames which cannot be correctly classified, and the GMM model of

a music genre is updated for each correctly classified frame. Moreover, a GMM

model is employed to represent the invalid frames. In their experiments, the feature

vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral

roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order

derivatives of these timbral features. Their musical genre dataset includes ten genres:

Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock.

The classification accuracy can reach 88.60% when the frame length is 30 s and each

GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure

the dissimilarity of the LDB nodes of any two classes and to extract features from the

high-dissimilarity LDB nodes. First, they use wavelet packet tree decomposition

to construct a five-level tree for a music signal. Then two novel features, the energy

distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure

the dissimilarity of the LDB nodes of any two classes. In their classification system,

the feature dimension is 30, including the energies and variances of the basis vector

coefficients of the first 15 high-dissimilarity nodes. The experimental results show that

when the LDB feature vector is combined with MFCC and LDA analysis is used,

the average classification accuracy is 91% for the first level (artificial and natural

sounds), 99% for the second level (instrumental and automobile, human and

nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and

helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet

transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a

well-known signal analysis methodology able to approximate a real signal at different

scales in both the time and frequency domains. Taking into account the non-stationary

nature of the input signal, the DWT provides an approximation with excellent time

and frequency resolution. WPT is a variant of DWT, achieved by recursively

convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT,

which recursively decomposes only the low-pass subband, the DWPT decomposes both

subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an

ensemble (or meta-learning) method that constructs a classifier in an iterative fashion

[34]. It was originally designed for binary classification and was later extended to

multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be

introduced. In Chapter 3, some experiments will be presented to show the

effectiveness of the proposed method. Finally, conclusions will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the

training phase and the classification phase. The training phase is composed of two

main modules: feature extraction and linear discriminant analysis (LDA). The

classification phase consists of three modules: feature extraction, LDA transformation,

and classification. The block diagram of the proposed music genre classification

system is the same as that shown in Fig. 1.2. A detailed description of each module

is given below.

2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral

(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is

proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to

represent the speech spectrum in a compact form. In fact, MFCC have been proven to

be very effective in automatic speech recognition and in modeling the subjective

frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from

an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

ŝ[n] = s[n] − a · s[n−1]  (1)

where s[n] is the current sample and s[n−1] is the previous sample; a typical

value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N

samples). Each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

s̃_i[n] = ŝ_i[n] w[n],  0 ≤ n < N  (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 − 0.46 cos( 2πn / (N−1) ),  0 ≤ n < N  (3)

Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = Σ_{n=0}^{N−1} s̃_i[n] e^{−j2πnk/N},  0 ≤ k < N  (4)

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set

of Mel-scale band-pass filters:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1  (5)

where B is the total number of filters (B is 25 in this study), and I_b^l and I_b^h

denote the low-frequency index and high-frequency index of the b-th band-pass

filter, respectively. A_i[k] is the squared amplitude of X_i[k], that is,

A_i[k] = |X_i[k]|². I_b^l and I_b^h are given as

I_b^l = f_b^l (N / f_s),  I_b^h = f_b^h (N / f_s)  (6)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and

high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT to the logarithm of E(b):

MFCC_i(l) = Σ_{b=0}^{B−1} log10( 1 + E(b) ) cos( πl(b + 0.5) / B ),  0 ≤ l < L  (7)

where L is the length of the MFCC feature vector (L is 20 in this study).

Therefore, the MFCC feature vector can be represented as follows:

x_MFCC = [MFCC(0), MFCC(1), …, MFCC(L−1)]^T  (8)

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)

Table 2.1 The range of each triangular band-pass filter

Filter number | Frequency interval (Hz)
0  | (0, 200]
1  | (100, 300]
2  | (200, 400]
3  | (300, 500]
4  | (400, 600]
5  | (500, 700]
6  | (600, 800]
7  | (700, 900]
8  | (800, 1000]
9  | (900, 1149]
10 | (1000, 1320]
11 | (1149, 1516]
12 | (1320, 1741]
13 | (1516, 2000]
14 | (1741, 2297]
15 | (2000, 2639]
16 | (2297, 3031]
17 | (2639, 3482]
18 | (3031, 4000]
19 | (3482, 4595]
20 | (4000, 5278]
21 | (4595, 6063]
22 | (5278, 6964]
23 | (6063, 8000]
24 | (6964, 9190]
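The six steps above can be sketched compactly as follows. This is a simplified illustration, not the thesis implementation: the rectangular band sums follow Eq. (5), but the default band edges below are hypothetical placeholders rather than the Table 2.1 filterbank.

```python
import numpy as np

def mfcc(signal, fs, frame_len=512, hop=256, bands=None, L=20, a=0.95):
    """Sketch of the MFCC steps above: pre-emphasis, framing, Hamming
    window, FFT, band-energy summation, log, DCT. `bands` is a list of
    (f_low, f_high) edges in Hz; the triangular weighting of a standard
    mel filterbank is simplified here to a rectangular sum (Eq. 5)."""
    if bands is None:  # placeholder edges, not the Table 2.1 filterbank
        edges = np.linspace(0, fs / 2, 26)
        bands = list(zip(edges[:-1], edges[1:]))
    s = np.append(signal[0], signal[1:] - a * signal[:-1])    # Eq. (1)
    w = np.hamming(frame_len)                                 # Eq. (3)
    n_frames = 1 + (len(s) - frame_len) // hop
    B, mfccs = len(bands), []
    for i in range(n_frames):
        frame = s[i * hop : i * hop + frame_len] * w          # Eq. (2)
        A = np.abs(np.fft.rfft(frame, frame_len)) ** 2        # Eq. (4)
        E = np.array([A[int(fl * frame_len / fs) : int(fh * frame_len / fs) + 1].sum()
                      for fl, fh in bands])                   # Eqs. (5)-(6)
        logE = np.log10(1.0 + E)
        c = [np.sum(logE * np.cos(np.pi * l * (np.arange(B) + 0.5) / B))
             for l in range(L)]                               # Eq. (7)
        mfccs.append(c)
    return np.array(mfccs)  # shape (n_frames, L); each row is Eq. (8)
```

Each returned row is one frame's x_MFCC vector; a whole track therefore yields a (T, L) trajectory for the later modulation analysis.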

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It

considers the spectral peak and valley in each subband independently. In general,

spectral peaks correspond to harmonic components, and spectral valleys to the

non-harmonic components or noise in music signals. Therefore, the difference

between spectral peaks and spectral valleys reflects the spectral contrast

distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The

detailed steps are described below.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale

filters shown in Table 2.2. The octave-scale filtering operation can be

described as follows:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1  (9)

where B is the number of subbands, and I_b^l and I_b^h denote the

low-frequency index and high-frequency index of the b-th band-pass filter, respectively.

A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|².

I_b^l and I_b^h are given as

I_b^l = f_b^l (N / f_s),  I_b^h = f_b^h (N / f_s)  (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high

frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th

subband, where N_b is the number of FFT frequency bins in the b-th subband.

Without loss of generality, let the magnitude spectrum be sorted in

decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and

spectral valley in the b-th subband are then estimated as follows:

Peak(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,i} )  (11)

Valley(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b−i+1} )  (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral

contrast is given by the difference between the spectral peak and the spectral

valley:

SC(b) = Peak(b) − Valley(b)  (13)

The feature vector of an audio frame consists of the spectral contrasts and the

spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can

be represented as follows:

x_OSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T  (14)
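A per-frame sketch of the peak/valley selection in Eqs. (11)-(13), assuming the squared-magnitude spectrum and the subband bin ranges are already available (the small constant added before the logarithm is a numerical guard, not part of the original formulation):

```python
import numpy as np

def osc_frame(A, band_bins, alpha=0.2):
    """Peak/valley selection (Eqs. 11-13) for one frame. `A` is the
    squared-magnitude spectrum of the frame; `band_bins` lists (lo, hi)
    FFT-bin index pairs for each octave subband (cf. Table 2.2)."""
    valleys, contrasts = [], []
    for lo, hi in band_bins:
        M = np.sort(A[lo:hi + 1])[::-1]         # decreasing order
        Nb = len(M)
        k = max(1, int(round(alpha * Nb)))      # alpha*Nb neighborhood
        peak = np.log(1e-12 + M[:k].mean())     # Eq. (11)
        valley = np.log(1e-12 + M[-k:].mean())  # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)         # Eq. (13)
    return np.array(valleys + contrasts)        # Eq. (14) layout
```

A subband containing a strong spectral peak yields a large contrast value, while a flat (noise-like) subband yields a contrast near zero.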

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)

Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number | Frequency interval (Hz)
0 | [0, 0]
1 | (0, 100]
2 | (100, 200]
3 | (200, 400]
4 | (400, 800]
5 | (800, 1600]
6 | (1600, 3200]
7 | (3200, 6400]
8 | (6400, 12800]
9 | (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE is defined in MPEG-7 for sound classification. The NASE descriptor

provides a representation of the power spectrum of each audio frame. Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband. Fig. 2.3 shows the block diagram for extracting the

NASE feature. For a given music piece, the main steps for computing NASE are

described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames, and each audio frame is multiplied by a Hamming window function

and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N,

where N is the size of the FFT. The power spectrum is defined as the normalized

squared magnitude of the DFT spectrum X(k):

P(k) = (1/(N·E_w)) |X(k)|²,  k = 0, N/2

P(k) = (2/(N·E_w)) |X(k)|²,  0 < k < N/2  (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|²  (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands

spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a

spectrum of 8 octaves (see Fig. 2.4). The subband filtering

operation can be described as follows (see Table 2.3):

ASE_i(b) = Σ_{k=I_b^l}^{I_b^h} P_i(k),  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1  (17)

where B is the number of logarithmic subbands within the frequency range

[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of

the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,

r = 1/2 in this study):

r = 2^j octaves,  −4 ≤ j ≤ 3  (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th

band-pass filter, given as

I_b^l = f_b^l (N / f_s),  I_b^h = f_b^h (N / f_s)  (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and

high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power

spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k),  0 ≤ b ≤ B + 1  (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 log10( ASE(b) ),  0 ≤ b ≤ B + 1  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R,  0 ≤ b ≤ B + 1  (22)

where the RMS-norm gain value R is defined as

R = [ Σ_{b=0}^{B+1} ( ASE_dB(b) )² ]^{1/2}  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power

between 0 Hz and loEdge, a series of coefficients representing power in

logarithmically spaced bands between loEdge and hiEdge, a coefficient representing

power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension

of NASE is B + 3. Thus, the NASE feature vector of an audio frame can be

represented as follows:

x_NASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T  (24)
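A single-frame sketch of Eqs. (15)-(24); the band edges are passed in by the caller (cf. Table 2.3), and the small constant added to each band energy is a numerical guard not present in the original definition:

```python
import numpy as np

def nase_frame(frame, fs, band_edges):
    """NASE for one frame (Eqs. 15-24): Hamming-windowed power
    spectrum, band energies in dB, RMS normalization. `band_edges`
    is a list of (f_low, f_high) pairs in Hz chosen by the caller."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                                # Eq. (16)
    X = np.fft.rfft(frame * w)
    P = (np.abs(X) ** 2) / (N * Ew)
    P[1:-1] *= 2.0                                     # Eq. (15)
    ase = np.array([P[int(fl * N / fs) : int(fh * N / fs) + 1].sum() + 1e-12
                    for fl, fh in band_edges])         # Eq. (20)
    ase_db = 10.0 * np.log10(ase)                      # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                   # Eq. (23)
    return np.concatenate([[R], ase_db / R])           # Eqs. (22), (24)
```

Because each NASE coefficient is divided by R, the vector of NASE values always has unit RMS norm; R itself is kept as the first component, as in Eq. (24).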


Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 band coefficients, one coefficient above hiEdge)

Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number | Frequency interval (Hz)
0  | (0, 62]
1  | (62, 88]
2  | (88, 125]
3  | (125, 176]
4  | (176, 250]
5  | (250, 353]
6  | (353, 500]
7  | (500, 707]
8  | (707, 1000]
9  | (1000, 1414]
10 | (1414, 2000]
11 | (2000, 2828]
12 | (2828, 4000]
13 | (4000, 5656]
14 | (5656, 8000]
15 | (8000, 11313]
16 | (11313, 16000]
17 | (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only the short-term frame-based characteristics of

audio signals. In order to capture the time-varying behavior of music signals, we

employ modulation spectral analysis on MFCC, OSC, and NASE to observe the

variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is

applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC,

and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive

overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.

The modulation spectrogram is obtained by applying the FFT independently to

each feature value along the time trajectory within a texture window of

length W:

M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{t×(W/2)+n}[l] e^{−j2πmn/W},  0 ≤ m < W, 0 ≤ l < L  (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and l is the MFCC coefficient index. In

this study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows. The representative modulation spectrogram

of a music track is derived by time-averaging the magnitude modulation

spectrograms of all texture windows:

M^MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|,  0 ≤ m < W, 0 ≤ l < L  (26)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is

decomposed into J logarithmically spaced modulation subbands. In this study,

the number of modulation subbands is 8 (J = 8). The frequency interval of

each modulation subband is shown in Table 2.4. For each feature value, the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated:

MSP^MFCC(j, l) = max_{Φ_j^l ≤ m < Φ_j^h} M^MFCC(m, l)  (27)

MSV^MFCC(j, l) = min_{Φ_j^l ≤ m < Φ_j^h} M^MFCC(m, l)  (28)

where Φ_j^l and Φ_j^h are, respectively, the low and high modulation frequency

indices of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components, and the MSVs to the

non-rhythmic components in the modulation subbands. Therefore, the

difference between MSP and MSV reflects the modulation spectral

contrast distribution:

MSC^MFCC(j, l) = MSP^MFCC(j, l) − MSV^MFCC(j, l)  (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the

modulation spectral contrast (or valley) information. Therefore, the feature dimension of

MMFCC is 2×20×8 = 320.
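The three steps can be sketched as follows for a generic (T_frames × L) feature trajectory; the arrays returned here are arranged as J×L (subband-by-feature) rather than the L×J layout described above, and the modulation subband index ranges follow Table 2.4:

```python
import numpy as np

def modulation_contrast(feats, W=512,
                        subbands=((0, 2), (2, 4), (4, 8), (8, 16),
                                  (16, 32), (32, 64), (64, 128), (128, 256))):
    """Modulation spectral contrast (Eqs. 25-29) for a (T_frames, L)
    matrix of per-frame feature values (MFCC here; OSC and NASE are
    analogous). Texture windows of length W with 50% overlap."""
    hop = W // 2
    n_tex = (feats.shape[0] - W) // hop + 1
    # |FFT| along time in each texture window, averaged (Eqs. 25-26)
    M = np.mean([np.abs(np.fft.fft(feats[t * hop : t * hop + W], axis=0))
                 for t in range(n_tex)], axis=0)                 # (W, L)
    MSP = np.stack([M[lo:hi].max(axis=0) for lo, hi in subbands])  # Eq. (27)
    MSV = np.stack([M[lo:hi].min(axis=0) for lo, hi in subbands])  # Eq. (28)
    return MSP - MSV, MSV              # MSC (Eq. 29) and MSV matrices

feats = np.random.rand(2048, 20)       # e.g. 20 MFCC values per frame
MSC, MSV = modulation_contrast(feats)
```

Stacking the MSC and MSV matrices gives the 2×20×8 = 320 MMFCC values cited above.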

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum

analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for

extracting MOSC, and the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive

overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The

modulation spectrogram is obtained by applying the FFT independently to each

feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} OSC_{t×(W/2)+n}[d] e^{−j2πmn/W},  0 ≤ m < W, 0 ≤ d < D  (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and d is the OSC coefficient index. In this

study, W is 512, which is about 6 seconds, with 50% overlap between two

successive texture windows. The representative modulation spectrogram of a

music track is derived by time-averaging the magnitude modulation

spectrograms of all texture windows:

M^OSC(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W, 0 ≤ d < D  (31)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is

decomposed into J logarithmically spaced modulation subbands. In this study,

the number of modulation subbands is 8 (J = 8). The frequency interval of

each modulation subband is shown in Table 2.4. For each feature value, the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated:

MSP^OSC(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M^OSC(m, d)  (32)

MSV^OSC(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M^OSC(m, d)  (33)

where Φ_j^l and Φ_j^h are, respectively, the low and high modulation frequency

indices of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components, and the MSVs to the

non-rhythmic components in the modulation subbands. Therefore, the

difference between MSP and MSV reflects the modulation spectral

contrast distribution:

MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d)  (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the

modulation spectral contrast (or valley) information. Therefore, the feature dimension of

MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum

analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for

extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive

overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The

modulation spectrogram is obtained by applying the FFT independently to each

feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} NASE_{t×(W/2)+n}[d] e^{−j2πmn/W},  0 ≤ m < W, 0 ≤ d < D  (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and d is the NASE coefficient index. In

this study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows. The representative modulation spectrogram

of a music track is derived by time-averaging the magnitude modulation

spectrograms of all texture windows:

M^NASE(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W, 0 ≤ d < D  (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is

decomposed into J logarithmically spaced modulation subbands (see Table 2.4).

In this study, the number of modulation subbands is 8 (J = 8). For each feature

value, the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated:

MSP^NASE(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M^NASE(m, d)  (37)

MSV^NASE(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M^NASE(m, d)  (38)

where Φ_j^l and Φ_j^h are, respectively, the low and high modulation frequency

indices of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components, and the MSVs to the

non-rhythmic components in the modulation subbands. Therefore, the

difference between MSP and MSV reflects the modulation spectral

contrast distribution:

MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d)  (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the

modulation spectral contrast (or valley) information. Therefore, the feature dimension of

MASE is 2×19×8 = 304.

Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowing/average modulation spectrum → contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2)     | [0, 0.33)
1 | [2, 4)     | [0.33, 0.66)
2 | [4, 8)     | [0.66, 1.32)
3 | [8, 16)    | [1.32, 2.64)
4 | [16, 32)   | [2.64, 5.28)
5 | [32, 64)   | [5.28, 10.56)
6 | [64, 128)  | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral

feature value at different modulation frequencies, which reflects the beat interval of a

music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to

the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9).

To reduce the dimension of the feature space, the mean and standard deviation along

each row (and each column) of the MSC and MSV matrices are computed as the

feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of

the MSC and MSV matrices of MMFCC can be computed as follows:

u_MSC-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSC^MFCC(j, l)  (40)

σ_MSC-row^MFCC(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^MFCC(j, l) − u_MSC-row^MFCC(l) )² ]^{1/2}  (41)

u_MSV-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSV^MFCC(j, l)  (42)

σ_MSV-row^MFCC(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^MFCC(j, l) − u_MSV-row^MFCC(l) )² ]^{1/2}  (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L

and can be represented as

f_row^MFCC = [u_MSC-row^MFCC(0), σ_MSC-row^MFCC(0), u_MSV-row^MFCC(0), σ_MSV-row^MFCC(0), …,

u_MSC-row^MFCC(L−1), σ_MSC-row^MFCC(L−1), u_MSV-row^MFCC(L−1), σ_MSV-row^MFCC(L−1)]^T  (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)

column of the MSC and MSV matrices can be computed as follows:

u_MSC-col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSC^MFCC(j, l)  (45)

σ_MSC-col^MFCC(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSC^MFCC(j, l) − u_MSC-col^MFCC(j) )² ]^{1/2}  (46)

u_MSV-col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSV^MFCC(j, l)  (47)

σ_MSV-col^MFCC(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSV^MFCC(j, l) − u_MSV-col^MFCC(j) )² ]^{1/2}  (48)

Thus, the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_col^MFCC = [u_MSC-col^MFCC(0), σ_MSC-col^MFCC(0), u_MSV-col^MFCC(0), σ_MSV-col^MFCC(0), …,

u_MSC-col^MFCC(J−1), σ_MSC-col^MFCC(J−1), u_MSV-col^MFCC(J−1), σ_MSV-col^MFCC(J−1)]^T  (49)

If the row-based modulation spectral feature vector and the column-based

modulation spectral feature vector are combined, a larger feature vector of

size (4L + 4J) can be obtained:

f^MFCC = [(f_row^MFCC)^T, (f_col^MFCC)^T]^T  (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80, and the

column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the

row-based and column-based modulation spectral feature vectors results

in a feature vector of length 4L + 4J. That is, the overall

feature dimension of SMMFCC is 80 + 32 = 112.
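A sketch of the row/column aggregation; for simplicity the output groups all means and standard deviations rather than interleaving them per index as in Eqs. (44) and (49), which changes the ordering but not the dimension:

```python
import numpy as np

def aggregate(MSC, MSV):
    """Row/column mean-and-std aggregation (Eqs. 40-50). MSC and MSV
    are (L, J) matrices; the output has length 4L + 4J."""
    parts = []
    for M in (MSC, MSV):
        parts += [M.mean(axis=1), M.std(axis=1)]  # per-row stats (Eqs. 40-43)
    for M in (MSC, MSV):
        parts += [M.mean(axis=0), M.std(axis=0)]  # per-column stats (Eqs. 45-48)
    return np.concatenate(parts)                  # combined vector (Eq. 50)

# L = 20 MFCC values, J = 8 modulation subbands -> 4*20 + 4*8 = 112 values
f = aggregate(np.random.rand(20, 8), np.random.rand(20, 8))
```

The same function applies unchanged to the OSC and NASE matrices (Sections 2.1.5.2 and 2.1.5.3), with L replaced by the corresponding feature dimension D.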

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of

the MSC and MSV matrices of MOSC can be computed as follows:

u_MSC-row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSC^OSC(j, d)  (51)

σ_MSC-row^OSC(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^OSC(j, d) − u_MSC-row^OSC(d) )² ]^{1/2}  (52)

u_MSV-row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSV^OSC(j, d)  (53)

σ_MSV-row^OSC(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^OSC(j, d) − u_MSV-row^OSC(d) )² ]^{1/2}  (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

f_row^OSC = [u_MSC-row^OSC(0), σ_MSC-row^OSC(0), u_MSV-row^OSC(0), σ_MSV-row^OSC(0), …,

u_MSC-row^OSC(D−1), σ_MSC-row^OSC(D−1), u_MSV-row^OSC(D−1), σ_MSV-row^OSC(D−1)]^T  (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)

column of the MSC and MSV matrices can be computed as follows:

u_MSC-col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSC^OSC(j, d)  (56)

σ_MSC-col^OSC(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSC^OSC(j, d) − u_MSC-col^OSC(j) )² ]^{1/2}  (57)

u_MSV-col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSV^OSC(j, d)  (58)

σ_MSV-col^OSC(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSV^OSC(j, d) − u_MSV-col^OSC(j) )² ]^{1/2}  (59)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

Tminusminusminus minusminusminus JJuJ OSCcolMSV

OSCcolMSV

OSCcolMSC σσ

(60)

If the row-based modulation spectral feature vector and column-based

modulation spectral feature vector are combined together a larger feature vector of

d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the

)]1( )1( )1( )1(

)0( )0( )0( )0([

minus

=

minus

minusminusminusminus

Ju

uuOSC

colMSC

OSCcolMSV

OSCcolMSV

OSCcolMSC

OSCcolMSC

OSCcol σσ Lf

size (4D+4J) can be obtained

f OSC= [( OSCrowf )T ( OSC

colf )T]T (61)

In summary the row-base

column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the

row-based modulation spectral feature vector and column-based modulation spectral

feature vector will result in a feature vector of length 4L+4J That is the overall

35

rived from the d-th (0 le d lt D) row of

ows

feature dimension of SMOSC is 80+32 = 112

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u^{NASE}_{MSC,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(d,j)   (62)

\sigma^{NASE}_{MSC,row}(d) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(d,j) - u^{NASE}_{MSC,row}(d) \right)^{2} }   (63)

u^{NASE}_{MSV,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(d,j)   (64)

\sigma^{NASE}_{MSV,row}(d) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(d,j) - u^{NASE}_{MSV,row}(d) \right)^{2} }   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{NASE}_{row} = [ u^{NASE}_{MSC,row}(0), \sigma^{NASE}_{MSC,row}(0), u^{NASE}_{MSV,row}(0), \sigma^{NASE}_{MSV,row}(0), \ldots, u^{NASE}_{MSC,row}(D-1), \sigma^{NASE}_{MSC,row}(D-1), u^{NASE}_{MSV,row}(D-1), \sigma^{NASE}_{MSV,row}(D-1) ]^{T}   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{NASE}_{MSC,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(d,j)   (67)

\sigma^{NASE}_{MSC,col}(j) = \sqrt{ \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(d,j) - u^{NASE}_{MSC,col}(j) \right)^{2} }   (68)

u^{NASE}_{MSV,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(d,j)   (69)

\sigma^{NASE}_{MSV,col}(j) = \sqrt{ \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(d,j) - u^{NASE}_{MSV,col}(j) \right)^{2} }   (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{NASE}_{col} = [ u^{NASE}_{MSC,col}(0), \sigma^{NASE}_{MSC,col}(0), u^{NASE}_{MSV,col}(0), \sigma^{NASE}_{MSV,col}(0), \ldots, u^{NASE}_{MSC,col}(J-1), \sigma^{NASE}_{MSC,col}(J-1), u^{NASE}_{MSV,col}(J-1), \sigma^{NASE}_{MSV,col}(J-1) ]^{T}   (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [ (f^{NASE}_{row})^{T}, (f^{NASE}_{col})^{T} ]^{T}   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 2.8 The row-based modulation spectral features: for each feature dimension, the mean and standard deviation of the MSC and MSV values within a texture window are computed along the modulation frequency axis.

Fig. 2.9 The column-based modulation spectral features: for each modulation subband, the mean and standard deviation of the MSC and MSV values within a texture window are computed along the feature dimension axis.

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_{c} = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_{c} is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_{c}:

\hat{f}_{c}(m) = \frac{ \bar{f}_{c}(m) - f_{min}(m) }{ f_{max}(m) - f_{min}(m) },  1 ≤ c ≤ C   (74)

where C is the number of classes, \hat{f}_{c}(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)
f_{min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
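As a minimal sketch of Eqs. (73)-(75) (an illustration, not the author's code; the helper name `minmax_normalize` and the toy data are hypothetical), the per-genre averaging and the min-max normalization over all training vectors can be written as:

```python
import numpy as np

def minmax_normalize(representatives, training):
    """Linearly normalize each feature dimension of the per-genre
    representative vectors to [0, 1], using the minimum and maximum
    observed over ALL training vectors, as in Eqs. (74)-(75)."""
    f_min = training.min(axis=0)
    f_max = training.max(axis=0)
    # Guard against a zero dynamic range (constant feature)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (representatives - f_min) / span

# Toy example: 3 training vectors of one genre, 2 feature dimensions
training = np.array([[0.0, 10.0], [4.0, 30.0], [2.0, 20.0]])
reps = training.mean(axis=0, keepdims=True)   # Eq. (73) for one genre
print(minmax_normalize(reps, training))       # [[0.5 0.5]]
```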

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^{T}   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^{T}   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\left( (A^{T} S_W A)^{-1} (A^{T} S_B A) \right)   (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the corresponding eigenvalues. Thus S_W \Phi = \Phi \Lambda. Each training vector x is then whitening transformed by \Phi \Lambda^{-1/2}:

w = (\Phi \Lambda^{-1/2})^{T} x   (79)

It can be shown that the whitened within-class scatter matrix S_W^{w} = (\Phi \Lambda^{-1/2})^{T} S_W (\Phi \Lambda^{-1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^{w} = (\Phi \Lambda^{-1/2})^{T} S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_B^{w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues form the column vectors of the transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi   (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^{T} x   (81)
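The whitening-plus-LDA pipeline of Eqs. (76)-(81) can be sketched as follows. This is an illustrative NumPy implementation under the assumption that S_W is nonsingular; the function name `whitened_lda` and the synthetic two-class data are hypothetical, not from the thesis:

```python
import numpy as np

def whitened_lda(X, labels, h):
    """Whiten with the eigendecomposition of S_W, then project onto the
    top-h eigenvectors of the whitened between-class scatter matrix."""
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d)); Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                # Eq. (76)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * diff @ diff.T                # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                    # S_W = Phi diag(lam) Phi^T
    W = Phi @ np.diag(1.0 / np.sqrt(lam))            # whitening, Eq. (79)
    Sb_w = W.T @ Sb @ W                              # whitened S_B
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:h]]        # top-h eigenvectors
    A = W @ Psi                                      # A_WLDA, Eq. (80)
    return X @ A                                     # row-wise y = A^T x, Eq. (81)

# Synthetic example: two well-separated Gaussian classes in 5 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
Y = whitened_lda(X, y, h=1)
print(Y.shape)  # (60, 1)
```

With C = 6 genres, the thesis keeps h = C-1 = 5 discriminant directions; the sketch exposes h as a parameter for clarity.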

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_{c} = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_{c} is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_{c})   (83)
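The nearest centroid decision rule of Eq. (83) reduces to a few lines. The following sketch is illustrative (the function name `nearest_centroid_predict` and the toy centroids are hypothetical, not the thesis code):

```python
import numpy as np

def nearest_centroid_predict(y, centroids):
    """Return the index of the genre centroid with the smallest
    Euclidean distance to the transformed feature vector y (Eq. 83)."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(dists))

# Toy centroids for three hypothetical genres in a 2-D LDA space
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
print(nearest_centroid_predict(np.array([4.2, 4.9]), centroids))  # 1
```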

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
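Because P_c is the empirical class proportion, Eq. (84) is equivalent to dividing the total number of correctly classified tracks by the total number of test tracks. A small sketch (the helper name `overall_accuracy` is hypothetical; the numbers are the diagonal of confusion matrix (d) in Table 3.6 and the per-class test-set totals):

```python
def overall_accuracy(per_class_correct, per_class_total):
    """Overall CA = sum_c P_c * CA_c (Eq. 84), with P_c = n_c / N and
    CA_c = correct_c / n_c; this telescopes to total correct / N."""
    total = sum(per_class_total)
    ca = 0.0
    for correct, n in zip(per_class_correct, per_class_total):
        ca += (n / total) * (correct / n)
    return ca

# Diagonal of Table 3.6(d) and the class totals of the 729-track test set
correct = [300, 95, 20, 35, 79, 93]
totals  = [320, 114, 26, 45, 102, 122]
print(round(overall_accuracy(correct, totals) * 100, 2))  # 85.32
```

This reproduces the 85.32% figure reported for the combined SMMFCC3+SMOSC3+SMASE3 feature set.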

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1 and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector

Feature Set                    CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Columns correspond to the actual genres; for each feature set the first matrix lists track counts and the second the corresponding percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         275           0      2          0        1     19
Electronic        0          91      0          1        7      6
Jazz              6           0     18          0        0      4
MetalPunk         2           3      0         36       20      4
PopRock           4          12      5          8       70     14
World            33           8      1          0        4     75
Total           320         114     26         45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         292           1      1          0        2     10
Electronic        1          89      1          2       11     11
Jazz              4           0     19          1        1      6
MetalPunk         0           5      0         32       21      3
PopRock           0          13      3         10       61      8
World            23           6      2          0        6     84
Total           320         114     26         45      102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         286           3      1          0        3     18
Electronic        0          87      1          1        9      5
Jazz              5           4     17          0        0      9
MetalPunk         0           4      1         36       18      4
PopRock           1          10      3          7       68     13
World            28           6      3          1        4     73
Total           320         114     26         45      102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           0      1          0        0      9
Electronic        0          96      1          1        9      9
Jazz              2           1     21          0        0      1
MetalPunk         0           1      0         34        8      1
PopRock           1           9      2          9       80     16
World            17           7      1          1        5     86
Total           320         114     26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2 and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 3.3 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for each column-based modulation spectral feature vector

Feature Set                    CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns correspond to the actual genres; for each feature set the first matrix lists track counts and the second the corresponding percentages.

(a) SMMFCC2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         272           1      1          0        6     22
Electronic        0          84      0          2        8      4
Jazz             13           1     19          1        2     19
MetalPunk         2           7      0         39       30      4
PopRock           0          11      3          3       47     19
World            33          10      3          0        9     54
Total           320         114     26         45      102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         262           2      0          0        3     33
Electronic        0          83      0          1        9      6
Jazz             17           1     20          0        6     20
MetalPunk         1           5      0         33       21      2
PopRock           0          17      4         10       51     10
World            40           6      2          1       12     51
Total           320         114     26         45      102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         277           0      0          0        2     29
Electronic        0          83      0          1        5      2
Jazz              9           3     17          1        2     15
MetalPunk         1           5      1         35       24      7
PopRock           2          13      1          8       57     15
World            31          10      7          0       12     54
Total           320         114     26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         289           5      0          0        3     18
Electronic        0          89      0          2        4      4
Jazz              2           3     19          0        1     10
MetalPunk         2           2      0         38       21      2
PopRock           0          12      5          4       61     11
World            27           3      2          1       12     77
Total           320         114     26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC, OSC and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns correspond to the actual genres; for each feature set the first matrix lists track counts and the second the corresponding percentages.

(a) SMMFCC3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      1          0        3     19
Electronic        0          86      0          1        7      5
Jazz              2           0     18          0        0      3
MetalPunk         1           4      0         35       18      2
PopRock           1          16      4          8       67     13
World            16           6      3          1        7     80
Total           320         114     26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           0      0          0        1     13
Electronic        0          90      1          2        9      6
Jazz              0           0     21          0        0      4
MetalPunk         0           2      0         31       21      2
PopRock           0          11      3         10       64     10
World            20          11      1          2        7     87
Total           320         114     26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         296           2      1          0        0     17
Electronic        1          91      0          1        4      3
Jazz              0           2     19          0        0      5
MetalPunk         0           2      1         34       20      8
PopRock           2          13      4          8       71      8
World            21           4      1          2        7     81
Total           320         114     26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      0          0        0      8
Electronic        2          95      0          2        7      9
Jazz              1           1     20          0        0      0
MetalPunk         0           0      0         35       10      1
PopRock           1          10      3          7       79     11
World            16           6      3          1        6     93
Total           320         114     26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs achieves better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2 and SMMFCC3 denote respectively the row-based, column-based and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature value

Feature Set                    MSCs & MSVs     MSE
SMMFCC1                              77.50   72.02
SMMFCC2                              70.64   69.82
SMMFCC3                              80.38   79.15
SMOSC1                               79.15   77.50
SMOSC2                               68.59   70.51
SMOSC3                               81.34   80.11
SMASE1                               77.78   76.41
SMASE2                               71.74   71.06
SMASE3                               81.21   79.15
SMMFCC1+SMOSC1+SMASE1                84.64   85.08
SMMFCC2+SMOSC2+SMASE2                78.60   79.01
SMMFCC3+SMOSC3+SMASE3                85.32   85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically-spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.

[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 197-200.

[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histogram in audio and symbolic music information retrieval, Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Acoustical Society of America 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-8.

[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech and Language Processing 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.


Chapter 1

Introduction

1.1 Motivation

With the development of computer networks, it becomes more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, it is helpful for managing a vast amount of music tracks when they are properly categorized in advance. In general, retail or online music stores often organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers, but determining the music genre of a music track in this way is laborious and time-consuming work. Therefore, a number of supervised classification techniques have been developed for the automatic classification of unlabeled music tracks [1-11]. In this study, we focus on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification thus plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies try to examine a set of classifiers to improve the classification performance; however, the improvement is limited. In fact, employing effective feature sets has a much greater effect on classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to clearly define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) people often prefer to search for music by browsing hierarchical catalogs; (2) taxonomy structures identify the relationships or dependence between music genres, so hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves classification efficiency and accuracy; (3) classification errors become more acceptable with a taxonomy than with direct music genre classification, since the coarse-to-fine approach concentrates the classification errors at a given level of the hierarchy.

Burred and Lerch [13] have developed a hierarchical taxonomy for music genre

classification as shown in Fig 11 Rather than making a single decision to classify a

given music into one of all music genres (direct approach) the hierarchical approach

makes successive decisions at each branch point of the taxonomy hierarchy

Additionally appropriate and variant features can be employed at each branch point

of the taxonomy Therefore the hierarchical classification approach allows the

managers to trace at which level the classification errors occur frequently Barbedo

and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The

hierarchical structure was constructed in a bottom-up manner instead of a top-down one This is because it is easy to merge leaf classes into the same

parent class in the bottom-up structure Therefore the upper layer can be easily

constructed In their experimental results the hierarchical bottom-up approach outperformed the top-down approach by about 3% to 5% in classification accuracy

Li and Ogihara [15] investigated the effect of two different taxonomy structures

for music genre classification They also proposed an approach to automatic

generation of music genre taxonomies based on the confusion matrix computed by

linear discriminant projection This approach avoids the time-consuming and expensive task of manually constructing taxonomies and is also helpful for music collections for which no natural taxonomy exists [16] According to a given genre

taxonomy many different approaches have been proposed to classify the music genre

for raw music tracks In general a music genre classification system consists of three

major aspects feature extraction feature selection and feature classification Fig 13

shows the block diagram of a music genre classification system

Fig 11 A hierarchical audio taxonomy


Fig 12 A hierarchical audio taxonomy

Fig 13 A music genre classification system


121 Feature Extraction

1211 Short-term Features

The most important aspect of music genre classification is to determine which

features are relevant and how to extract them Tzanetakis and Cook [1] employed

three feature sets including timbral texture rhythmic content and pitch content to

classify audio collections by their musical genres

12111 Timbral features

Timbral features are generally characterized by the properties related to

instrumentations or sound sources such as music speech or environment signals The

features used to represent timbral texture are described as follows

(1) Low-energy Feature it is defined as the percentage of analysis windows that

have RMS energy less than the average RMS energy across the texture window The

size of texture window should correspond to the minimum amount of time required to

identify a particular music texture

(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It

is defined as

ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\right|

where the sign function will return 1 for positive input and 0 for negative input and

xt[n] is the time domain signal for frame t
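As a plain-NumPy illustration (not code from the thesis), the frame-level ZCR above can be computed as:

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame: half the summed absolute differences of the
    sign sequence, with sign() = 1 for non-negative and 0 for negative."""
    signs = (frame >= 0).astype(int)
    return 0.5 * np.sum(np.abs(np.diff(signs)))

# A frame that alternates around zero changes sign between every sample pair
frame = np.array([1.0, -1.0, 1.0, -1.0])
print(zero_crossing_rate(frame))  # 1.5 (three sign changes, halved)
```

Noisy signals cross zero often and give a high ZCR, while voiced or tonal frames give a low one.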

(3) Spectral Centroid spectral centroid is defined as the center of gravity of the

magnitude spectrum

C_t = \frac{\sum_{n=1}^{N} n \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}

where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the

magnitude of the n-th frequency bin of the t-th frame

(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of

the signal

SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}

(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as

the frequency R_t below which 85% of the magnitude distribution is concentrated

\sum_{k=0}^{R_t} S[k] = 0.85 \times \sum_{k=0}^{N-1} S[k]

(6) Spectral Flux The spectral flux measures the amount of local spectral change It

is defined as the squared difference between the normalized magnitudes of successive

spectral distributions

SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2

where Nt[k] and Nt-1[k] are the normalized magnitude spectra of the t-th frame and

the (t-1)-th frame respectively
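A compact sketch of these four frame-level descriptors, assuming `mag` and `prev_mag` hold the magnitude spectra of the current and previous frame with bins indexed from 1 as in the equations:

```python
import numpy as np

def spectral_descriptors(mag, prev_mag):
    """Centroid, bandwidth, roll-off and flux of one magnitude spectrum."""
    n = np.arange(1, len(mag) + 1)
    centroid = np.sum(n * mag) / np.sum(mag)
    bandwidth = np.sum((n - centroid) ** 2 * mag) / np.sum(mag)
    # Roll-off: smallest bin index R_t with 85% of the magnitude below it
    cum = np.cumsum(mag)
    rolloff = int(np.searchsorted(cum, 0.85 * cum[-1])) + 1
    # Flux: squared difference of the L1-normalized successive spectra
    flux = np.sum((mag / np.sum(mag) - prev_mag / np.sum(prev_mag)) ** 2)
    return centroid, bandwidth, rolloff, flux

# A spectrum with all energy in bin 3 has centroid 3 and zero bandwidth
c, b, r, f = spectral_descriptors(np.array([0.0, 0.0, 1.0, 0.0]),
                                  np.array([0.0, 0.0, 1.0, 0.0]))
```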

(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech

recognition due to their ability to represent the speech spectrum in a compact form In

human auditory system the perceived pitch is not linear with respect to the physical

frequency of the corresponding tone The mapping between the physical frequency

scale (Hz) and perceived frequency scale (mel) is approximately linear below 1 kHz

and logarithmic at higher frequencies In fact MFCC have been proven to be very

effective in automatic speech recognition and in modeling the subjective frequency

content of audio signals


(8) Octave-based spectral contrast (OSC) OSC was developed to represent the

spectral characteristics of a music piece [3] This feature describes the strength of

spectral peaks and spectral valleys in each sub-band separately It can roughly reflect

the distribution of harmonic and non-harmonic components

(9) Normalized audio spectral envelope (NASE) NASE is specified in the MPEG-7 standard [17] First the audio spectral envelope (ASE) is obtained from the sum of the power spectrum in each logarithmic subband Then each ASE coefficient is normalized with the Root Mean Square (RMS) energy yielding a normalized version of the ASE called NASE

12112 Rhythmic features

The features representing the rhythmic content of a music piece are mainly

derived from the beat histogram including the overall beat strength the main beat and

its strength the period of the main beat and subbeats the relative strength of subbeats

to main beat Many beat-tracking algorithms [18 19] providing an estimate of the

main beat and the corresponding strength have been proposed

12113 Pitch features

Tzanetakis et al [20] extracted pitch features from the pitch histograms of a

music piece The extracted pitch features contain frequency pitch strength and pitch

interval The pitch histogram can be estimated by multiple pitch detection techniques

[21 22] Melody and harmony have been widely used by musicologists to study the

musical structures Scaringella et al [23] proposed a method to extract melody and

harmony features by characterizing the pitch distribution of a short segment like most

melody/harmony analyzers The main difference is that no fundamental frequency chord key or other high-level feature has to be determined in advance

1212 Long-term Features

To find the representative feature vector of a whole music piece the methods

employed to integrate the short-term features into a long-term feature include mean

and standard deviation autoregressive model [9] modulation spectrum analysis [24

25 26] and nonlinear time series analysis

12121 Mean and standard deviation

The mean and standard deviation operation is the most widely used method to integrate the short-term features Let xi = [xi[0] xi[1] … xi[D-1]]T denote the representative D-dimensional feature vector of the i-th frame The mean and standard deviation are calculated as follows

\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1

\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1} \left(x_i[d] - \mu[d]\right)^2\right]^{1/2}, \quad 0 \le d \le D-1

where T is the number of frames of the input signal This statistical method exhibits

no information about the relationship between features as well as the time-varying

behavior of music signals
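The mean/standard-deviation integration above amounts to a few lines of NumPy, where `frames` is an assumed T × D matrix of short-term feature vectors:

```python
import numpy as np

def mean_std_summary(frames):
    """Collapse T short-term D-dim feature vectors (rows of frames)
    into one long-term 2D-dim vector [mu[0..D-1], sigma[0..D-1]]."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)  # population form, 1/T as in the equations
    return np.concatenate([mu, sigma])

# Three 2-dim frame vectors collapse to a single 4-dim track-level vector
frames = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
summary = mean_std_summary(frames)
```

Note how the ordering of the frames is irrelevant to the result, which is exactly the loss of time-varying information the text describes.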

12122 Autoregressive model (AR model)

Meng et al [9] used AR model to analyze the time-varying texture of music

signals They proposed the diagonal autoregressive (DAR) and multivariate

autoregressive (MAR) analysis to integrate the short-term features In DAR each

short-term feature is independently modeled by an AR model The extracted feature


vector includes the mean and variance of all short-term feature vectors as well as the

coefficients of each AR model In MAR all short-term features are modeled by a

MAR model The difference between MAR model and AR model is that MAR

considers the relationship between features The features used in MAR include the

mean vector the covariance matrix of all short-term feature vectors and the

coefficients of the MAR model In addition for a p-order MAR model the feature dimension is p × D × D where D is the dimension of a short-term feature vector

12123 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of

signals along the time axis Kingsbury et al [24] first employed modulation

spectrogram for speech recognition It has been shown that the most sensitive

modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used

modulation spectrum analysis for music content identification They showed that

modulation-scale features along with subband normalization are insensitive to

convolutional noise Shi et al [26] used modulation spectrum analysis to model the

long-term characteristics of music signals in order to extract the tempo feature for

music emotion classification

12124 Nonlinear time series analysis

Non-linear analysis of time series offers an alternative way to describe temporal

structure which is complementary to the analysis of linear correlation and spectral

properties Mierswa and Morik [27] used the reconstructed phase space to extract


features directly from the audio data The mean and standard deviations of the

distances and angles in the phase space with an embedding dimension of two and unit

time lag were used

122 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature space LDA deals with discrimination between classes rather than representations of the various classes The goal of LDA is to minimize the within-class distance while maximizing the between-class distance In LDA an optimal

transformation matrix from an n-dimensional feature space to d-dimensional space is

determined where d ≤ n The transformation should enhance the separability among

different classes The optimal transformation matrix can be exploited to map each

n-dimensional feature vector into a d-dimensional vector The detailed steps will be

described in Chapter 2

In LDA each class is generally modeled by a single Gaussian distribution In fact the music signal is too complex to be modeled by a single Gaussian distribution In addition the same transformation matrix of LDA is used for all the classes which does not consider the class-wise differences
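A minimal textbook-style LDA sketch (eigenvectors of pinv(Sw) Sb; this is the standard formulation, not necessarily the exact recipe detailed in Chapter 2):

```python
import numpy as np

def lda_transform(X, y, d):
    """Project n-dim rows of X with class labels y onto the d leading
    eigenvectors of pinv(Sw) @ Sb, maximizing between-class scatter
    relative to within-class scatter."""
    overall_mean = X.mean(axis=0)
    n = X.shape[1]
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)           # within-class scatter
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)         # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)[:d]
    return X @ evecs[:, order].real

# Two well-separated 3-dim classes collapse to a clearly separated 1-dim axis
X = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
              [5.0, 5.0, 0.0], [5.1, 5.0, 0.0], [5.0, 5.1, 0.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Z = lda_transform(X, y, 1)
```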

123 Feature Classification

Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch

features with GMM classifier to their music genre classification system The

hierarchical genres adopted in their music classification system are Classical Country

Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the

sub-genres contain Choir Orchestra Piano and String Quartet In Jazz the sub-genres contain BigBand Cool Fusion Piano Quartet and Swing The

experiment result shows that GMM with three components achieves the best

classification accuracy

West and Cox [4] constructed a hierarchical frame-based music genre classification system In their classification system a majority vote is taken to decide the final classification The genres adopted in their music classification system are Rock Classical Heavy Metal Drum & Bass Reggae and Jungle They take MFCC and OSC as features and compare the performance with/without a decision tree classifier for a single Gaussian classifier GMM with three components and LDA In their experiment the feature vector with GMM classifier and decision tree classifier achieves the best accuracy of 82.79%

Xu et al [29] applied SVM to discriminate between pure music and vocal music

The SVM learning algorithm is applied to obtain the classification parameters

according to the calculated features It is demonstrated that SVM achieves better

performance than traditional Euclidean distance methods and hidden Markov model

(HMM) methods

Esmaili et al [30] used some low-level features (MFCC entropy centroid bandwidth etc) and LDA for music genre classification In their system the classification accuracy is 93.0% for the classification of five music genres Rock Classical Folk Jazz and Pop

Bagci and Erzin [8] constructed a novel frame-based music genre classification

system In their classification system some invalid frames are first detected and

discarded for classification purpose To determine whether a frame is valid or not a

GMM model is constructed for each music genre These GMM models are then used


to sift the frames which are unable to be correctly classified and each GMM model of

a music genre is updated for each correctly classified frame Moreover a GMM

model is employed to represent the invalid frames In their experiment the feature

vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral

roll-off spectral flux and zero-crossing rate) as well as the first- and second-order

derivative of these timbral features Their musical genre dataset includes ten genre

types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock

The classification accuracy can reach 88.60% when the frame length is 30s and each

GMM is modeled by 48 Gaussian distributions

Umapathy et al [31] used local discriminant bases (LDB) technique to measure

the dissimilarity of the LDB nodes of any two classes and extract features from these

high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition

to construct a five-level tree for a music signal Then two novel features the energy

distribution over frequencies (D1) and nonstationarity index (D2) are used to measure

the dissimilarity of the LDB nodes of any two classes In their classification system

the feature dimension is 30 including the energies and variances of the basis vector

coefficients of the first 15 high dissimilarity nodes The experiment results show that

when the LDB feature vector is combined with MFCC and by using LDA analysis the average classification accuracy is 91% for the first level (artificial and natural sounds) 99% for the second level (instrumental and automobile human and nonhuman) and 95% for the third level (drums flute and piano aircraft and helicopter male and female speech animals birds and insects)

Grimaldi et al [11 32] used a set of features based on discrete wavelet packet

transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a

well-known signal analysis methodology able to approximate a real signal at different


scales both in time and frequency domain Taking into account the non-stationary

nature of the input signal the DWT provides an approximation with excellent time

and frequency resolution DWPT is a variant of DWT which is achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters Unlike DWT which recursively decomposes only the low-pass subband DWPT decomposes both subbands at each level

Bergstra et al [33] used AdaBoost for music classification AdaBoost is an

ensemble (or meta-learning) method that constructs a classifier in an iterative fashion

[34] It was originally designed for binary classification and was later extended to

multiclass classification using several different strategies

13 Outline of Thesis

In Chapter 2 the proposed method for music genre classification will be

introduced In Chapter 3 some experiments will be presented to show the

effectiveness of the proposed method Finally conclusion will be given in Chapter 4

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases the

training phase and the classification phase The training phase is composed of two

main modules feature extraction and linear discriminant analysis (LDA) The

classification phase consists of three modules feature extraction LDA transformation

and classification The block diagram of the proposed music genre classification

system is the same as that shown in Fig 13 A detailed description of each module

will be described below


21 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral

(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is

proposed for music genre classification

211 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to

represent the speech spectrum in a compact form In fact MFCC have been proven to

be very effective in automatic speech recognition and in modeling the subjective

frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from

an input signal The detailed steps will be given below

Step 1 Pre-emphasis

\hat{s}[n] = s[n] - a \times s[n-1] (1)

where s[n] is the current sample and s[n−1] is the previous sample a typical value for a is 0.95

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N

samples) Each pair of consecutive frames overlaps by M samples

Step 3 Windowing

Each frame is multiplied by a Hamming window

\tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1 (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 (3)


Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using FFT

X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j2\pi kn/N}, \quad 0 \le k \le N-1 (4)

where k is the frequency index

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set

of Mel-scale band-pass filters

E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 (5)

where B is the total number of filters (B is 25 in the study) I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter and A_i[k] is the squared amplitude of X_i[k] that is A_i[k] = |X_i[k]|^2

Ibl and Ibh are given as

I_b^l = \frac{f_b^l}{f_s/N}, \quad I_b^h = \frac{f_b^h}{f_s/N} (6)

where fs is the sampling frequency fbl and fbh are the low frequency and

high frequency of the b-th band-pass filter as shown in Table 21

Step 6 Discrete cosine transform (DCT)

MFCC can be obtained by applying DCT on the logarithm of E(b)

MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\left(1 + E_i(b)\right) \cos\left(\frac{\pi l}{B}\left(b + 0.5\right)\right), \quad 0 \le l < L (7)

where L is the length of MFCC feature vector (L is 20 in the study)


Therefore the MFCC feature vector can be represented as follows

xMFCC = [MFCC(0) MFCC(1) … MFCC(L-1)]T (8)

Fig 21 The flowchart for computing MFCC
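The six steps above can be sketched per frame as follows; the band list here is a hypothetical stand-in for the Table 21 filters, and the bands are summed rectangularly per Eq (5) rather than with triangular weights:

```python
import numpy as np

def mfcc_frame(s, fs, bands, L=20, a=0.95):
    """One frame through Steps 1-6: pre-emphasis, Hamming window, FFT,
    band energies E(b) over (low, high) Hz intervals, then the DCT of
    log10(1 + E(b)) as in Eq (7)."""
    N = len(s)
    s_hat = np.append(s[0], s[1:] - a * s[:-1])                   # Eq (1)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Eq (3)
    A = np.abs(np.fft.fft(s_hat * w)) ** 2                        # Eq (4), A = |X|^2
    E = np.array([A[int(fl / (fs / N)):int(fh / (fs / N)) + 1].sum()
                  for fl, fh in bands])                           # Eqs (5)-(6)
    l = np.arange(L)[:, None]
    b = np.arange(len(bands))[None, :]
    return (np.log10(1 + E) *
            np.cos(np.pi * l * (b + 0.5) / len(bands))).sum(axis=1)  # Eq (7)

# A 440 Hz tone at fs = 22050 Hz with three illustrative bands
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 22050)
coeffs = mfcc_frame(frame, 22050, [(0, 200), (200, 400), (400, 800)], L=5)
```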



Table 21 The range of each triangular band-pass filter

Filter number Frequency interval (Hz)
0 (0, 200]
1 (100, 300]
2 (200, 400]
3 (300, 500]
4 (400, 600]
5 (500, 700]
6 (600, 800]
7 (700, 900]
8 (800, 1000]
9 (900, 1149]
10 (1000, 1320]
11 (1149, 1516]
12 (1320, 1741]
13 (1516, 2000]
14 (1741, 2297]
15 (2000, 2639]
16 (2297, 3031]
17 (2639, 3482]
18 (3031, 4000]
19 (3482, 4595]
20 (4000, 5278]
21 (4595, 6063]
22 (5278, 6964]
23 (6063, 8000]
24 (6964, 9190]

212 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal It

considers the spectral peak and valley in each subband independently In general

spectral peaks correspond to harmonic components and spectral valleys the

non-harmonic components or noise in music signals Therefore the difference

between spectral peaks and spectral valleys will reflect the spectral contrast

distribution Fig 22 shows the block diagram for extracting the OSC feature The

detailed steps will be described below


Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames and FFT is then applied to obtain the corresponding spectrum of each frame

Step 2 Octave Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave

scale filters shown in Table 22 The octave scale filtering operation can be

described as follows

E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 (9)

where B is the number of subbands I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter and A_i[k] is the squared amplitude of X_i[k] that is A_i[k] = |X_i[k]|^2

Ibl and Ibh are given as

I_b^l = \frac{f_b^l}{f_s/N}, \quad I_b^h = \frac{f_b^h}{f_s/N} (10)

where fs is the sampling frequency fbl and fbh are the low frequency and high

frequency of the b-th band-pass filter

Step 3 Peak Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband where N_b is the number of FFT frequency bins in the b-th subband Without loss of generality let the magnitude spectrum be sorted in decreasing order that is M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b} The spectral peak and

spectral valley in the b-th subband are then estimated as follows

Peak(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right) (11)

Valley(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right) (12)

where α is a neighborhood factor (α is 0.2 in this study) The spectral

contrast is given by the difference between the spectral peak and the spectral

valley

SC(b) = Peak(b) - Valley(b) (13)

The feature vector of an audio frame consists of the spectral contrasts and the

spectral valleys of all subbands Thus the OSC feature vector of an audio frame can

be represented as follows

xOSC = [Valley(0) … Valley(B-1) SC(0) … SC(B-1)]T (14)


Fig 22 The flowchart for computing OSC
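Under the assumption that `mags` holds the FFT magnitudes of one subband, the peak/valley selection of Eqs (11)-(13) can be sketched as:

```python
import numpy as np

def osc_subband(mags, alpha=0.2):
    """Valley(b) and SC(b) for one subband: log of the mean of the
    alpha*Nb largest (peak) and smallest (valley) magnitudes."""
    ordered = np.sort(mags)[::-1]               # M_b,1 >= M_b,2 >= ...
    k = max(1, int(round(alpha * len(mags))))   # neighborhood size alpha*Nb
    peak = np.log(ordered[:k].mean())
    valley = np.log(ordered[-k:].mean())
    return valley, peak - valley                # (Valley(b), SC(b))

# Strong harmonic peaks against a low noise floor give a large contrast
valley, sc = osc_subband(
    np.array([8.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5]))
```

Averaging a small neighborhood instead of taking the single largest or smallest bin makes the estimate robust to isolated spectral spikes.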


Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number Frequency interval (Hz)
0 [0, 0]
1 (0, 100]
2 (100, 200]
3 (200, 400]
4 (400, 800]
5 (800, 1600]
6 (1600, 3200]
7 (3200, 6400]
8 (6400, 12800]
9 (12800, 22050)

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification The NASE descriptor

provides a representation of the power spectrum of each audio frame Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband Fig 23 shows the block diagram for extracting the

NASE feature For a given music piece the main steps for computing NASE are

described as follows

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames and each audio frame is multiplied by a Hamming window function

and analyzed using FFT to derive its spectrum notated X(k) 1 ≤ k ≤ N

where N is the size of FFT The power spectrum is defined as the normalized

squared magnitude of the DFT spectrum X(k)

P(k) = \begin{cases} \dfrac{1}{N \cdot E_w}\,|X(k)|^2, & k = 0, \; N/2 \\ \dfrac{2}{N \cdot E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases} (15)


where Ew is the energy of the Hamming window function w(n) of size Nw

E_w = \sum_{n=0}^{N_w-1} |w(n)|^2 (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig 24) The NASE scale filtering operation can be described as follows (see Table 23)

ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k), \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r where r is the spectral resolution of the frequency subbands ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in the study)

r = 2^j \text{ octaves}, \quad -4 \le j \le 3 (18)

Ibl and Ibh are the low-frequency index and high-frequency index of the b-th

band-pass filter given as

I_b^l = \frac{f_b^l}{f_s/N}, \quad I_b^h = \frac{f_b^h}{f_s/N} (19)

where fs is the sampling frequency fbl and fbh are the low frequency and

high frequency of the b-th band-pass filter

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power

spectrum coefficients within this subband

ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k), \quad 0 \le b \le B+1 (20)

Each ASE coefficient is then converted to the decibel scale

ASE_{dB}(b) = 10\log_{10}\left(ASE(b)\right), \quad 0 \le b \le B+1 (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1} \left(ASE_{dB}(b)\right)^2} (23)

In MPEG-7 the ASE coefficients consist of one coefficient representing power

between 0 Hz and loEdge a series of coefficients representing power in

logarithmically spaced bands between loEdge and hiEdge a coefficient representing power above hiEdge and the RMS-norm gain value R Therefore the feature dimension

of NASE is B+3 Thus the NASE feature vector of an audio frame will be

represented as follows

xNASE = [R NASE(0) NASE(1) … NASE(B+1)]T (24)


Fig 23 The flowchart for computing NASE
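A sketch of Eqs (15)-(24) for one frame; `band_edges` is a hypothetical stand-in for the Table 23 subbands, and a small epsilon guards the logarithm for empty bands:

```python
import numpy as np

def nase_frame(frame, fs, band_edges):
    """Window-normalized power spectrum, per-band sums in dB, then
    RMS normalization: returns a [R, NASE(0), ...]-style vector."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                           # Eq (16)
    X = np.fft.fft(frame * w)
    P = np.abs(X[:N // 2 + 1]) ** 2 / (N * Ew)    # Eq (15)
    P[1:N // 2] *= 2                              # double the interior bins
    ase = np.array([P[int(fl / (fs / N)):int(fh / (fs / N)) + 1].sum()
                    for fl, fh in band_edges])    # Eq (20)
    ase_db = 10 * np.log10(ase + 1e-12)           # Eq (21), eps avoids log(0)
    R = np.sqrt(np.sum(ase_db ** 2))              # Eq (23)
    return np.concatenate([[R], ase_db / R])      # Eqs (22), (24)

# A 1 kHz tone at fs = 22050 Hz with three illustrative subbands
frame = np.sin(2 * np.pi * 1000 * np.arange(1024) / 22050)
out = nase_frame(frame, 22050, [(0, 500), (500, 2000), (2000, 8000)])
```

By construction the NASE components have unit RMS norm, so only their relative shape (not the overall level, which is carried by R) enters the feature.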

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (16 logarithmically spaced coefficients between loEdge = 62.5 Hz and hiEdge = 16 kHz plus one coefficient below loEdge and one above hiEdge)


Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number Frequency interval (Hz)
0 (0, 62]
1 (62, 88]
2 (88, 125]
3 (125, 176]
4 (176, 250]
5 (250, 353]
6 (353, 500]
7 (500, 707]
8 (707, 1000]
9 (1000, 1414]
10 (1414, 2000]
11 (2000, 2828]
12 (2828, 4000]
13 (4000, 5656]
14 (5656, 8000]
15 (8000, 11313]
16 (11313, 16000]
17 (16000, 22050]

214 Modulation Spectral Analysis

MFCC OSC and NASE capture only short-term frame-based characteristics of

audio signals In order to capture the time-varying behavior of music signals we employ modulation spectral analysis on MFCC OSC and NASE to observe the

variations of the sound

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis


Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times (W/2) + n}[l] \, e^{-j2\pi nm/W}, \quad 0 \le m < W, \; 0 \le l < L (25)

where Mt(m l) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and l is the MFCC coefficient index In

the study W is 512 which is about 6 seconds with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \; 0 \le l < L (26)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) (27)

MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) (28)

where Φjl and Φjh are respectively the low modulation frequency index and


high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) (29)

As a result all MSCs (or MSVs) will form an L × J matrix which contains the modulation spectral contrast information Therefore the feature dimension of MMFCC is 2 × 20 × 8 = 320

Fig 25 The flowchart for extracting MMFCC
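The three steps above can be sketched for a single feature trajectory; the subband index ranges below are illustrative assumptions, while the thesis uses the 8 logarithmic modulation subbands of Table 24:

```python
import numpy as np

def modulation_spectral_contrast(traj, W=512,
                                 subbands=((0, 2), (2, 4), (4, 8), (8, 16))):
    """MSC and MSV for one feature trajectory: FFT magnitudes over
    50%-overlapped texture windows of length W, averaged over windows,
    then max/min within each modulation subband (Eqs (25)-(29))."""
    hop = W // 2
    windows = [traj[s:s + W] for s in range(0, len(traj) - W + 1, hop)]
    M = np.mean([np.abs(np.fft.fft(win)) for win in windows], axis=0)
    msp = np.array([M[lo:hi].max() for lo, hi in subbands])
    msv = np.array([M[lo:hi].min() for lo, hi in subbands])
    return msp - msv, msv                          # (MSC, MSV) per subband

# A trajectory oscillating 4 times per window concentrates energy in bin 4
traj = np.sin(2 * np.pi * 4 * np.arange(1024) / 512)
msc, msv = modulation_spectral_contrast(traj)
```

In a full implementation this is repeated for every MFCC, OSC, or NASE coefficient, and the per-subband contrasts and valleys are stacked into the L × J (or D × J) feature matrix described above.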

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times (W/2) + n}[d] \, e^{-j2\pi nm/W}, \quad 0 \le m < W, \; 0 \le d < D (30)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the OSC coefficient index In the

study W is 512 which is about 6 seconds with 50% overlap between two

successive texture windows The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D (31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated


MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) (32)

MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) (33)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) (34)

As a result all MSCs (or MSVs) will form a D × J matrix which contains the modulation spectral contrast information Therefore the feature dimension of MOSC is 2 × 20 × 8 = 320

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times (W/2) + n}[d] \, e^{-j2\pi nm/W}, \quad 0 \le m < W, \; 0 \le d < D (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \; 0 \le d < D \quad (36)

where T is the total number of texture windows in the music track
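Eqs. (35)-(36) can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the thesis implementation: `F` holds the per-frame feature trajectories (e.g. NASE values), texture windows of length `W` hop by `W/2` (50% overlap), and only the non-negative frequency half of each FFT is kept.

```python
import numpy as np

def modulation_spectrogram(F, W=512):
    """F: (n_frames, D) per-frame feature trajectories.
    An FFT is applied along time within each texture window of length W
    (50% overlap); the magnitudes are averaged over all windows (Eqs. 35-36)."""
    hop = W // 2
    mags = [np.abs(np.fft.rfft(F[s:s + W, :], axis=0))
            for s in range(0, F.shape[0] - W + 1, hop)]
    return np.mean(mags, axis=0)
```

A feature trajectory oscillating at modulation bin m will produce a peak at row m of the result.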

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \quad (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \quad (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \quad (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

Fig. 2.7 The flowchart for extracting MASE (framing, NASE extraction, DFT along each feature trajectory, windowing/averaging of the modulation spectrum, and contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2)     | [0, 0.33)
1 | [2, 4)     | [0.33, 0.66)
2 | [4, 8)     | [0.66, 1.32)
3 | [8, 16)    | [1.32, 2.64)
4 | [16, 32)   | [2.64, 5.28)
5 | [32, 64)   | [5.28, 10.56)
6 | [64, 128)  | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]
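As a rough illustration of Table 2.4, the subband index boundaries double at each step. The helper names below are our own, hypothetical additions; only the index ranges come from the table.

```python
# Sketch of the logarithmic modulation-subband partition in Table 2.4:
# J = 8 subbands with index ranges [0,2), [2,4), [4,8), ..., [128,256).

def subband_edges(J=8):
    """Return the (low, high) modulation frequency index range of each subband."""
    edges = [(0, 2)]
    for j in range(1, J):
        lo = 2 ** j
        edges.append((lo, 2 * lo))
    return edges

def subband_of(m, J=8):
    """Map a modulation frequency index m to its subband number."""
    for j, (lo, hi) in enumerate(subband_edges(J)):
        if lo <= m < hi:
            return j
    return J - 1  # m == 2**J belongs to the last (closed) interval
```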

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u^{MFCC}_{MSC\text{-}row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \quad (40)

\sigma^{MFCC}_{MSC\text{-}row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - u^{MFCC}_{MSC\text{-}row}(l) \right)^2 \right)^{1/2} \quad (41)

u^{MFCC}_{MSV\text{-}row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \quad (42)

\sigma^{MFCC}_{MSV\text{-}row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - u^{MFCC}_{MSV\text{-}row}(l) \right)^2 \right)^{1/2} \quad (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f^{MFCC}_{row} = [u^{MFCC}_{MSC\text{-}row}(0), \sigma^{MFCC}_{MSC\text{-}row}(0), u^{MFCC}_{MSV\text{-}row}(0), \sigma^{MFCC}_{MSV\text{-}row}(0), \ldots, u^{MFCC}_{MSC\text{-}row}(L-1), \sigma^{MFCC}_{MSC\text{-}row}(L-1), u^{MFCC}_{MSV\text{-}row}(L-1), \sigma^{MFCC}_{MSV\text{-}row}(L-1)]^T \quad (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{MFCC}_{MSC\text{-}col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \quad (45)

\sigma^{MFCC}_{MSC\text{-}col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - u^{MFCC}_{MSC\text{-}col}(j) \right)^2 \right)^{1/2} \quad (46)

u^{MFCC}_{MSV\text{-}col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \quad (47)

\sigma^{MFCC}_{MSV\text{-}col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - u^{MFCC}_{MSV\text{-}col}(j) \right)^2 \right)^{1/2} \quad (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{MFCC}_{col} = [u^{MFCC}_{MSC\text{-}col}(0), \sigma^{MFCC}_{MSC\text{-}col}(0), u^{MFCC}_{MSV\text{-}col}(0), \sigma^{MFCC}_{MSV\text{-}col}(0), \ldots, u^{MFCC}_{MSC\text{-}col}(J-1), \sigma^{MFCC}_{MSC\text{-}col}(J-1), u^{MFCC}_{MSV\text{-}col}(J-1), \sigma^{MFCC}_{MSV\text{-}col}(J-1)]^T \quad (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T]^T \quad (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
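The row- and column-wise mean/standard-deviation aggregation of Eqs. (40)-(50) can be sketched as below. For simplicity this hypothetical sketch concatenates the four statistic blocks rather than interleaving the components as in Eqs. (44) and (49); only the ordering differs, the dimension (4L+4J, here with L = 20 and J = 8) is the same.

```python
import numpy as np

def aggregate(MSC, MSV):
    """MSC, MSV: (J, D) matrices. Returns a (4D + 4J)-dimensional vector of
    row-based and column-based means and standard deviations."""
    feats = []
    for M in (MSC, MSV):
        feats += [M.mean(axis=0), M.std(axis=0)]  # row-based: over the J subbands
    for M in (MSC, MSV):
        feats += [M.mean(axis=1), M.std(axis=1)]  # column-based: over the D dims
    return np.concatenate(feats)
```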

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u^{OSC}_{MSC\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d) \quad (51)

\sigma^{OSC}_{MSC\text{-}row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - u^{OSC}_{MSC\text{-}row}(d) \right)^2 \right)^{1/2} \quad (52)

u^{OSC}_{MSV\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d) \quad (53)

\sigma^{OSC}_{MSV\text{-}row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - u^{OSC}_{MSV\text{-}row}(d) \right)^2 \right)^{1/2} \quad (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{OSC}_{row} = [u^{OSC}_{MSC\text{-}row}(0), \sigma^{OSC}_{MSC\text{-}row}(0), u^{OSC}_{MSV\text{-}row}(0), \sigma^{OSC}_{MSV\text{-}row}(0), \ldots, u^{OSC}_{MSC\text{-}row}(D-1), \sigma^{OSC}_{MSC\text{-}row}(D-1), u^{OSC}_{MSV\text{-}row}(D-1), \sigma^{OSC}_{MSV\text{-}row}(D-1)]^T \quad (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{OSC}_{MSC\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d) \quad (56)

\sigma^{OSC}_{MSC\text{-}col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - u^{OSC}_{MSC\text{-}col}(j) \right)^2 \right)^{1/2} \quad (57)

u^{OSC}_{MSV\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d) \quad (58)

\sigma^{OSC}_{MSV\text{-}col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - u^{OSC}_{MSV\text{-}col}(j) \right)^2 \right)^{1/2} \quad (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{OSC}_{col} = [u^{OSC}_{MSC\text{-}col}(0), \sigma^{OSC}_{MSC\text{-}col}(0), u^{OSC}_{MSV\text{-}col}(0), \sigma^{OSC}_{MSV\text{-}col}(0), \ldots, u^{OSC}_{MSC\text{-}col}(J-1), \sigma^{OSC}_{MSC\text{-}col}(J-1), u^{OSC}_{MSV\text{-}col}(J-1), \sigma^{OSC}_{MSV\text{-}col}(J-1)]^T \quad (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T \quad (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u^{NASE}_{MSC\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d) \quad (62)

\sigma^{NASE}_{MSC\text{-}row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - u^{NASE}_{MSC\text{-}row}(d) \right)^2 \right)^{1/2} \quad (63)

u^{NASE}_{MSV\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d) \quad (64)

\sigma^{NASE}_{MSV\text{-}row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - u^{NASE}_{MSV\text{-}row}(d) \right)^2 \right)^{1/2} \quad (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{NASE}_{row} = [u^{NASE}_{MSC\text{-}row}(0), \sigma^{NASE}_{MSC\text{-}row}(0), u^{NASE}_{MSV\text{-}row}(0), \sigma^{NASE}_{MSV\text{-}row}(0), \ldots, u^{NASE}_{MSC\text{-}row}(D-1), \sigma^{NASE}_{MSC\text{-}row}(D-1), u^{NASE}_{MSV\text{-}row}(D-1), \sigma^{NASE}_{MSV\text{-}row}(D-1)]^T \quad (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{NASE}_{MSC\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d) \quad (67)

\sigma^{NASE}_{MSC\text{-}col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - u^{NASE}_{MSC\text{-}col}(j) \right)^2 \right)^{1/2} \quad (68)

u^{NASE}_{MSV\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d) \quad (69)

\sigma^{NASE}_{MSV\text{-}col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - u^{NASE}_{MSV\text{-}col}(j) \right)^2 \right)^{1/2} \quad (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{NASE}_{col} = [u^{NASE}_{MSC\text{-}col}(0), \sigma^{NASE}_{MSC\text{-}col}(0), u^{NASE}_{MSV\text{-}col}(0), \sigma^{NASE}_{MSV\text{-}col}(0), \ldots, u^{NASE}_{MSC\text{-}col}(J-1), \sigma^{NASE}_{MSC\text{-}col}(J-1), u^{NASE}_{MSV\text{-}col}(J-1), \sigma^{NASE}_{MSV\text{-}col}(J-1)]^T \quad (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f^{NASE}_{row})^T, (f^{NASE}_{col})^T]^T \quad (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 2.8 The row-based modulation spectral feature values (mean and standard deviation along each row of the MSC and MSV matrices, i.e. across modulation frequency)

Fig. 2.9 The column-based modulation spectral feature values (mean and standard deviation along each column of the MSC and MSV matrices, i.e. across feature dimension)

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n} \quad (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \quad (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m) \quad (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
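The min-max normalization of Eqs. (74)-(75) can be sketched as below. This is an illustrative sketch, not the thesis code; the per-dimension extrema must be taken from the training set and then reused for any vector to be normalized.

```python
import numpy as np

def fit_minmax(train_vectors):
    """train_vectors: (N, M) feature vectors of all training tracks.
    Returns a function applying Eq. (74) using the training extrema of Eq. (75)."""
    f_min = train_vectors.min(axis=0)
    f_max = train_vectors.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # guard constant dimensions
    return lambda f: (f - f_min) / span
```

The same normalizer is applied to test vectors in the classification phase.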

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T \quad (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T \quad (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = \mathrm{tr}\left( (A^T S_W A)^{-1} (A^T S_B A) \right) \quad (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues. Thus, S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

x_w = (\Phi \Lambda^{-1/2})^T x \quad (79)

It can be shown that the whitened within-class scatter matrix S_{W,w} = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_{B,w} = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B,w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi \quad (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x \quad (81)
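The whitening-plus-LDA procedure of Eqs. (76)-(80) can be sketched with NumPy's eigendecomposition as below. This is an illustrative sketch, not the thesis implementation, and it assumes a nonsingular S_W.

```python
import numpy as np

def whitened_lda(X, y):
    """X: (N, H) training vectors; y: (N,) class labels.
    Returns A_WLDA of shape (H, C-1), the whitened LDA transform of Eq. (80)."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)            # Eq. (76): within-class scatter
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                # Eq. (77): between-class scatter
    lam, Phi = np.linalg.eigh(Sw)                # Sw = Phi diag(lam) Phi^T
    white = Phi @ np.diag(1.0 / np.sqrt(lam))    # whitening transform Phi Lam^{-1/2}
    Sb_w = white.T @ Sb @ white                  # whitened between-class scatter
    vals, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(vals)[::-1][:len(classes) - 1]
    return white @ Psi[:, order]                 # Eq. (80): A = Phi Lam^{-1/2} Psi
```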

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n} \quad (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) \quad (83)
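The nearest centroid decision of Eqs. (82)-(83) can be sketched as follows (an illustrative sketch operating on already-transformed vectors, not the thesis code):

```python
import numpy as np

def nearest_centroid(train_X, train_y, test_x):
    """Compute class centroids in the transformed space (Eq. 82), then return
    the class whose centroid has minimum Euclidean distance to test_x (Eq. 83)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(centroids - test_x, axis=1)
    return classes[np.argmin(dists)]
```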

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c \quad (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
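Eq. (84) is simply a count-weighted average of the per-class accuracies; as a minimal sketch:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): overall CA as the appearance-probability-weighted average of
    per-class accuracies (the six genres are unevenly distributed)."""
    total = sum(class_counts)
    return sum((n / total) * a for a, n in zip(per_class_acc, class_counts))
```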

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, in %) for the row-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Rows are predicted classes and columns are actual classes (Classic, Electronic, Jazz, Metal/Punk, Pop/Rock, World); each entry gives the track count with the column-wise percentage in parentheses.

(a) SMMFCC1
Classic    | 275 (85.94) | 0 (0.00)   | 2 (7.69)   | 0 (0.00)   | 1 (0.98)   | 19 (15.57)
Electronic | 0 (0.00)    | 91 (79.82) | 0 (0.00)   | 1 (2.22)   | 7 (6.86)   | 6 (4.92)
Jazz       | 6 (1.88)    | 0 (0.00)   | 18 (69.23) | 0 (0.00)   | 0 (0.00)   | 4 (3.28)
Metal/Punk | 2 (0.63)    | 3 (2.63)   | 0 (0.00)   | 36 (80.00) | 20 (19.61) | 4 (3.28)
Pop/Rock   | 4 (1.25)    | 12 (10.53) | 5 (19.23)  | 8 (17.78)  | 70 (68.63) | 14 (11.48)
World      | 33 (10.31)  | 8 (7.02)   | 1 (3.85)   | 0 (0.00)   | 4 (3.92)   | 75 (61.48)
Total      | 320         | 114        | 26         | 45         | 102        | 122

(b) SMOSC1
Classic    | 292 (91.25) | 1 (0.88)   | 1 (3.85)   | 0 (0.00)   | 2 (1.96)   | 10 (8.20)
Electronic | 1 (0.31)    | 89 (78.07) | 1 (3.85)   | 2 (4.44)   | 11 (10.78) | 11 (9.02)
Jazz       | 4 (1.25)    | 0 (0.00)   | 19 (73.08) | 1 (2.22)   | 1 (0.98)   | 6 (4.92)
Metal/Punk | 0 (0.00)    | 5 (4.39)   | 0 (0.00)   | 32 (71.11) | 21 (20.59) | 3 (2.46)
Pop/Rock   | 0 (0.00)    | 13 (11.40) | 3 (11.54)  | 10 (22.22) | 61 (59.80) | 8 (6.56)
World      | 23 (7.19)   | 6 (5.26)   | 2 (7.69)   | 0 (0.00)   | 6 (5.88)   | 84 (68.85)
Total      | 320         | 114        | 26         | 45         | 102        | 122

(c) SMASE1
Classic    | 286 (89.38) | 3 (2.63)   | 1 (3.85)   | 0 (0.00)   | 3 (2.94)   | 18 (14.75)
Electronic | 0 (0.00)    | 87 (76.32) | 1 (3.85)   | 1 (2.22)   | 9 (8.82)   | 5 (4.10)
Jazz       | 5 (1.56)    | 4 (3.51)   | 17 (65.38) | 0 (0.00)   | 0 (0.00)   | 9 (7.38)
Metal/Punk | 0 (0.00)    | 4 (3.51)   | 1 (3.85)   | 36 (80.00) | 18 (17.65) | 4 (3.28)
Pop/Rock   | 1 (0.31)    | 10 (8.77)  | 3 (11.54)  | 7 (15.56)  | 68 (66.67) | 13 (10.66)
World      | 28 (8.75)   | 6 (5.26)   | 3 (11.54)  | 1 (2.22)   | 4 (3.92)   | 73 (59.84)
Total      | 320         | 114        | 26         | 45         | 102        | 122

(d) SMMFCC1+SMOSC1+SMASE1
Classic    | 300 (93.75) | 0 (0.00)   | 1 (3.85)   | 0 (0.00)   | 0 (0.00)   | 9 (7.38)
Electronic | 0 (0.00)    | 96 (84.21) | 1 (3.85)   | 1 (2.22)   | 9 (8.82)   | 9 (7.38)
Jazz       | 2 (0.63)    | 1 (0.88)   | 21 (80.77) | 0 (0.00)   | 0 (0.00)   | 1 (0.82)
Metal/Punk | 0 (0.00)    | 1 (0.88)   | 0 (0.00)   | 34 (75.56) | 8 (7.84)   | 1 (0.82)
Pop/Rock   | 1 (0.31)    | 9 (7.89)   | 2 (7.69)   | 9 (20.00)  | 80 (78.43) | 16 (13.11)
World      | 17 (5.31)   | 7 (6.14)   | 1 (3.85)   | 1 (2.22)   | 5 (4.90)   | 86 (70.49)
Total      | 320         | 114        | 26         | 45         | 102        | 122

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, in %) for the column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Rows are predicted classes and columns are actual classes (Classic, Electronic, Jazz, Metal/Punk, Pop/Rock, World); each entry gives the track count with the column-wise percentage in parentheses. Column totals are 320, 114, 26, 45, 102, and 122 tracks, respectively.

(a) SMMFCC2
Classic    | 272 (85.00) | 1 (0.88)   | 1 (3.85)   | 0 (0.00)   | 6 (5.88)   | 22 (18.03)
Electronic | 0 (0.00)    | 84 (73.68) | 0 (0.00)   | 2 (4.44)   | 8 (7.84)   | 4 (3.28)
Jazz       | 13 (4.06)   | 1 (0.88)   | 19 (73.08) | 1 (2.22)   | 2 (1.96)   | 19 (15.57)
Metal/Punk | 2 (0.63)    | 7 (6.14)   | 0 (0.00)   | 39 (86.67) | 30 (29.41) | 4 (3.28)
Pop/Rock   | 0 (0.00)    | 11 (9.65)  | 3 (11.54)  | 3 (6.67)   | 47 (46.08) | 19 (15.57)
World      | 33 (10.31)  | 10 (8.77)  | 3 (11.54)  | 0 (0.00)   | 9 (8.82)   | 54 (44.26)

(b) SMOSC2
Classic    | 262 (81.88) | 2 (1.75)   | 0 (0.00)   | 0 (0.00)   | 3 (2.94)   | 33 (27.05)
Electronic | 0 (0.00)    | 83 (72.81) | 0 (0.00)   | 1 (2.22)   | 9 (8.82)   | 6 (4.92)
Jazz       | 17 (5.31)   | 1 (0.88)   | 20 (76.92) | 0 (0.00)   | 6 (5.88)   | 20 (16.39)
Metal/Punk | 1 (0.31)    | 5 (4.39)   | 0 (0.00)   | 33 (73.33) | 21 (20.59) | 2 (1.64)
Pop/Rock   | 0 (0.00)    | 17 (14.91) | 4 (15.38)  | 10 (22.22) | 51 (50.00) | 10 (8.20)
World      | 40 (12.50)  | 6 (5.26)   | 2 (7.69)   | 1 (2.22)   | 12 (11.76) | 51 (41.80)

(c) SMASE2
Classic    | 277 (86.56) | 0 (0.00)   | 0 (0.00)   | 0 (0.00)   | 2 (1.96)   | 29 (23.77)
Electronic | 0 (0.00)    | 83 (72.81) | 0 (0.00)   | 1 (2.22)   | 5 (4.90)   | 2 (1.64)
Jazz       | 9 (2.81)    | 3 (2.63)   | 17 (65.38) | 1 (2.22)   | 2 (1.96)   | 15 (12.30)
Metal/Punk | 1 (0.31)    | 5 (4.39)   | 1 (3.85)   | 35 (77.78) | 24 (23.53) | 7 (5.74)
Pop/Rock   | 2 (0.63)    | 13 (11.40) | 1 (3.85)   | 8 (17.78)  | 57 (55.88) | 15 (12.30)
World      | 31 (9.69)   | 10 (8.77)  | 7 (26.92)  | 0 (0.00)   | 12 (11.76) | 54 (44.26)

(d) SMMFCC2+SMOSC2+SMASE2
Classic    | 289 (90.31) | 5 (4.39)   | 0 (0.00)   | 0 (0.00)   | 3 (2.94)   | 18 (14.75)
Electronic | 0 (0.00)    | 89 (78.07) | 0 (0.00)   | 2 (4.44)   | 4 (3.92)   | 4 (3.28)
Jazz       | 2 (0.63)    | 3 (2.63)   | 19 (73.08) | 0 (0.00)   | 1 (0.98)   | 10 (8.20)
Metal/Punk | 2 (0.63)    | 2 (1.75)   | 0 (0.00)   | 38 (84.44) | 21 (20.59) | 2 (1.64)
Pop/Rock   | 0 (0.00)    | 12 (10.53) | 5 (19.23)  | 4 (8.89)   | 61 (59.80) | 11 (9.02)
World      | 27 (8.44)   | 3 (2.63)   | 2 (7.69)   | 1 (2.22)   | 12 (11.76) | 77 (63.11)

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, in %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32

Table 3.6 Confusion matrices of the combined row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Rows are predicted classes and columns are actual classes (Classic, Electronic, Jazz, Metal/Punk, Pop/Rock, World); each entry gives the track count with the column-wise percentage in parentheses. Column totals are 320, 114, 26, 45, 102, and 122 tracks, respectively.

(a) SMMFCC3
Classic    | 300 (93.75) | 2 (1.75)   | 1 (3.85)   | 0 (0.00)   | 3 (2.94)   | 19 (15.57)
Electronic | 0 (0.00)    | 86 (75.44) | 0 (0.00)   | 1 (2.22)   | 7 (6.86)   | 5 (4.10)
Jazz       | 2 (0.63)    | 0 (0.00)   | 18 (69.23) | 0 (0.00)   | 0 (0.00)   | 3 (2.46)
Metal/Punk | 1 (0.31)    | 4 (3.51)   | 0 (0.00)   | 35 (77.78) | 18 (17.65) | 2 (1.64)
Pop/Rock   | 1 (0.31)    | 16 (14.04) | 4 (15.38)  | 8 (17.78)  | 67 (65.69) | 13 (10.66)
World      | 16 (5.00)   | 6 (5.26)   | 3 (11.54)  | 1 (2.22)   | 7 (6.86)   | 80 (65.57)

(b) SMOSC3
Classic    | 300 (93.75) | 0 (0.00)   | 0 (0.00)   | 0 (0.00)   | 1 (0.98)   | 13 (10.66)
Electronic | 0 (0.00)    | 90 (78.95) | 1 (3.85)   | 2 (4.44)   | 9 (8.82)   | 6 (4.92)
Jazz       | 0 (0.00)    | 0 (0.00)   | 21 (80.77) | 0 (0.00)   | 0 (0.00)   | 4 (3.28)
Metal/Punk | 0 (0.00)    | 2 (1.75)   | 0 (0.00)   | 31 (68.89) | 21 (20.59) | 2 (1.64)
Pop/Rock   | 0 (0.00)    | 11 (9.65)  | 3 (11.54)  | 10 (22.22) | 64 (62.75) | 10 (8.20)
World      | 20 (6.25)   | 11 (9.65)  | 1 (3.85)   | 2 (4.44)   | 7 (6.86)   | 87 (71.31)

(c) SMASE3
Classic    | 296 (92.50) | 2 (1.75)   | 1 (3.85)   | 0 (0.00)   | 0 (0.00)   | 17 (13.93)
Electronic | 1 (0.31)    | 91 (79.82) | 0 (0.00)   | 1 (2.22)   | 4 (3.92)   | 3 (2.46)
Jazz       | 0 (0.00)    | 2 (1.75)   | 19 (73.08) | 0 (0.00)   | 0 (0.00)   | 5 (4.10)
Metal/Punk | 0 (0.00)    | 2 (1.75)   | 1 (3.85)   | 34 (75.56) | 20 (19.61) | 8 (6.56)
Pop/Rock   | 2 (0.63)    | 13 (11.40) | 4 (15.38)  | 8 (17.78)  | 71 (69.61) | 8 (6.56)
World      | 21 (6.56)   | 4 (3.51)   | 1 (3.85)   | 2 (4.44)   | 7 (6.86)   | 81 (66.39)

(d) SMMFCC3+SMOSC3+SMASE3
Classic    | 300 (93.75) | 2 (1.75)   | 0 (0.00)   | 0 (0.00)   | 0 (0.00)   | 8 (6.56)
Electronic | 2 (0.63)    | 95 (83.33) | 0 (0.00)   | 2 (4.44)   | 7 (6.86)   | 9 (7.38)
Jazz       | 1 (0.31)    | 1 (0.88)   | 20 (76.92) | 0 (0.00)   | 0 (0.00)   | 0 (0.00)
Metal/Punk | 0 (0.00)    | 0 (0.00)   | 0 (0.00)   | 35 (77.78) | 10 (9.80)  | 1 (0.82)
Pop/Rock   | 1 (0.31)    | 10 (8.77)  | 3 (11.54)  | 7 (15.56)  | 79 (77.45) | 11 (9.02)
World      | 16 (5.00)   | 6 (5.26)   | 3 (11.54)  | 1 (2.22)   | 6 (5.88)   | 93 (76.23)

Conventional methods use the energy of each modulation subband as the feature value, whereas we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature set

Feature Set | MSCs & MSVs | MSE
SMMFCC1 | 77.50 | 72.02
SMMFCC2 | 70.64 | 69.82
SMMFCC3 | 80.38 | 79.15
SMOSC1 | 79.15 | 77.50
SMOSC2 | 68.59 | 70.51
SMOSC3 | 81.34 | 80.11
SMASE1 | 77.78 | 76.41
SMASE2 | 71.74 | 71.06
SMASE3 | 81.21 | 79.15
SMMFCC1+SMOSC1+SMASE1 | 84.64 | 85.08
SMMFCC2+SMOSC2+SMASE2 | 78.60 | 79.01
SMMFCC3+SMOSC3+SMASE3 | 85.32 | 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West, S Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664


[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred, A Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li, M Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 5, pp. 197-200, March 2005.

[16] J J Aucouturier, F Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies, M D Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis, A Ermolinskyi, P Cook, "Pitch histograms in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T Tolonen, M Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000.

[22] R Meddis, L O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, September 1997.

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139


Chapter 1

Introduction

1.1 Motivation

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, it is helpful in managing a vast amount of music tracks when they are properly categorized in advance. In general, retail or online music stores often organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers, but determining the music genre of a music track manually is a laborious and time-consuming task. Therefore, a number of supervised classification techniques have been developed for automatic classification of unlabeled music tracks [1-11]. Thus, in this study, we focus on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies try to examine a set of classifiers to improve the classification performance. However, the resulting improvement is limited. In fact, employing an effective feature set has a much greater impact on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from the row-based and the column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy that music pieces will be classified into. However, it is hard to clearly define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) People often prefer to search for music by browsing hierarchical catalogs. (2) Taxonomy structures identify the relationships or dependences between the music genres; thus, hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves the classification efficiency and accuracy. (3) The classification errors become more acceptable when using a taxonomy than with direct music genre classification, since the coarse-to-fine approach can concentrate the classification errors on a given level of the hierarchy.

Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and different features can be employed at each branch point of the taxonomy. The hierarchical classification approach therefore allows the managers to trace at which level the classification errors occur frequently. Barbedo and Lopes [14] also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed bottom-up instead of top-down, because it is easy to merge leaf classes into the same parent class in a bottom-up structure, so the upper layers can be easily constructed. In their experimental results, the hierarchical bottom-up approach outperforms the top-down approach by about 3-5%.

Li and Ogihara [15] investigated the effect of two different taxonomy structures for music genre classification. They also proposed an approach to the automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach can reduce the time-consuming and expensive task of manual construction of taxonomies; it also helps in handling music collections for which there are no natural taxonomies [16]. Given a genre taxonomy, many different approaches have been proposed to classify the music genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy

Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system

1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, covering timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentations or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows:

(1) Low-Energy Feature: it is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\,\right|

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
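For illustration, the frame-level ZCR can be computed with numpy (a minimal sketch; the function name is ours, and the 0/1 sign convention follows the definition above):

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame: half the summed absolute differences of sign values,
    where sign() is taken as 1 for positive samples and 0 otherwise."""
    signs = (frame > 0).astype(int)
    return 0.5 * float(np.sum(np.abs(np.diff(signs))))
```

A frame that alternates in sign every sample yields the maximum ZCR, while a frame that never changes sign yields 0.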

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

C_t = \frac{\sum_{n=1}^{N} n \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth measures the spread of the spectrum around the spectral centroid:

SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}
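The centroid and bandwidth definitions translate directly into numpy (a sketch; `mag` is one frame's magnitude spectrum, with bins indexed 1..N as in the formulas):

```python
import numpy as np

def spectral_centroid(mag):
    """Centroid C_t of a magnitude spectrum (bins indexed 1..N)."""
    n = np.arange(1, len(mag) + 1)
    return float(np.sum(n * mag) / np.sum(mag))

def spectral_bandwidth(mag):
    """Bandwidth SB_t: magnitude-weighted squared deviation from the centroid."""
    n = np.arange(1, len(mag) + 1)
    c = spectral_centroid(mag)
    return float(np.sum(((n - c) ** 2) * mag) / np.sum(mag))
```

A spectrum with all its energy in a single bin has zero bandwidth, and its centroid is that bin's index.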

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

\sum_{k=0}^{R_t} S_t[k] = 0.85 \times \sum_{k=0}^{N-1} S_t[k]
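A sketch of the roll-off computation (returns the bin index R_t at which the cumulative magnitude first reaches 85% of the total; the function name is ours):

```python
import numpy as np

def spectral_rolloff(mag, ratio=0.85):
    """Smallest bin index R_t whose cumulative magnitude reaches
    ratio (85% by default) of the total magnitude."""
    cum = np.cumsum(mag)
    return int(np.searchsorted(cum, ratio * cum[-1]))
```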

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
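The flux definition can be sketched as follows (a sketch; here "normalized magnitudes" is read as sum-normalization of each frame's spectrum, which is one common choice):

```python
import numpy as np

def spectral_flux(mag_t, mag_prev):
    """SF_t: squared difference between the sum-normalized magnitude
    spectra of two successive frames."""
    n_t = mag_t / np.sum(mag_t)
    n_p = mag_prev / np.sum(mag_prev)
    return float(np.sum((n_t - n_p) ** 2))
```

Identical successive frames give zero flux; a spectrum whose energy jumps between bins gives a large flux.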

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone: the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each sub-band separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find the representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, the autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most used method to integrate the short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1

\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d] - \mu[d]\right)^2\right]^{1/2}, \quad 0 \le d \le D-1

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features, nor about the time-varying behavior of music signals.
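This integration step amounts to column-wise statistics over the T x D matrix of frame features, e.g. (a minimal sketch; the function name is ours):

```python
import numpy as np

def mean_std_summary(frames):
    """Integrate a T x D matrix of short-term feature vectors into one
    2D-dimensional long-term vector [mu[0..D-1], sigma[0..D-1]]."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)  # population std (1/T), matching the formula above
    return np.concatenate([mu, sigma])
```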

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used an AR model to analyze the time-varying texture of music signals. They proposed the diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analyses to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by one MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature vector.

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification; they showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes, and the optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all the classes, which does not consider the class-wise differences.
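The scatter-matrix formulation behind LDA can be sketched as follows (a sketch assuming the within-class scatter matrix is invertible; the detailed steps used in this thesis are those of Chapter 2):

```python
import numpy as np

def lda_matrix(X, y, d):
    """Return an n x d transformation maximizing between-class scatter
    relative to within-class scatter (generalized eigenvalue problem)."""
    n = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((n, n))  # within-class scatter
    Sb = np.zeros((n, n))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Columns of the transformation are the leading eigenvectors of Sw^-1 Sb.
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:d]].real
```

Projecting the data with the returned matrix (X @ W) yields the d-dimensional discriminative representation.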

1.2.3 Feature Classifiers

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet; in Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote over frames decides the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree classifier, of a Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and decision tree classifier achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system in which some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which cannot be correctly classified, and each music genre's GMM model is updated for each correctly classified frame. Moreover, a separate GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from these high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals; birds and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The DWPT is a variant of the DWT achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the DWPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, conclusions are given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. Each module is described in detail below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) and cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal; the detailed steps are given below.

Step 1 Pre-emphasis

\hat{s}[n] = s[n] - a \times s[n-1]  (1)

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for a is 0.95.

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3 Windowing

Each frame is multiplied by a Hamming window:

\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n \le N-1  (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1  (3)


Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j2\pi kn/N}, \quad 0 \le k \le N-1  (4)

where k is the frequency index.

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B  (5)

where B is the total number of filters (B is 25 in this study), and I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b_l} and I_{b_h} are given as

I_{b_l} = \frac{f_{b_l}}{f_s/N}, \quad I_{b_h} = \frac{f_{b_h}}{f_s/N}  (6)

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6 Discrete cosine transform (DCT)

MFCC can be obtained by applying the DCT to the logarithm of E(b):

MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\left(1 + E_i(b)\right)\cos\!\left(\frac{\pi l (b + 0.5)}{B}\right), \quad 0 \le l < L  (7)

where L is the length of the MFCC feature vector (L is 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

x_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T  (8)

Fig. 2.1 The flowchart for computing MFCC: input signal, pre-emphasis, framing, windowing, FFT, Mel-scale band-pass filtering, DCT, MFCC

Table 2.1 The frequency range of each triangular band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 200]
1   (100, 300]
2   (200, 400]
3   (300, 500]
4   (400, 600]
5   (500, 700]
6   (600, 800]
7   (700, 900]
8   (800, 1000]
9   (900, 1149]
10  (1000, 1320]
11  (1149, 1516]
12  (1320, 1741]
13  (1516, 2000]
14  (1741, 2297]
15  (2000, 2639]
16  (2297, 3031]
17  (2639, 3482]
18  (3031, 4000]
19  (3482, 4595]
20  (4000, 5278]
21  (4595, 6063]
22  (5278, 6964]
23  (6063, 8000]
24  (6964, 9190]
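Steps 1-6 can be sketched end-to-end for a single frame (a simplified sketch, not the exact thesis implementation: each band simply sums the squared spectrum over its frequency range as in Eq. (5), and the band edges, e.g. those of Table 2.1, are passed in):

```python
import numpy as np

def mfcc_frame(s, fs, band_edges, L=20, a=0.95):
    """MFCC of one frame. band_edges: (f_low, f_high) per band-pass filter."""
    N = len(s)
    s_hat = np.append(s[0], s[1:] - a * s[:-1])                   # Step 1: pre-emphasis
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Step 3: Hamming
    A = np.abs(np.fft.fft(s_hat * w)) ** 2                        # Step 4: squared spectrum
    E = np.array([A[int(fl / (fs / N)): int(fh / (fs / N)) + 1].sum()
                  for fl, fh in band_edges])                      # Step 5: Eqs. (5)-(6)
    B = len(band_edges)
    b = np.arange(B)
    return np.array([np.sum(np.log10(1.0 + E) * np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])                          # Step 6: Eq. (7)
```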

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature; the detailed steps are described below.


Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2 Octave-Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B  (9)

where B is the number of subbands, and I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b_l} and I_{b_h} are given as

I_{b_l} = \frac{f_{b_l}}{f_s/N}, \quad I_{b_h} = \frac{f_{b_h}}{f_s/N}  (10)

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, ..., M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)  (11)

Valley(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)  (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)  (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T  (14)

Fig. 2.2 The flowchart for computing OSC: input signal, framing, FFT, octave-scale filtering, peak/valley selection, spectral contrast, OSC

Table 2.2 The frequency range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number  Frequency interval (Hz)
0   [0, 0]
1   (0, 100]
2   (100, 200]
3   (200, 400]
4   (400, 800]
5   (800, 1600]
6   (1600, 3200]
7   (3200, 6400]
8   (6400, 12800]
9   (12800, 22050)
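Steps 2-3 can be sketched for one frame as follows (the log form follows Eqs. (11)-(12) as written above; the function name and the per-band bin ranges passed in are ours):

```python
import numpy as np

def osc_frame(mag, band_bins, alpha=0.2):
    """OSC of one frame. mag: magnitude spectrum; band_bins: (low, high)
    FFT-bin index pairs of the octave bands. Returns [valleys, contrasts]."""
    valleys, contrasts = [], []
    for lo, hi in band_bins:
        band = np.sort(mag[lo:hi + 1])[::-1]       # M_b,1 >= M_b,2 >= ...
        k = max(1, int(round(alpha * len(band))))  # alpha * N_b strongest/weakest bins
        peak = np.log(np.mean(band[:k]))           # Eq. (11)
        valley = np.log(np.mean(band[-k:]))        # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)            # Eq. (13)
    return np.array(valleys + contrasts)           # Eq. (14) layout
```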

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows:

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames. Each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 0 ≤ k < N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = \begin{cases} \dfrac{1}{E_w N^2}\,|X(k)|^2, & k = 0 \\ \dfrac{2}{E_w N^2}\,|X(k)|^2, & 0 < k < N/2 \end{cases}  (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w-1} |w(n)|^2  (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE-scale filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B  (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves}, \quad -4 \le j \le 3  (18)

I_{b_l} and I_{b_h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_{b_l} = \frac{f_{b_l}}{f_s/N}, \quad I_{b_h} = \frac{f_{b_h}}{f_s/N}  (19)

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1  (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10\log_{10}\!\left(ASE(b)\right), \quad 0 \le b \le B+1  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1  (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1}\left(ASE_{dB}(b)\right)^2}  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame can be represented as follows:

x_{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T  (24)

Fig. 2.3 The flowchart for computing NASE: input signal, framing, windowing, FFT, subband decomposition, normalized audio spectral envelope, NASE

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (1 coefficient below loEdge = 62.5 Hz, 16 coefficients between loEdge and hiEdge = 16 kHz, and 1 coefficient above hiEdge)


Table 2.3 The frequency range of each normalized audio spectral envelope band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 62]
1   (62, 88]
2   (88, 125]
3   (125, 176]
4   (176, 250]
5   (250, 353]
6   (353, 500]
7   (500, 707]
8   (707, 1000]
9   (1000, 1414]
10  (1414, 2000]
11  (2000, 2828]
12  (2828, 4000]
13  (4000, 5656]
14  (5656, 8000]
15  (8000, 11313]
16  (11313, 16000]
17  (16000, 22050]
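The NASE steps can be sketched for one frame as follows (a sketch following Eqs. (15)-(24); the small floor constant inside the logarithm is our addition to guard against empty bands, and the band edges, e.g. those of Table 2.3, are passed in):

```python
import numpy as np

def nase_frame(s, fs, band_edges):
    """NASE of one frame. band_edges: (f_low, f_high) pairs of the
    logarithmic subbands. Returns [R, NASE(0), ..., NASE(B+1)]."""
    N = len(s)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                                   # Eq. (16)
    P = np.abs(np.fft.fft(s * w)) ** 2 / (Ew * N ** 2)
    P[1:N // 2] *= 2.0                                    # Eq. (15): one-sided spectrum
    ase = np.array([P[int(fl / (fs / N)): int(fh / (fs / N)) + 1].sum()
                    for fl, fh in band_edges])            # Eqs. (17), (20)
    ase_db = 10.0 * np.log10(ase + 1e-12)                 # Eq. (21); floor is ours
    R = np.sqrt(np.sum(ase_db ** 2))                      # Eq. (23)
    return np.concatenate([[R], ase_db / R])              # Eqs. (22), (24)
```

By construction, the NASE part of the returned vector has unit RMS norm, which is exactly what the gain value R factors out.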

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only the short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC; the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

25

Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \left|\sum_{n=0}^{W-1} MFCC_{tW/2+n}[l]\, e^{-j2\pi nm/W}\right|, \quad 0 \le m < W,\ 0 \le l < L \quad (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, l), \quad 0 \le m < W,\ 0 \le l < L \quad (26)

where T is the total number of texture windows in the music track
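As a sketch, the two steps above, Eqs. (25) and (26), can be written with NumPy's FFT; the (frames × L) feature matrix, the texture-window length W, and the 50% hop are the only assumptions:

```python
import numpy as np

def modulation_spectrogram(feat, W=512):
    """Averaged magnitude modulation spectrogram of a frame-level feature
    trajectory, per Eqs. (25)-(26).

    feat : (num_frames, L) array of per-frame feature values (MFCC/OSC/NASE);
           num_frames must cover at least one full texture window.
    W    : texture-window length in frames; windows overlap by 50%.
    Returns a (W, L) array whose entry [m, l] is the time-averaged magnitude
    at modulation-frequency index m for feature dimension l.
    """
    num_frames, L = feat.shape
    hop = W // 2
    T = (num_frames - W) // hop + 1              # number of texture windows
    acc = np.zeros((W, L))
    for t in range(T):
        seg = feat[t * hop: t * hop + W]         # frames of the t-th window
        acc += np.abs(np.fft.fft(seg, axis=0))   # FFT along each trajectory
    return acc / T
```

Feeding a feature trajectory that oscillates with a period of 64 frames, for example, produces a pronounced peak at modulation-frequency index W/64 = 8.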

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \quad (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \quad (28)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J.

The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \quad (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Since both the MSC and MSV matrices are retained, the feature dimension of MMFCC is 2×20×8 = 320.
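The peak/valley search of Eqs. (27)-(29) over the subbands of Table 24 can be sketched as follows; the (W, L) input shape is an assumption matching the output of Eq. (26):

```python
import numpy as np

def modulation_contrast(M_bar, J=8):
    """Per-subband modulation spectral peak, valley, and contrast,
    per Eqs. (27)-(29).

    M_bar : (W, L) averaged magnitude modulation spectrogram.
    Returns (MSP, MSV, MSC), each a (J, L) array; subband j spans the
    modulation-frequency index ranges of Table 24: [0,2), [2,4), ..., [128,256).
    """
    W, L = M_bar.shape
    MSP = np.empty((J, L))
    MSV = np.empty((J, L))
    for j in range(J):
        lo = 0 if j == 0 else 2 ** j
        hi = 2 ** (j + 1)
        band = M_bar[lo:hi]           # rows of the j-th modulation subband
        MSP[j] = band.max(axis=0)     # dominant (rhythmic) component
        MSV[j] = band.min(axis=0)     # non-rhythmic floor
    return MSP, MSV, MSP - MSV
```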

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 \le d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left|\sum_{n=0}^{W-1} OSC_{tW/2+n}[d]\, e^{-j2\pi nm/W}\right|, \quad 0 \le m < W,\ 0 \le d < D \quad (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W,\ 0 \le d < D \quad (31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \quad (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \quad (33)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J.

The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \quad (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Since both the MSC and MSV matrices are retained, the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 \le d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left|\sum_{n=0}^{W-1} NASE_{tW/2+n}[d]\, e^{-j2\pi nm/W}\right|, \quad 0 \le m < W,\ 0 \le d < D \quad (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W,\ 0 \le d < D \quad (36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \quad (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \quad (38)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J.

The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \quad (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Since both the MSC and MSV matrices are retained, the feature dimension of MASE is 2×19×8 = 304.

Fig 27 The flowchart for extracting MASE (the music signal is framed, NASE is extracted per frame, the DFT is applied along each feature trajectory within overlapping texture windows, the magnitude modulation spectra are averaged, and contrast/valley determination follows)

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
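The Hz intervals of Table 24 correspond to a modulation-frequency resolution of frame_rate/W per index. The frame rate is not stated explicitly in this section; a value of about 84.48 Hz (512 frames per roughly 6-second texture window) is an assumption inferred from the table, and under it the boundaries can be reproduced:

```python
# Hz value of modulation-frequency index m, assuming a frame rate of
# ~84.48 Hz (W = 512 frames per ~6 s texture window) -- an assumption
# inferred from Table 24, not stated explicitly in the text.
FRAME_RATE = 84.48
W = 512

def mod_freq_hz(m):
    return m * FRAME_RATE / W

# Index boundaries of the 8 subbands: [0,2), [2,4), ..., [128,256)
bounds = [0] + [2 ** (j + 1) for j in range(8)]
print([round(mod_freq_hz(b), 2) for b in bounds])
```

This prints the subband boundaries 0, 0.33, 0.66, 1.32, ..., 42.24 Hz, matching the right-hand column of Table 24.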

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l lt L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

\mu_{MSC-row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \quad (40)

\sigma_{MSC-row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC-row}^{MFCC}(l)\right)^2\right)^{1/2} \quad (41)

\mu_{MSV-row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \quad (42)

\sigma_{MSV-row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV-row}^{MFCC}(l)\right)^2\right)^{1/2} \quad (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), \mu_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), \ldots, \mu_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), \mu_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T \quad (44)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

\mu_{MSC-col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \quad (45)

\sigma_{MSC-col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC-col}^{MFCC}(j)\right)^2\right)^{1/2} \quad (46)

\mu_{MSV-col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \quad (47)

\sigma_{MSV-col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV-col}^{MFCC}(j)\right)^2\right)^{1/2} \quad (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), \mu_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), \ldots, \mu_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), \mu_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T \quad (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T \quad (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
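The aggregation of Eqs. (40)-(50) can be sketched as follows. One caveat: the entries here are grouped by statistic (all row means, then all row standard deviations, and so on) rather than interleaved per element as in Eqs. (44) and (49); the set of values is identical.

```python
import numpy as np

def aggregate(MSC, MSV):
    """Row- and column-based statistical aggregation, per Eqs. (40)-(50).

    MSC, MSV : (J, L) matrices of modulation spectral contrasts/valleys.
    Returns a 1-D feature vector of length 4L + 4J: mean and standard
    deviation along each row (per feature dimension, over subbands) and
    each column (per modulation subband, over feature dimensions).
    """
    row = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                          MSV.mean(axis=0), MSV.std(axis=0)])   # 4L values
    col = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                          MSV.mean(axis=1), MSV.std(axis=1)])   # 4J values
    return np.concatenate([row, col])
```

With J = 8 and L = 20 (the MFCC case), the result has 4×20 + 4×8 = 112 entries, matching the SMMFCC dimension above.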

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 le d lt D) row of

the MSC and MSV matrices of MOSC can be computed as follows

\mu_{MSC-row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \quad (51)

\sigma_{MSC-row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j, d) - \mu_{MSC-row}^{OSC}(d)\right)^2\right)^{1/2} \quad (52)

\mu_{MSV-row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \quad (53)

\sigma_{MSV-row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j, d) - \mu_{MSV-row}^{OSC}(d)\right)^2\right)^{1/2} \quad (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), \mu_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), \ldots, \mu_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), \mu_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T \quad (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \quad (56)

\sigma_{MSC-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j, d) - \mu_{MSC-col}^{OSC}(j)\right)^2\right)^{1/2} \quad (57)

\mu_{MSV-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \quad (58)

\sigma_{MSV-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j, d) - \mu_{MSV-col}^{OSC}(j)\right)^2\right)^{1/2} \quad (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), \mu_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots, \mu_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), \mu_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T \quad (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T \quad (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \quad (62)

\sigma_{MSC-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d)\right)^2\right)^{1/2} \quad (63)

\mu_{MSV-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \quad (64)

\sigma_{MSV-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d)\right)^2\right)^{1/2} \quad (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), \mu_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, \mu_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), \mu_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T \quad (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \quad (67)

\sigma_{MSC-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j)\right)^2\right)^{1/2} \quad (68)

\mu_{MSV-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \quad (69)

\sigma_{MSV-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j)\right)^2\right)^{1/2} \quad (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), \mu_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, \mu_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), \mu_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T \quad (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T \quad (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig 28 The row-based modulation spectral feature values: for each feature dimension d, the mean \mu_d^{row} and standard deviation \sigma_d^{row} are computed along the corresponding row of the MSC/MSV matrices, i.e., across the modulation frequency axis

Fig 29 The column-based modulation spectral feature values: for each modulation subband j, the mean \mu_j^{col} and standard deviation \sigma_j^{col} are computed along the corresponding column of the MSC/MSV matrices, i.e., across the feature dimensions

216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n} \quad (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \quad (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m) \quad (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
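The min-max normalization of Eqs. (74)-(75) can be sketched as follows; the guard against constant feature dimensions is an added safeguard not discussed in the text:

```python
import numpy as np

def minmax_normalize(train_feats):
    """Linear min-max normalization, per Eqs. (74)-(75).

    train_feats : (N, M) matrix, one row per training track.
    Returns the normalized matrix plus the per-feature minima and maxima,
    which must be reused verbatim when normalizing test tracks.
    """
    f_min = train_feats.min(axis=0)
    f_max = train_feats.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # guard constant dims
    return (train_feats - f_min) / span, f_min, f_max
```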

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h \le H) has to be found in order to provide higher discriminability among various music classes.

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T \quad (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T \quad (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\left((A^T S_W A)^{-1}(A^T S_B A)\right) \quad (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the

orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

x_w = (ΦΛ^{-1/2})^T x \quad (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, will become an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = ΦΛ^{-1/2} Ψ \quad (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x \quad (81)
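The whitened LDA procedure of Eqs. (76)-(80) can be sketched with NumPy's symmetric eigendecomposition. This is a sketch that assumes S_W is nonsingular (enough training vectors per dimension); production code would need regularization when it is not:

```python
import numpy as np

def whitened_lda(X, y, C):
    """Whitened LDA transformation matrix, a sketch of Eqs. (76)-(80).

    X : (N, H) training vectors; y : (N,) integer class labels in [0, C).
    Returns A_wlda of shape (H, C-1); project with  X @ A_wlda  (Eq. 81).
    Assumes the within-class scatter matrix S_W is nonsingular.
    """
    mean_all = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        Sw += d.T @ d                              # within-class scatter (76)
        g = (mc - mean_all)[:, None]
        Sb += len(Xc) * (g @ g.T)                  # between-class scatter (77)
    lam, Phi = np.linalg.eigh(Sw)                  # Sw = Phi Lam Phi^T
    white = Phi @ np.diag(1.0 / np.sqrt(lam))      # Phi Lam^(-1/2)
    Sb_w = white.T @ Sb @ white                    # whitened Sb
    evals, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(evals)[::-1][:C - 1]]  # top (C-1) eigenvectors
    return white @ Psi                             # A_wlda, Eq. (80)
```

On two well-separated toy classes this yields a single discriminant direction along which the classes do not overlap.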

23 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 \le c \le C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n} \quad (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) \quad (83)

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c \quad (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
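Eq. (84) amounts to a class-prior-weighted average, with P_c estimated as the share of test tracks in class c:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Class-prior-weighted overall accuracy, per Eq. (84).

    per_class_acc : CA_c, one accuracy per genre (as fractions).
    class_counts  : number of test tracks per genre, used to estimate P_c.
    """
    total = sum(class_counts)
    return sum(n / total * a for a, n in zip(per_class_acc, class_counts))

# e.g. two classes with 300 and 100 test tracks, accuracies 90% and 50%
print(round(overall_accuracy([0.90, 0.50], [300, 100]), 2))
```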

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64

Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the upper matrix gives track counts (columns: actual genre; rows: classified genre) and the lower matrix the corresponding percentages of each genre column.

(a) SMMFCC1
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         275           0     2          0        1     19
Electronic        0          91     0          1        7      6
Jazz              6           0    18          0        0      4
MetalPunk         2           3     0         36       20      4
PopRock           4          12     5          8       70     14
World            33           8     1          0        4     75
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         292           1     1          0        2     10
Electronic        1          89     1          2       11     11
Jazz              4           0    19          1        1      6
MetalPunk         0           5     0         32       21      3
PopRock           0          13     3         10       61      8
World            23           6     2          0        6     84
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         286           3     1          0        3     18
Electronic        0          87     1          1        9      5
Jazz              5           4    17          0        0      9
MetalPunk         0           4     1         36       18      4
PopRock           1          10     3          7       68     13
World            28           6     3          1        4     73
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     1          0        0      9
Electronic        0          96     1          1        9      9
Jazz              2           1    21          0        0      1
MetalPunk         0           1     0         34        8      1
PopRock           1           9     2          9       80     16
World            17           7     1          1        5     86
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49

32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector gets the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the upper matrix gives track counts (columns: actual genre; rows: classified genre) and the lower matrix the corresponding percentages of each genre column.

(a) SMMFCC2
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         272           1     1          0        6     22
Electronic        0          84     0          2        8      4
Jazz             13           1    19          1        2     19
MetalPunk         2           7     0         39       30      4
PopRock           0          11     3          3       47     19
World            33          10     3          0        9     54
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         262           2     0          0        3     33
Electronic        0          83     0          1        9      6
Jazz             17           1    20          0        6     20
MetalPunk         1           5     0         33       21      2
PopRock           0          17     4         10       51     10
World            40           6     2          1       12     51
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         277           0     0          0        2     29
Electronic        0          83     0          1        5      2
Jazz              9           3    17          1        2     15
MetalPunk         1           5     1         35       24      7
PopRock           2          13     1          8       57     15
World            31          10     7          0       12     54
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         289           5     0          0        3     18
Electronic        0          89     0          2        4      4
Jazz              2           3    19          0        1     10
MetalPunk         2           2     0         38       21      2
PopRock           0          12     5          4       61     11
World            27           3     2          1       12     77
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the upper matrix gives track counts (columns: actual genre; rows: classified genre) and the lower matrix the corresponding percentages of each genre column.

(a) SMMFCC3
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(%)         Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature value

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy reaches 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, March 2006.

[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.


selecting a specific classifier [12]. In this study, a novel feature set derived from the row-based and the column-based modulation spectrum analysis is proposed for automatic music genre classification.

1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to clearly define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) people often prefer to search music by browsing hierarchical catalogs; (2) taxonomy structures identify the relationships or dependences between music genres, so hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves classification efficiency and accuracy; (3) classification errors become more acceptable with a taxonomy than with direct music genre classification, because the coarse-to-fine approach concentrates the classification errors at a given level of the hierarchy.

Burred and Lerch [13] developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and different features can be employed at each branch point of the taxonomy. The hierarchical classification approach therefore allows the managers to trace at which level the classification errors occur frequently. Barbedo and Lopes [14] also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed bottom-up instead of top-down, because it is easy to merge leaf classes into the same parent class in a bottom-up structure, so the upper layers can be easily constructed. In their experimental results, the hierarchical bottom-up approach outperforms the top-down approach by about 3%-5% in classification accuracy.

Li and Ogihara [15] investigated the effect of two different taxonomy structures on music genre classification. They also proposed an approach for the automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach reduces the time-consuming and expensive task of manually constructing taxonomies; it also helps when searching music collections for which there is no natural taxonomy [16]. Given a genre taxonomy, many different approaches have been proposed to classify the genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy

Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system


1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections by musical genre.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentation or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows.

(1) Low-Energy Feature: this is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

ZCR_t = (1/2) Σ_{n=1}^{N-1} | sign(x_t[n]) − sign(x_t[n−1]) |

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
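As a quick illustration, the ZCR above can be computed directly from a frame of samples (a minimal NumPy sketch, not code from the thesis):

```python
import numpy as np

def zero_crossing_rate(frame):
    # sign() as defined above: 1 for positive input, 0 otherwise
    signs = (frame > 0).astype(int)
    return 0.5 * np.sum(np.abs(np.diff(signs)))

# A signal that flips sign every sample crosses zero at every step:
frame = np.array([1.0, -1.0] * 8)       # 16 samples, 15 sign flips
print(zero_crossing_rate(frame))        # 7.5
```

A noisy frame yields many sign flips and hence a high ZCR, which is why the feature works as a noisiness measure.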

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

C_t = ( Σ_{n=1}^{N} n · M_t[n] ) / ( Σ_{n=1}^{N} M_t[n] )

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth determines the frequency spread of the signal around the centroid:

SB_t = ( Σ_{n=1}^{N} (n − C_t)² · M_t[n] ) / ( Σ_{n=1}^{N} M_t[n] )

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated, i.e., the largest R_t satisfying

Σ_{k=0}^{R_t} S[k] ≤ 0.85 · Σ_{k=0}^{N-1} S[k]

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

SF_t = Σ_{k=0}^{N-1} ( N_t[k] − N_{t-1}[k] )²

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
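The four spectral-shape features above (centroid, bandwidth, roll-off, and flux) can be sketched from a single magnitude spectrum as follows; the 1-indexed bins and the helper name `spectral_shape` are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def spectral_shape(mag, prev_norm_mag=None, rolloff=0.85):
    """Centroid, bandwidth, roll-off bin and (optionally) flux from one
    magnitude spectrum mag[1..N], following the definitions above."""
    n = np.arange(1, len(mag) + 1)
    total = np.sum(mag)
    centroid = np.sum(n * mag) / total
    bandwidth = np.sum((n - centroid) ** 2 * mag) / total
    # roll-off: smallest bin index below which 85% of the magnitude lies
    rolloff_bin = np.searchsorted(np.cumsum(mag), rolloff * total) + 1
    norm = mag / total
    flux = None if prev_norm_mag is None else np.sum((norm - prev_norm_mag) ** 2)
    return centroid, bandwidth, rolloff_bin, norm, flux

c, bw, ro, norm, _ = spectral_shape(np.array([1.0, 4.0, 2.0, 1.0]))
print(round(c, 3))    # 2.375, the weighted mean bin index
```

Passing the previous frame's normalized spectrum as `prev_norm_mag` yields the flux between consecutive frames.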

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone: the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-Based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately, and can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Each ASE coefficient is then normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the periods of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and its corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple-pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers; the main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find a representative feature vector for a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, autoregressive models [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most commonly used method to integrate short-term features. Let x_i = [x_i[0], x_i[1], …, x_i[D−1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

μ[d] = (1/T) Σ_{i=0}^{T−1} x_i[d],  0 ≤ d ≤ D−1

σ[d] = [ (1/T) Σ_{i=0}^{T−1} ( x_i[d] − μ[d] )² ]^{1/2},  0 ≤ d ≤ D−1

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
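A sketch of this integration, assuming the frame-level features are stacked row-wise (the sizes below are arbitrary toy values):

```python
import numpy as np

# Frame-level features: T frames, each a D-dimensional vector (rows = frames).
T, D = 100, 20
frames = np.random.default_rng(0).normal(size=(T, D))

# Long-term representation: per-dimension mean and standard deviation,
# concatenated into a single 2D-dimensional song-level vector.
mu = frames.mean(axis=0)                    # mu[d]
sigma = frames.std(axis=0)                  # sigma[d], the population (1/T) form
song_vector = np.concatenate([mu, sigma])   # shape (2D,)
print(song_vector.shape)                    # (40,)
```

This collapses the whole trajectory into one vector, which is exactly why the method loses the temporal ordering the modulation spectrum later recovers.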

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used AR models to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analyses to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model; the extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by one MAR model. The difference between the MAR model and the AR model is that MAR considers the relationships between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the coefficient dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
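The per-dimension AR fit used in DAR can be sketched with ordinary least squares; `ar_coeffs` is an illustrative helper, not Meng et al.'s implementation:

```python
import numpy as np

def ar_coeffs(x, p):
    """Least-squares fit of a p-order AR model to one feature trajectory:
    predict x[t] from x[t-1..t-p], return the p AR coefficients."""
    rows = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
    target = x[p:]
    coeffs, *_ = np.linalg.lstsq(rows, target, rcond=None)
    return coeffs

# Synthetic AR(1) trajectory: x[t] = 0.9 x[t-1] + small noise
rng = np.random.default_rng(0)
x = np.zeros(200)
x[0] = 1.0
for t in range(1, 200):
    x[t] = 0.9 * x[t - 1] + 0.01 * rng.normal()
print(ar_coeffs(x, 1))    # close to [0.9]
```

In DAR the resulting coefficients (one fit per feature dimension) are appended to the mean and variance statistics to form the long-term vector.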

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition; it has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification and showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.
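The basic idea can be sketched on a synthetic feature trajectory: treat the per-frame values of one feature as a time series and Fourier-analyze it. The 100-frames-per-second frame rate and the 4 Hz modulation below are illustrative assumptions:

```python
import numpy as np

fs_frame = 100.0                    # frame rate in frames per second (assumed)
t = np.arange(500) / fs_frame
traj = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)   # feature oscillating at 4 Hz

# Modulation spectrum: FFT of the (mean-removed) feature trajectory
spectrum = np.abs(np.fft.rfft(traj - traj.mean()))
mod_freqs = np.fft.rfftfreq(len(traj), d=1.0 / fs_frame)
print(mod_freqs[np.argmax(spectrum)])            # peaks at the 4 Hz modulation rate
```

Repeating this for every feature dimension yields the modulation spectrogram that the proposed method later summarizes with contrasts and valleys.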

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure that is complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes, and the optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all classes, which does not consider class-wise differences.
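A compact sketch of the scatter-matrix formulation behind this idea (illustrative only; the exact procedure used by the system is given in Chapter 2):

```python
import numpy as np

def lda_transform(X, y, d):
    """Fisher LDA sketch: maximize between-class over within-class scatter,
    then project the n-dim features onto the top-d discriminant directions."""
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # eigenvectors of Sw^-1 Sb, largest eigenvalues first
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)[:d]
    W = vecs.real[:, order]                   # n x d transformation matrix
    return X @ W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3 * c, size=(30, 6)) for c in range(3)])
y = np.repeat(np.arange(3), 30)
print(lda_transform(X, y, 2).shape)           # (90, 2)
```

Note that with C classes at most C − 1 discriminant directions carry information, which is why d is usually chosen well below n.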

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral, rhythmic, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. The Classical sub-genres are Choir, Orchestra, Piano, and String Quartet; the Jazz sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree classifier, of a Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and the decision tree classifier achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system in which some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames that cannot be correctly classified, and the GMM model of each music genre is updated with each correctly classified frame. Moreover, another GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genres: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity between the LDB nodes of any two classes and extracted features from the high-dissimilarity LDB nodes. First, they use a wavelet packet tree decomposition to construct a five-level tree for a music signal. Two novel measures, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are then used to measure the dissimilarity between the LDB nodes of any two classes. In their classification system, the feature dimension is 30, comprising the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT obtained by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, some experiments are presented to show the effectiveness of the proposed method. Finally, a conclusion is given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.3. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

ŝ[n] = s[n] − a · s[n−1]   (1)

where s[n] is the current sample, s[n−1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

s̃_i[n] = ŝ_i[n] · w[n],  0 ≤ n ≤ N−1   (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 − 0.46 cos( 2πn / (N−1) ),  0 ≤ n ≤ N−1   (3)


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = Σ_{n=0}^{N−1} s̃_i[n] · e^{−j2πkn/N},  0 ≤ k ≤ N−1   (4)

where k is the frequency index.

Step 5: Mel-scale Band-pass Filtering

The spectrum is then decomposed into a number of subbands by a set of Mel-scale band-pass filters:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1   (5)

where B is the total number of filters (B is 25 in this study), and I_b^l and I_b^h denote, respectively, the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|². I_b^l and I_b^h are given as

I_b^l = f_b^l / (f_s / N),  I_b^h = f_b^h / (f_s / N)   (6)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT to the logarithm of E(b):

MFCC_i(l) = Σ_{b=0}^{B−1} log10( 1 + E_i(b) ) · cos( πl(b + 0.5) / B ),  0 ≤ l < L   (7)

where L is the length of the MFCC feature vector (L is 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

x_MFCC = [MFCC(0), MFCC(1), …, MFCC(L−1)]^T   (8)
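Steps 1-6 can be sketched end to end for a single frame. This is an illustrative NumPy sketch, not the thesis implementation: the toy band layout stands in for Table 2.1, and the plain in-band sums of Eq. (5) are used rather than true triangular filter weights:

```python
import numpy as np

def mfcc_frame(s, fs, bands, a=0.95, L=20):
    """One frame of MFCC following Steps 1-6; `bands` is a list of
    (f_low, f_high) pairs in Hz."""
    N = len(s)
    s = np.append(s[0], s[1:] - a * s[:-1])                # pre-emphasis, Eq. (1)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    X = np.fft.fft(s * w)                                  # Eqs. (2)-(4)
    A = np.abs(X) ** 2                                     # squared amplitude
    E = np.array([A[int(fl / (fs / N)): int(fh / (fs / N)) + 1].sum()
                  for fl, fh in bands])                    # Eqs. (5)-(6)
    b = np.arange(len(bands))
    return np.array([np.sum(np.log10(1 + E) *
                            np.cos(np.pi * l * (b + 0.5) / len(bands)))
                     for l in range(L)])                   # DCT, Eq. (7)

bands = [(100 * i, 100 * i + 200) for i in range(10)]      # toy band layout
feat = mfcc_frame(np.random.default_rng(1).normal(size=512), 8000, bands)
print(feat.shape)                                          # (20,)
```

Applying this to every frame of a track yields the per-frame MFCC trajectories that the modulation spectrum analysis of Section 2.1 operates on.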

Fig. 2.1 The flowchart for computing MFCC: Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC


Table 2.1 The range of each triangular band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 200]
1   (100, 300]
2   (200, 400]
3   (300, 500]
4   (400, 600]
5   (500, 700]
6   (600, 800]
7   (700, 900]
8   (800, 1000]
9   (900, 1149]
10  (1000, 1320]
11  (1149, 1516]
12  (1320, 1741]
13  (1516, 2000]
14  (1741, 2297]
15  (2000, 2639]
16  (2297, 3031]
17  (2639, 3482]
18  (3031, 4000]
19  (3482, 4595]
20  (4000, 5278]
21  (4595, 6063]
22  (5278, 6964]
23  (6063, 8000]
24  (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1   (9)

where B is the number of subbands, and I_b^l and I_b^h denote, respectively, the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|². I_b^l and I_b^h are given as

I_b^l = f_b^l / (f_s / N),  I_b^h = f_b^h / (f_s / N)   (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = log( (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,i} )   (11)

Valley(b) = log( (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b−i+1} )   (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) − Valley(b)   (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x_OSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T   (14)
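The peak/valley selection of Eqs. (11)-(13) for one subband can be sketched as follows (`osc_band` is an illustrative helper, and the toy magnitudes are arbitrary):

```python
import numpy as np

def osc_band(mag_band, alpha=0.2):
    """Peak, valley, and contrast for one subband, per Eqs. (11)-(13):
    average the alpha*Nb largest / smallest magnitudes, then take logs."""
    m = np.sort(mag_band)[::-1]               # M_b,1 >= M_b,2 >= ...
    k = max(1, int(round(alpha * len(m))))    # alpha * Nb neighborhood
    peak = np.log(m[:k].mean())
    valley = np.log(m[-k:].mean())
    return peak, valley, peak - valley

band = np.array([0.1, 8.0, 0.2, 6.0, 0.3, 0.1, 0.2, 0.1, 0.05, 0.1])
p, v, sc = osc_band(band)
print(sc > 0)    # True: strong peaks dominate the valley floor
```

Averaging over a neighborhood rather than taking the single extreme bin makes the estimate robust to isolated noisy bins, which is the point of the factor α.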

Fig. 2.2 The flowchart for computing OSC: Input Signal → Framing → FFT → Octave-scale filtering → Peak/Valley Selection → Spectral Contrast → OSC


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number  Frequency interval (Hz)
0  [0, 0]
1  (0, 100]
2  (100, 200]
3  (200, 400]
4  (400, 800]
5  (800, 1600]
6  (1600, 3200]
7  (3200, 6400]
8  (6400, 12800]
9  (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE is defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = (1 / (N · E_w)) · |X(k)|²,  k = 0, N/2
P(k) = (2 / (N · E_w)) · |X(k)|²,  0 < k < N/2   (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|²   (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands

spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a

spectrum of 8 octave interval (see Fig24) The NASE scale filtering

operation can be described as follows(see Table 23)

$$ASE_i(b)=\sum_{k=I_b^l}^{I_b^h}P_i(k),\quad 0\le b<B,\ 0\le k\le \frac{N}{2}-1$$  (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$$r=2^{j}\ \text{octaves},\quad -4\le j\le 3$$  (18)

$I_b^l$ and $I_b^h$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$$I_b^l=\frac{f_b^l}{f_s/N},\qquad I_b^h=\frac{f_b^h}{f_s/N}$$  (19)

where $f_s$ is the sampling frequency, and $f_b^l$ and $f_b^h$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power spectrum coefficients within this subband:

$$ASE(b)=\sum_{k=I_b^l}^{I_b^h}P(k),\quad 0\le b\le B+1$$  (20)

Each ASE coefficient is then converted to the decibel scale

$$ASE_{dB}(b)=10\log_{10}\big(ASE(b)\big),\quad 0\le b\le B+1$$  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R

$$NASE(b)=\frac{ASE_{dB}(b)}{R},\quad 0\le b\le B+1$$  (22)

where the RMS-norm gain value R is defined as

$$R=\sqrt{\sum_{b=0}^{B+1}\big(ASE_{dB}(b)\big)^2}$$  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension

of NASE is B+3 Thus the NASE feature vector of an audio frame will be

represented as follows

$x_{NASE} = [\,R,\ NASE(0),\ NASE(1),\ \ldots,\ NASE(B+1)\,]^T$  (24)

[Fig 23: Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE]

Fig 23 The flowchart for computing NASE

[Fig 24: subband boundaries at 62.5, 88.4, 125, 176.8, 250, 353.6, 500, 707.1, 1000, 1414.2, 2000, 2828.4, 4000, 5656.9, 8000, 11313.7, 16000 Hz; 1 coefficient below loEdge, 16 coefficients between loEdge and hiEdge, 1 coefficient above hiEdge]

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2

Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 62.5]
1   (62.5, 88.4]
2   (88.4, 125]
3   (125, 176.8]
4   (176.8, 250]
5   (250, 353.6]
6   (353.6, 500]
7   (500, 707.1]
8   (707.1, 1000]
9   (1000, 1414.2]
10  (1414.2, 2000]
11  (2000, 2828.4]
12  (2828.4, 4000]
13  (4000, 5656.9]
14  (5656.9, 8000]
15  (8000, 11313.7]
16  (11313.7, 16000]
17  (16000, 22050]
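Steps 2 and 3 above can be sketched as follows, assuming a precomputed one-sided power spectrum P(k). The band edges follow Table 23, and the small epsilon that guards empty or silent bands is an illustrative addition, not part of the standard.

```python
import numpy as np

EDGES = [0, 62.5, 88.4, 125, 176.8, 250, 353.6, 500, 707.1, 1000, 1414.2,
         2000, 2828.4, 4000, 5656.9, 8000, 11313.7, 16000, 22050]

def nase(p, fs=44100):
    n = 2 * len(p)                                   # FFT size N
    freqs = np.arange(len(p)) * fs / n
    ase = np.array([p[(freqs > lo) & (freqs <= hi)].sum() + 1e-12
                    for lo, hi in zip(EDGES[:-1], EDGES[1:])])
    ase_db = 10.0 * np.log10(ase)                    # Eq. (21)
    r = np.sqrt(np.sum(ase_db ** 2))                 # Eq. (23), RMS-norm gain
    return np.concatenate([[r], ase_db / r])         # [R, NASE(0), ..., NASE(B+1)]

p = np.abs(np.fft.rfft(np.random.randn(1024)))[:512] ** 2
x_nase = nase(p)
print(x_nase.shape)  # (19,): 18 band coefficients plus the gain R
```

The 19-dimensional result matches the NASE feature dimension D = 19 used later in the modulation spectral analysis.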

214 Modulation Spectral Analysis

MFCC, OSC and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC and NASE to observe the variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $MFCC_i[l]$, $0\le l<L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m,l)=\sum_{n=0}^{W-1}MFCC_{t\times\frac{W}{2}+n}[l]\,e^{-j\frac{2\pi nm}{W}},\quad 0\le m<W,\ 0\le l<L$$  (25)

where $M_t(m,l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

$$\bar{M}^{MFCC}(m,l)=\frac{1}{T}\sum_{t=1}^{T}\big|M_t(m,l)\big|,\quad 0\le m<W,\ 0\le l<L$$  (26)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

$$MSP^{MFCC}(j,l)=\max_{\Phi_j^l\le m<\Phi_j^h}\bar{M}^{MFCC}(m,l)$$  (27)

$$MSV^{MFCC}(j,l)=\min_{\Phi_j^l\le m<\Phi_j^h}\bar{M}^{MFCC}(m,l)$$  (28)

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0\le j<J$.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$$MSC^{MFCC}(j,l)=MSP^{MFCC}(j,l)-MSV^{MFCC}(j,l)$$  (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.

Fig 25 The flowchart for extracting MMFCC
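Steps 2 and 3 of the modulation spectral analysis can be sketched with numpy as below, assuming a precomputed frame-by-frame MFCC matrix (the random input merely stands in for real MFCC trajectories):

```python
import numpy as np

# FFT along each coefficient trajectory within 50%-overlapped texture windows
# of W frames, average the magnitudes over all windows (Eq. 26), then take the
# per-subband max/min as MSP/MSV (Eqs. 27-28) and their difference as MSC.
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32),
            (32, 64), (64, 128), (128, 256)]          # Table 2.4, J = 8

def modulation_msc_msv(feats, w=512):
    hop = w // 2
    wins = [np.abs(np.fft.fft(feats[s:s + w], axis=0))
            for s in range(0, len(feats) - w + 1, hop)]
    m_avg = np.mean(wins, axis=0)                     # W x L averaged spectrogram
    msp = np.array([m_avg[lo:hi].max(axis=0) for lo, hi in SUBBANDS])
    msv = np.array([m_avg[lo:hi].min(axis=0) for lo, hi in SUBBANDS])
    return msp - msv, msv                             # MSC and MSV, each J x L

mfcc = np.random.randn(2048, 20)                      # e.g. 2048 frames, L = 20
msc, msv = modulation_msc_msv(mfcc)
print(msc.shape)  # (8, 20)
```

Stacking the MSC and MSV matrices gives the 2×20×8 = 320 MMFCC values; the same routine applies unchanged to OSC and NASE trajectories.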

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $OSC_i[d]$, $0\le d<D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m,d)=\sum_{n=0}^{W-1}OSC_{t\times\frac{W}{2}+n}[d]\,e^{-j\frac{2\pi nm}{W}},\quad 0\le m<W,\ 0\le d<D$$  (30)

where $M_t(m,d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two

successive texture windows The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

$$\bar{M}^{OSC}(m,d)=\frac{1}{T}\sum_{t=1}^{T}\big|M_t(m,d)\big|,\quad 0\le m<W,\ 0\le d<D$$  (31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

$$MSP^{OSC}(j,d)=\max_{\Phi_j^l\le m<\Phi_j^h}\bar{M}^{OSC}(m,d)$$  (32)

$$MSV^{OSC}(j,d)=\min_{\Phi_j^l\le m<\Phi_j^h}\bar{M}^{OSC}(m,d)$$  (33)

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0\le j<J$.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$$MSC^{OSC}(j,d)=MSP^{OSC}(j,d)-MSV^{OSC}(j,d)$$  (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $NASE_i[d]$, $0\le d<D$, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m,d)=\sum_{n=0}^{W-1}NASE_{t\times\frac{W}{2}+n}[d]\,e^{-j\frac{2\pi nm}{W}},\quad 0\le m<W,\ 0\le d<D$$  (35)

where $M_t(m,d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

$$\bar{M}^{NASE}(m,d)=\frac{1}{T}\sum_{t=1}^{T}\big|M_t(m,d)\big|,\quad 0\le m<W,\ 0\le d<D$$  (36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands (see Table 24).

In the study the number of modulation subbands is 8 (J = 8) The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

$$MSP^{NASE}(j,d)=\max_{\Phi_j^l\le m<\Phi_j^h}\bar{M}^{NASE}(m,d)$$  (37)

$$MSV^{NASE}(j,d)=\min_{\Phi_j^l\le m<\Phi_j^h}\bar{M}^{NASE}(m,d)$$  (38)

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0\le j<J$.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$$MSC^{NASE}(j,d)=MSP^{NASE}(j,d)-MSV^{NASE}(j,d)$$  (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.

[Fig 27: Music signal → Framing (s_1[n], …, s_I[n]) → NASE extraction (NASE_1[d], …, NASE_I[d]) → DFT along each feature trajectory → windowing/averaging of the modulation spectra (M_1(m,d), …, M_T(m,d)) → Contrast/Valley determination → MASE]

Fig 27 The flowchart for extracting MASE

Table 24 Frequency interval of each modulation subband

Filter number  Modulation frequency index range  Modulation frequency interval (Hz)
0   [0, 2)      [0, 0.33)
1   [2, 4)      [0.33, 0.66)
2   [4, 8)      [0.66, 1.32)
3   [8, 16)     [1.32, 2.64)
4   [16, 32)    [2.64, 5.28)
5   [32, 64)    [5.28, 10.56)
6   [64, 128)   [10.56, 21.12)
7   [128, 256)  [21.12, 42.24]
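The Hz intervals in Table 24 can be reproduced from the modulation frequency resolution. The frame rate of 512 frames per roughly 6-second texture window (about 85.3 frames/s) is an assumption consistent with the text, not a value stated explicitly:

```python
# Each modulation-frequency index corresponds to frame_rate / W Hz; doubling
# index ranges therefore doubles the Hz intervals, as in Table 2.4.
frame_rate = 512 / 6.0              # frames per second (approximate)
W = 512                             # texture window length in frames
res = frame_rate / W                # Hz per modulation frequency index
for j in range(8):
    lo, hi = (2 ** j if j > 0 else 0), 2 ** (j + 1)
    print(f"subband {j}: indices [{lo}, {hi}) -> [{lo * res:.2f}, {hi * res:.2f}) Hz")
```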

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29).

To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the

feature values

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

$$u_{MSC\text{-}row}^{MFCC}(l)=\frac{1}{J}\sum_{j=0}^{J-1}MSC^{MFCC}(j,l)$$  (40)

$$\sigma_{MSC\text{-}row}^{MFCC}(l)=\sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{MFCC}(j,l)-u_{MSC\text{-}row}^{MFCC}(l)\big)^2}$$  (41)

$$u_{MSV\text{-}row}^{MFCC}(l)=\frac{1}{J}\sum_{j=0}^{J-1}MSV^{MFCC}(j,l)$$  (42)

$$\sigma_{MSV\text{-}row}^{MFCC}(l)=\sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{MFCC}(j,l)-u_{MSV\text{-}row}^{MFCC}(l)\big)^2}$$  (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$f_{row}^{MFCC}=\big[u_{MSC\text{-}row}^{MFCC}(0),\ \sigma_{MSC\text{-}row}^{MFCC}(0),\ u_{MSV\text{-}row}^{MFCC}(0),\ \sigma_{MSV\text{-}row}^{MFCC}(0),\ \ldots,\ u_{MSC\text{-}row}^{MFCC}(L-1),\ \sigma_{MSC\text{-}row}^{MFCC}(L-1),\ u_{MSV\text{-}row}^{MFCC}(L-1),\ \sigma_{MSV\text{-}row}^{MFCC}(L-1)\big]^T$$  (44)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)

column of the MSC and MSV matrices can be computed as follows

$$u_{MSC\text{-}col}^{MFCC}(j)=\frac{1}{L}\sum_{l=0}^{L-1}MSC^{MFCC}(j,l)$$  (45)

$$\sigma_{MSC\text{-}col}^{MFCC}(j)=\sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\big(MSC^{MFCC}(j,l)-u_{MSC\text{-}col}^{MFCC}(j)\big)^2}$$  (46)

$$u_{MSV\text{-}col}^{MFCC}(j)=\frac{1}{L}\sum_{l=0}^{L-1}MSV^{MFCC}(j,l)$$  (47)

$$\sigma_{MSV\text{-}col}^{MFCC}(j)=\sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\big(MSV^{MFCC}(j,l)-u_{MSV\text{-}col}^{MFCC}(j)\big)^2}$$  (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{MFCC}=\big[u_{MSC\text{-}col}^{MFCC}(0),\ \sigma_{MSC\text{-}col}^{MFCC}(0),\ u_{MSV\text{-}col}^{MFCC}(0),\ \sigma_{MSV\text{-}col}^{MFCC}(0),\ \ldots,\ u_{MSC\text{-}col}^{MFCC}(J-1),\ \sigma_{MSC\text{-}col}^{MFCC}(J-1),\ u_{MSV\text{-}col}^{MFCC}(J-1),\ \sigma_{MSV\text{-}col}^{MFCC}(J-1)\big]^T$$  (49)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) can be obtained:

$$f^{MFCC}=\big[(f_{row}^{MFCC})^T,\ (f_{col}^{MFCC})^T\big]^T$$  (50)

In summary, the row-based modulation spectral feature vector is of size 4L = 4×20 = 80 and the column-based one is of size 4J = 4×8 = 32. Combining them results in a feature vector of length 4L+4J; that is, the overall feature dimension of SMMFCC is 80+32 = 112.
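The aggregation above reduces to row-wise and column-wise means and standard deviations of the J×L (here 8×20) MSC and MSV matrices; a minimal numpy sketch:

```python
import numpy as np

# Row statistics (per feature dimension, over modulation subbands) give 4L
# values; column statistics (per subband, over feature dimensions) give 4J.
def aggregate(msc, msv):
    parts = []
    for m in (msc, msv):
        parts += [m.mean(axis=0), m.std(axis=0)]   # per row l: u, sigma (length L)
    row_feat = np.concatenate(parts)               # 4L values, Eqs. (40)-(44)
    parts = []
    for m in (msc, msv):
        parts += [m.mean(axis=1), m.std(axis=1)]   # per column j: u, sigma (length J)
    col_feat = np.concatenate(parts)               # 4J values, Eqs. (45)-(49)
    return np.concatenate([row_feat, col_feat])    # Eq. (50)

f = aggregate(np.random.rand(8, 20), np.random.rand(8, 20))
print(f.shape)  # (112,) = 4*20 + 4*8
```

Note the sketch groups all means before all standard deviations, whereas Eq. (44) interleaves them per index; the set of values is the same.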

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of

the MSC and MSV matrices of MOSC can be computed as follows

$$u_{MSC\text{-}row}^{OSC}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSC^{OSC}(j,d)$$  (51)

$$\sigma_{MSC\text{-}row}^{OSC}(d)=\sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{OSC}(j,d)-u_{MSC\text{-}row}^{OSC}(d)\big)^2}$$  (52)

$$u_{MSV\text{-}row}^{OSC}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSV^{OSC}(j,d)$$  (53)

$$\sigma_{MSV\text{-}row}^{OSC}(d)=\sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{OSC}(j,d)-u_{MSV\text{-}row}^{OSC}(d)\big)^2}$$  (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$f_{row}^{OSC}=\big[u_{MSC\text{-}row}^{OSC}(0),\ \sigma_{MSC\text{-}row}^{OSC}(0),\ u_{MSV\text{-}row}^{OSC}(0),\ \sigma_{MSV\text{-}row}^{OSC}(0),\ \ldots,\ u_{MSC\text{-}row}^{OSC}(D-1),\ \sigma_{MSC\text{-}row}^{OSC}(D-1),\ u_{MSV\text{-}row}^{OSC}(D-1),\ \sigma_{MSV\text{-}row}^{OSC}(D-1)\big]^T$$  (55)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{MSC\text{-}col}^{OSC}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSC^{OSC}(j,d)$$  (56)

$$\sigma_{MSC\text{-}col}^{OSC}(j)=\sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{OSC}(j,d)-u_{MSC\text{-}col}^{OSC}(j)\big)^2}$$  (57)

$$u_{MSV\text{-}col}^{OSC}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSV^{OSC}(j,d)$$  (58)

$$\sigma_{MSV\text{-}col}^{OSC}(j)=\sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{OSC}(j,d)-u_{MSV\text{-}col}^{OSC}(j)\big)^2}$$  (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{OSC}=\big[u_{MSC\text{-}col}^{OSC}(0),\ \sigma_{MSC\text{-}col}^{OSC}(0),\ u_{MSV\text{-}col}^{OSC}(0),\ \sigma_{MSV\text{-}col}^{OSC}(0),\ \ldots,\ u_{MSC\text{-}col}^{OSC}(J-1),\ \sigma_{MSC\text{-}col}^{OSC}(J-1),\ u_{MSV\text{-}col}^{OSC}(J-1),\ \sigma_{MSV\text{-}col}^{OSC}(J-1)\big]^T$$  (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$f^{OSC}=\big[(f_{row}^{OSC})^T,\ (f_{col}^{OSC})^T\big]^T$$  (61)

In summary, the row-based modulation spectral feature vector is of size 4D = 4×20 = 80 and the column-based one is of size 4J = 4×8 = 32. Combining them results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$u_{MSC\text{-}row}^{NASE}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSC^{NASE}(j,d)$$  (62)

$$\sigma_{MSC\text{-}row}^{NASE}(d)=\sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{NASE}(j,d)-u_{MSC\text{-}row}^{NASE}(d)\big)^2}$$  (63)

$$u_{MSV\text{-}row}^{NASE}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSV^{NASE}(j,d)$$  (64)

$$\sigma_{MSV\text{-}row}^{NASE}(d)=\sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{NASE}(j,d)-u_{MSV\text{-}row}^{NASE}(d)\big)^2}$$  (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$f_{row}^{NASE}=\big[u_{MSC\text{-}row}^{NASE}(0),\ \sigma_{MSC\text{-}row}^{NASE}(0),\ u_{MSV\text{-}row}^{NASE}(0),\ \sigma_{MSV\text{-}row}^{NASE}(0),\ \ldots,\ u_{MSC\text{-}row}^{NASE}(D-1),\ \sigma_{MSC\text{-}row}^{NASE}(D-1),\ u_{MSV\text{-}row}^{NASE}(D-1),\ \sigma_{MSV\text{-}row}^{NASE}(D-1)\big]^T$$  (66)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{MSC\text{-}col}^{NASE}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSC^{NASE}(j,d)$$  (67)

$$\sigma_{MSC\text{-}col}^{NASE}(j)=\sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{NASE}(j,d)-u_{MSC\text{-}col}^{NASE}(j)\big)^2}$$  (68)

$$u_{MSV\text{-}col}^{NASE}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSV^{NASE}(j,d)$$  (69)

$$\sigma_{MSV\text{-}col}^{NASE}(j)=\sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{NASE}(j,d)-u_{MSV\text{-}col}^{NASE}(j)\big)^2}$$  (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{NASE}=\big[u_{MSC\text{-}col}^{NASE}(0),\ \sigma_{MSC\text{-}col}^{NASE}(0),\ u_{MSV\text{-}col}^{NASE}(0),\ \sigma_{MSV\text{-}col}^{NASE}(0),\ \ldots,\ u_{MSC\text{-}col}^{NASE}(J-1),\ \sigma_{MSC\text{-}col}^{NASE}(J-1),\ u_{MSV\text{-}col}^{NASE}(J-1),\ \sigma_{MSV\text{-}col}^{NASE}(J-1)\big]^T$$  (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$f^{NASE}=\big[(f_{row}^{NASE})^T,\ (f_{col}^{NASE})^T\big]^T$$  (72)

In summary, the row-based modulation spectral feature vector is of size 4D = 4×19 = 76 and the column-based one is of size 4J = 4×8 = 32. Combining them results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.

Fig 28 The row-based statistical aggregation: the mean $\mu_d^{row}$ and standard deviation $\sigma_d^{row}$ are computed along each row (feature dimension d) of the MSC/MSV matrix, across the modulation-frequency axis

Fig 29 The column-based statistical aggregation: the mean $\mu_j^{col}$ and standard deviation $\sigma_j^{col}$ are computed along each column (modulation subband j) of the MSC/MSV matrix, across the feature-dimension axis


216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre is derived by averaging the feature vectors for the whole set of training music signals of the same genre:

$$\bar{f}_c=\frac{1}{N_c}\sum_{n=1}^{N_c}f_{c,n}$$  (73)

where $f_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{f}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges for different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{f}_c$:

$$\hat{f}_c(m)=\frac{\bar{f}_c(m)-f_{min}(m)}{f_{max}(m)-f_{min}(m)},\quad 1\le c\le C$$  (74)

where C is the number of classes, $\bar{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{max}(m)=\max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m),\qquad f_{min}(m)=\min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m)$$  (75)

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
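Eqs. (73)-(75) amount to per-genre averaging followed by feature-wise min-max normalization; a minimal sketch (the guard against constant features is an illustrative addition, not part of the thesis):

```python
import numpy as np

# Compute one representative (mean) vector per genre, then min-max normalize
# each feature dimension using the extremes over all training vectors.
def normalize_centroids(feats, labels, num_classes):
    f_min, f_max = feats.min(axis=0), feats.max(axis=0)       # Eq. (75)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)        # guard constant features
    centroids = np.array([feats[labels == c].mean(axis=0)
                          for c in range(num_classes)])       # Eq. (73)
    return (centroids - f_min) / span                         # Eq. (74)

feats = np.random.rand(30, 5)
labels = np.repeat(np.arange(6), 5)
norm = normalize_centroids(feats, labels, 6)
print(norm.shape)  # (6, 5)
```

Since each class mean lies between the global minimum and maximum, every normalized value falls in [0, 1].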

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

$$S_W=\sum_{c=1}^{C}\sum_{n=1}^{N_c}(x_{c,n}-\bar{x}_c)(x_{c,n}-\bar{x}_c)^T$$  (76)

where $x_{c,n}$ is the n-th feature vector labeled as class c, $\bar{x}_c$ is the mean vector of class

c C is the total number of music classes and Nc is the number of training vectors

labeled as class c The between-class scatter matrix is given by

$$S_B=\sum_{c=1}^{C}N_c(\bar{x}_c-\bar{x})(\bar{x}_c-\bar{x})^T$$  (77)

where $\bar{x}$ is the mean vector of all training vectors The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion JF defined as the ratio of between-class scatter to within-class scatter

$$J_F(A)=tr\big((A^TS_WA)^{-1}(A^TS_BA)\big)$$  (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space In this study a whitening procedure is integrated with LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of $S_W$ are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of $S_W$, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus $S_W\Phi=\Phi\Lambda$. Each training vector x is then whitening-transformed by $\Phi\Lambda^{-1/2}$:

$$x_w=(\Phi\Lambda^{-1/2})^{T}x$$  (79)

It can be shown that the whitened within-class scatter matrix $S_W^w=(\Phi\Lambda^{-1/2})^T S_W(\Phi\Lambda^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix $S_B^w=(\Phi\Lambda^{-1/2})^T S_B(\Phi\Lambda^{-1/2})$ contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of $S_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix $A_{WLDA}$ is defined as

$$A_{WLDA}=\Phi\Lambda^{-1/2}\Psi$$  (80)

$A_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$y=A_{WLDA}^{T}x$$  (81)
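The whitened LDA procedure above can be sketched with numpy's symmetric eigendecomposition; the small regularizer added to the eigenvalues is an illustrative safeguard against a singular $S_W$, not part of the thesis:

```python
import numpy as np

# Build S_W and S_B (Eqs. 76-77), whiten with S_W's eigendecomposition
# (Eq. 79), then keep the top C-1 eigenvectors of the whitened S_B (Eq. 80).
def whitened_lda(x, y, num_classes):
    mean_all = x.mean(axis=0)
    s_w = np.zeros((x.shape[1], x.shape[1]))
    s_b = np.zeros_like(s_w)
    for c in range(num_classes):
        xc = x[y == c]
        d = xc - xc.mean(axis=0)
        s_w += d.T @ d                                   # within-class scatter
        m = (xc.mean(axis=0) - mean_all)[:, None]
        s_b += len(xc) * (m @ m.T)                       # between-class scatter
    lam, phi = np.linalg.eigh(s_w)
    white = phi @ np.diag(1.0 / np.sqrt(lam + 1e-10))    # Phi Lambda^{-1/2}
    s_b_w = white.T @ s_b @ white                        # whitened S_B
    vals, psi = np.linalg.eigh(s_b_w)
    psi = psi[:, np.argsort(vals)[::-1][: num_classes - 1]]
    return white @ psi                                   # A_WLDA

x = np.random.randn(120, 10)
y = np.repeat(np.arange(6), 20)
a = whitened_lda(x, y, 6)
print(a.shape)  # (10, 5): H = 10 down to h = C - 1 = 5
```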

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix $A_{WLDA}$. Let y denote the whitened-LDA-transformed feature vector. In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 ≤ c ≤ C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

$$\bar{y}_c=\frac{1}{N_c}\sum_{n=1}^{N_c}y_{c,n}$$  (82)

where $y_{c,n}$ denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{y}_c$ is the representative feature vector of the

c-th music genre and Nc is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

$$s=\arg\min_{1\le c\le C} d\big(y,\ \bar{y}_c\big)$$  (83)
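The nearest-centroid decision of Eqs. (82)-(83) is a one-liner in numpy; the centroids below are toy values, not real genre representatives:

```python
import numpy as np

# Pick the genre whose centroid has minimum Euclidean distance to y (Eq. 83).
def classify(y_vec, centroids):
    d = np.linalg.norm(centroids - y_vec, axis=1)   # distance to each centroid
    return int(np.argmin(d))

centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
print(classify(np.array([4.2, 4.9]), centroids))  # 1
```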

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this

study each MP3 audio file is first converted into raw digital audio before

classification These music tracks are classified into six classes (that is C = 6)

Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

$$CA=\sum_{1\le c\le C}P_c\cdot CA_c$$  (84)

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
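As a numerical check on Eq. (84), plugging the test-set class counts and the per-class accuracies of Table 32(d) into the weighted sum reproduces the 84.64% reported in Table 31 for the combined row-based feature vector:

```python
# Test-set class counts and per-class accuracies (from Table 3.2(d)).
counts = {"Classical": 320, "Electronic": 114, "JazzBlues": 26,
          "MetalPunk": 45, "RockPop": 102, "World": 122}
per_class_acc = {"Classical": 0.9375, "Electronic": 0.8421, "JazzBlues": 0.8077,
                 "MetalPunk": 0.7556, "RockPop": 0.7843, "World": 0.7049}
total = sum(counts.values())                       # 729 test tracks
ca = sum(counts[c] / total * per_class_acc[c] for c in counts)
print(f"{ca:.4f}")  # 0.8464
```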

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1

denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and that the combined feature vector performs the best. Table 32 shows the corresponding

confusion matrices

Table 31 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector

Feature Set  CA (%)
SMMFCC1  77.50
SMOSC1  79.15
SMASE1  77.78
SMMFCC1+SMOSC1+SMASE1  84.64


Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6

Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14

World 33 8 1 0 4 75 Total 320 114 26 45 102 122

(a) (in %) Classic Electronic Jazz MetalPunk PopRock World

Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492

Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148

World 1031 702 385 000 392 6148

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11

Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8

World 23 6 2 0 6 84 Total 320 114 26 45 102 122

(b) (in %) Classic Electronic Jazz MetalPunk PopRock World

Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902

Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656

World 719 526 769 000 588 6885


(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5

Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13

World 28 6 3 1 4 73 Total 320 114 26 45 102 122

(c) (in %) Classic Electronic Jazz MetalPunk PopRock World

Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410

Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066

World 875 526 1154 222 392 5984

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9

Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16

World 17 7 1 1 5 86 Total 320 114 26 45 102 122

(d) (in %) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738

Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311

World 531 614 385 222 490 7049


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2

denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 33 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector again gives the best performance. Table 34 shows the corresponding confusion

matrices

Table 33 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector

Feature Set  CA (%)
SMMFCC2  70.64
SMOSC2  68.59
SMASE2  71.74
SMMFCC2+SMOSC2+SMASE2  78.60

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4

Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19

World 33 10 3 0 9 54 Total 320 114 26 45 102 122

(a) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 8500 088 385 000 588 1803

Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557

MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557

World 1031 877 1154 000 882 4426

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6

Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10

World 40 6 2 1 12 51 Total 320 114 26 45 102 122

(b) (in %) Classic Electronic Jazz MetalPunk PopRock World

Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492

Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820

World 1250 526 769 222 1176 4180

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2

Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15

World 31 10 7 0 12 54 Total 320 114 26 45 102 122

(c) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 8656 000 000 000 196 2377

Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230

MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230

World 969 877 2692 000 1176 4426

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4

Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11

World 27 3 2 1 12 77 Total 320 114 26 45 102 122

(d) (in %) Classic Electronic Jazz MetalPunk PopRock World

Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328

Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902

World 844 263 769 222 1176 6311

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of

row-based and column-based modulation spectral feature vectors SMMFCC3

SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC

OSC and NASE. Comparing this table with Tables 31 and 33, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(a) (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(b) (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(c) (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(d) (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) obtained with MSCs & MSVs versus modulation subband energy (MSE) as the feature values

Feature Set                  MSCs & MSVs    MSE
SMMFCC1                            77.50  72.02
SMMFCC2                            70.64  69.82
SMMFCC3                            80.38  79.15
SMOSC1                             79.15  77.50
SMOSC2                             68.59  70.51
SMOSC3                             81.34  80.11
SMASE1                             77.78  76.41
SMASE2                             71.74  71.06
SMASE3                             81.21  79.15
SMMFCC1+SMOSC1+SMASE1              84.64  85.08
SMMFCC2+SMOSC2+SMASE2              78.60  79.01
SMMFCC3+SMOSC3+SMASE3              85.32  85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, issue 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, issue 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun., vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, issue 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.


top-down structure. This is because it is easy to merge leaf classes into the same parent class in the bottom-up structure, so the upper layers can be easily constructed. In their experimental results, the hierarchical bottom-up approach outperforms the top-down approach in classification accuracy by about 3%-5%.

Li and Ogihara [15] investigated the effect of two different taxonomy structures for music genre classification. They also proposed an approach to the automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach can reduce the time-consuming and expensive task of manually constructing taxonomies. It also helps in searching music collections for which there are no natural taxonomies [16]. For a given genre taxonomy, many different approaches have been proposed to classify the genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.

Fig. 1.1 A hierarchical audio taxonomy


Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system


1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentation or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows:

(1) Low-energy feature: it is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

$$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\big|\,\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\,\big|$$

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
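As a concrete illustration, this definition can be computed in a few lines (a minimal numpy sketch, not code from the thesis, using the text's 0/1 sign convention):

```python
import numpy as np

def zero_crossing_rate(frame):
    # Text's convention: sign() is 1 for positive samples, 0 otherwise
    s = (frame > 0).astype(int)
    return 0.5 * np.sum(np.abs(np.diff(s)))

t = np.arange(1000)
frame = np.sin(2 * np.pi * 4 * t / 1000)   # 4 periods -> 8 sign changes
zcr = zero_crossing_rate(frame)            # 0.5 * 8 = 4.0 under this convention
```

Note that with the 0/1 convention each crossing contributes 1 to the sum, so the leading 1/2 halves the raw crossing count.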

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

$$C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth measures the spread of the spectrum around the centroid:

$$SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

$$\sum_{k=0}^{R_t} S[k] = 0.85 \times \sum_{k=0}^{N-1} S[k]$$

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

$$SF_t = \sum_{k=0}^{N-1} \big(N_t[k] - N_{t-1}[k]\big)^2$$

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
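The four spectral descriptors above can be sketched together (a numpy illustration; `spectral_descriptors` is a hypothetical helper, with frequency bins 1-indexed as in the formulas):

```python
import numpy as np

def spectral_descriptors(mag_prev, mag):
    # Centroid: center of gravity of the magnitude spectrum
    n = np.arange(1, len(mag) + 1)
    centroid = np.sum(n * mag) / np.sum(mag)
    # Bandwidth: magnitude-weighted squared spread around the centroid
    bandwidth = np.sum((n - centroid) ** 2 * mag) / np.sum(mag)
    # Roll-off: first bin below which 85% of the magnitude lies
    cumulative = np.cumsum(mag)
    rolloff = int(np.searchsorted(cumulative, 0.85 * cumulative[-1])) + 1
    # Flux: squared difference of sum-normalized successive spectra
    p, q = mag_prev / np.sum(mag_prev), mag / np.sum(mag)
    flux = np.sum((q - p) ** 2)
    return centroid, bandwidth, rolloff, flux

mag = np.zeros(8)
mag[3] = 1.0            # all energy in (1-indexed) bin 4
c, b, r, f = spectral_descriptors(mag, mag)
# centroid sits at bin 4; bandwidth and flux vanish for an unchanged spectrum
```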

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone. The mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the Root Mean Square (RMS) energy, yielding a normalized version of the ASE called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most

melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find the representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include mean and standard deviation, autoregressive modeling [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most commonly used method for integrating short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1$$

$$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\big(x_i[d]-\mu[d]\big)^2\right]^{1/2}, \quad 0 \le d \le D-1$$

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
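This aggregation can be illustrated in a few lines (a numpy sketch; the frame features here are random stand-ins for per-frame MFCC/OSC/NASE vectors):

```python
import numpy as np

# Toy frame-level features: T frames, each a D-dimensional vector
T, D = 100, 20
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))

mu = frames.mean(axis=0)       # mean of each feature dimension
sigma = frames.std(axis=0)     # population standard deviation, as in the text
song_vector = np.concatenate([mu, sigma])   # 2D-dimensional long-term feature
```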

1.2.1.2.2 Autoregressive model (AR model)

Meng et al [9] used AR model to analyze the time-varying texture of music

signals They proposed the diagonal autoregressive (DAR) and multivariate

autoregressive (MAR) analysis to integrate the short-term features In DAR each

short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the

coefficients of each AR model In MAR all short-term features are modeled by a

MAR model The difference between MAR model and AR model is that MAR

considers the relationship between features The features used in MAR include the

mean vector the covariance matrix of all shorter-term feature vectors and the

coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature

vector
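To make the DAR idea concrete, here is a small numpy sketch (an assumption-level illustration, not the exact method of [9]): each feature trajectory is fitted by a least-squares AR(p) model, and the coefficients become long-term features.

```python
import numpy as np

def fit_ar(x, p):
    # Regress x[t] on its p previous values (ordinary least squares)
    X = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    y = x[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

x = 0.9 ** np.arange(50)     # an exact AR(1) trajectory, x[t] = 0.9 x[t-1]
a = fit_ar(x, 1)             # recovers the coefficient 0.9
```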

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of

signals along the time axis Kingsbury et al [24] first employed modulation

spectrogram for speech recognition It has been shown that the most sensitive

modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used

modulation spectrum analysis for music content identification They showed that

modulation-scale features along with subband normalization are insensitive to

convolutional noise Shi et al [26] used modulation spectrum analysis to model the

long-term characteristics of music signals in order to extract the tempo feature for

music emotion classification
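The core idea can be sketched as follows (a numpy illustration; the 100 frames-per-second rate is an assumption for the example): applying the FFT along the frame axis of a feature trajectory reveals how fast that feature oscillates over time.

```python
import numpy as np

frame_rate = 100.0                           # frames per second (assumed)
t = np.arange(512) / frame_rate
trajectory = np.sin(2 * np.pi * 4.0 * t)     # one feature value, modulated at 4 Hz

# FFT along the frame (time) axis gives the modulation spectrum
spectrum = np.abs(np.fft.rfft(trajectory))
freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / frame_rate)
peak_hz = freqs[np.argmax(spectrum)]         # close to the 4 Hz modulation rate
```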

1.2.1.2.4 Nonlinear time series analysis

Non-linear analysis of time series offers an alternative way to describe temporal

structure which is complementary to the analysis of linear correlation and spectral

properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The mean and standard deviations of the

distances and angles in the phase space with an embedding dimension of two and unit

time lag were used

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal

transformation matrix from an n-dimensional feature space to d-dimensional space is

determined where d le n The transformation should enhance the separability among

different classes The optimal transformation matrix can be exploited to map each

n-dimensional feature vector into a d-dimensional vector The detailed steps will be

described in Chapter 2

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, music signals are too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all classes, which does not consider class-wise differences.
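For intuition, the two-class Fisher discriminant direction can be computed directly (a numpy sketch; the thesis itself applies the multi-class formulation detailed in Chapter 2):

```python
import numpy as np

def lda_direction(X0, X1):
    # Fisher discriminant: w ~ inv(Sw) (mu1 - mu0)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))   # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0.0, 0.0], size=(200, 2))
X1 = rng.normal(loc=[3.0, 0.0], size=(200, 2))
w = lda_direction(X0, X1)    # points (almost) along the axis separating classes
```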

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch

features with GMM classifier to their music genre classification system The

hierarchical genres adopted in their music classification system are Classical Country

Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres contain Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres contain BigBand, Cool, Fusion, Piano, Quartet, and Swing. The

experiment result shows that GMM with three components achieves the best

classification accuracy

West and Cox [4] constructed a hierarchical frame-based music genre classification system, in which a majority vote is taken to decide the final classification. The genres adopted in their music classification system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a Gaussian classifier, a GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and decision tree achieves the best accuracy of 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters

according to the calculated features It is demonstrated that SVM achieves better

performance than traditional Euclidean distance methods and hidden Markov model

(HMM) methods

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system, in which invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a

GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which cannot be correctly classified, and each GMM model of

a music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock.

The classification accuracy can reach up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al [31] used local discriminant bases (LDB) technique to measure

the dissimilarity of the LDB nodes of any two classes and extract features from these

high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system,

the feature dimension is 30 including the energies and variances of the basis vector

coefficients of the first 15 high dissimilarity nodes The experiment results show that

when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile, human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al [11 32] used a set of features based on discrete wavelet packet

transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a

well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary

nature of the input signal the DWT provides an approximation with excellent time

and frequency resolution WPT is a variant of DWT which is achieved by recursively

convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT, which recursively decomposes only the low-pass subband, WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an

ensemble (or meta-learning) method that constructs a classifier in an iterative fashion

[34] It was originally designed for binary classification and was later extended to

multiclass classification using several different strategies

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, the conclusion is given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation,

and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral

(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is

proposed for music genre classification

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to

represent the speech spectrum in a compact form In fact MFCC have been proven to

be very effective in automatic speech recognition and in modeling the subjective

frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

$$\hat{s}[n] = s[n] - a \times s[n-1] \quad (1)$$

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames is overlapped by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

$$\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n \le N-1 \quad (2)$$

where the Hamming window function w[n] is defined as

$$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \quad (3)$$


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1 \quad (4)$$

where k is the frequency index.

Step 5: Mel-scale Band-pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B \quad (5)$$

where B is the total number of filters (B = 25 in this study), and I_b^l and I_b^h denote respectively the low-frequency index and the high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

$$I_b^l = \frac{f_b^l}{f_s / N}, \qquad I_b^h = \frac{f_b^h}{f_s / N} \quad (6)$$

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT to the logarithm of E(b):

$$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\big(1 + E_i(b)\big)\cos\left(\frac{\pi l (b + 0.5)}{B}\right), \quad 0 \le l < L \quad (7)$$

where L is the length of the MFCC feature vector (L = 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

$$\mathbf{x}_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T \quad (8)$$
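The six steps can be sketched end-to-end for a single frame (a numpy illustration, not the thesis implementation; the 25 equal-width band edges below are toy values standing in for the Mel-scale filters of Table 2.1):

```python
import numpy as np

def mfcc_frame(frame, band_edges, L=20, a=0.95):
    # Step 1: pre-emphasis, Eq. (1)
    emph = np.append(frame[0], frame[1:] - a * frame[:-1])
    # Step 3: Hamming window, Eqs. (2)-(3)
    N = len(emph)
    n = np.arange(N)
    windowed = emph * (0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1)))
    # Step 4: FFT and squared amplitude, Eq. (4)
    A = np.abs(np.fft.fft(windowed)) ** 2
    # Step 5: subband energies E(b), Eq. (5)
    E = np.array([A[lo:hi + 1].sum() for lo, hi in band_edges])
    # Step 6: DCT of log-energies, Eq. (7)
    B = len(E)
    b = np.arange(B)
    return np.array([np.sum(np.log10(1 + E) * np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])

rng = np.random.default_rng(0)
frame = rng.normal(size=512)
edges = [(i * 16, i * 16 + 15) for i in range(25)]   # toy equal-width bands
coeffs = mfcc_frame(frame, edges)                    # 20-dimensional MFCC vector
```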

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number  Frequency interval (Hz)
 0             (0, 200]
 1             (100, 300]
 2             (200, 400]
 3             (300, 500]
 4             (400, 600]
 5             (500, 700]
 6             (600, 800]
 7             (700, 900]
 8             (800, 1000]
 9             (900, 1149]
10             (1000, 1320]
11             (1149, 1516]
12             (1320, 1741]
13             (1516, 2000]
14             (1741, 2297]
15             (2000, 2639]
16             (2297, 3031]
17             (2639, 3482]
18             (3031, 4000]
19             (3482, 4595]
20             (4000, 5278]
21             (4595, 6063]
22             (5278, 6964]
23             (6063, 8000]
24             (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B \quad (9)$$

where B is the number of subbands, and I_b^l and I_b^h denote respectively the low-frequency index and the high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

$$I_b^l = \frac{f_b^l}{f_s / N}, \qquad I_b^h = \frac{f_b^h}{f_s / N} \quad (10)$$

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, ..., M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ ... ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$$Peak(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right) \quad (11)$$

$$Valley(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right) \quad (12)$$

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$$SC(b) = Peak(b) - Valley(b) \quad (13)$$

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$$\mathbf{x}_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T \quad (14)$$
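Eqs. (11)-(13) for one subband can be sketched as follows (a numpy illustration with made-up magnitudes, not the thesis implementation):

```python
import numpy as np

def osc_subband(mags, alpha=0.2):
    # Average the alpha*Nb strongest / weakest bins, then take logs
    m = np.sort(mags)[::-1]                  # magnitudes in decreasing order
    k = max(1, int(round(alpha * len(m))))
    peak = np.log(m[:k].mean())              # Eq. (11)
    valley = np.log(m[-k:].mean())           # Eq. (12)
    return peak, valley, peak - valley       # Eq. (13)

mags = np.array([10.0, 9.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01])
peak, valley, contrast = osc_subband(mags)
# peak averages the two strongest bins (10, 9); valley the two weakest
```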

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number  Frequency interval (Hz)
0              [0, 0]
1              (0, 100]
2              (100, 200]
3              (200, 400]
4              (400, 800]
5              (800, 1600]
6              (1600, 3200]
7              (3200, 6400]
8              (6400, 12800]
9              (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows:

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted $X(k)$, $1 \le k \le N$, where N is the FFT size. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum $X(k)$:

$$P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0,\ N/2 \\[4pt] \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases} \tag{15}$$


where $E_w$ is the energy of the Hamming window function $w(n)$ of size $N_w$:

$$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2 \tag{16}$$

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning 8 octaves between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") (see Fig. 2.4). The subband filtering operation can be described as follows (see Table 2.3):

$$\mathrm{ASE}(b) = \sum_{k=I_b^l}^{I_b^h} P(k), \qquad 0 \le b < B,\ 0 \le k \le N/2 - 1 \tag{17}$$

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$$r = 2^j \ \text{octaves}, \qquad -4 \le j \le 3 \tag{18}$$

$I_b^l$ and $I_b^h$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$$I_b^l = \left\lfloor \frac{f_b^l}{f_s} N \right\rfloor, \qquad I_b^h = \left\lfloor \frac{f_b^h}{f_s} N \right\rfloor \tag{19}$$

where $f_s$ is the sampling frequency, and $f_b^l$ and $f_b^h$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$$\mathrm{ASE}(b) = \sum_{k=I_b^l}^{I_b^h} P(k), \qquad 0 \le b \le B + 1 \tag{20}$$

Each ASE coefficient is then converted to the decibel scale:

$$\mathrm{ASE}_{dB}(b) = 10 \log_{10}(\mathrm{ASE}(b)), \qquad 0 \le b \le B + 1 \tag{21}$$

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$$\mathrm{NASE}(b) = \frac{\mathrm{ASE}_{dB}(b)}{R}, \qquad 0 \le b \le B + 1 \tag{22}$$

where the RMS-norm gain value R is defined as

$$R = \sqrt{\sum_{b=0}^{B+1} \left( \mathrm{ASE}_{dB}(b) \right)^2} \tag{23}$$

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension of NASE is B + 3, and the NASE feature vector of an audio frame is represented as follows:

$$\mathbf{x}_{NASE} = [R, \mathrm{NASE}(0), \mathrm{NASE}(1), \dots, \mathrm{NASE}(B+1)]^T \tag{24}$$
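The decibel conversion and RMS normalization above can be sketched as follows; `nase_features` is a hypothetical helper name, and the input is assumed to be the vector of B+2 ASE coefficients (below-loEdge band, B in-band coefficients, above-hiEdge band), with a small floor guarding the logarithm.

```python
import numpy as np

def nase_features(ase):
    """Illustrative NASE normalization (Eqs. 21-24) for a vector of B+2
    ASE coefficients (below-loEdge, B in-band, above-hiEdge)."""
    ase_db = 10.0 * np.log10(np.maximum(ase, 1e-12))  # decibel scale (Eq. 21)
    R = np.sqrt(np.sum(ase_db ** 2))                  # RMS-norm gain value (Eq. 23)
    nase = ase_db / R                                 # Eq. 22
    return np.concatenate([[R], nase])                # x_NASE, dimension B+3 (Eq. 24)
```

By construction the normalized coefficients have unit RMS norm, so the gain value R carries the overall loudness information separately.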


Fig. 2.3 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: one coefficient below loEdge (62.5 Hz), 16 in-band coefficients, and one coefficient above hiEdge (16 kHz)


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only the short-term frame-based characteristics of audio signals. To capture the time-varying behavior of music signals, we apply modulation spectral analysis to the MFCC, OSC, and NASE features to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC; the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2 Modulation Spectrum Analysis


Let $\mathrm{MFCC}_i[l]$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along its time trajectory within a texture window of length W:

$$M_t(m, l) = \left| \sum_{n=0}^{W-1} \mathrm{MFCC}_{t \frac{W}{2} + n}[l]\, e^{-j \frac{2\pi n m}{W}} \right|, \qquad 0 \le m < W,\ 0 \le l < L \tag{25}$$

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, l), \qquad 0 \le m < W,\ 0 \le l < L \tag{26}$$

where T is the total number of texture windows in the music track
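The two steps above can be sketched as follows; `modulation_spectrogram` is a hypothetical helper that assumes the per-frame feature values (MFCC, OSC, or NASE) are already stacked into an (I, L) matrix.

```python
import numpy as np

def modulation_spectrogram(feats, W=512):
    """Illustrative averaged modulation spectrogram (Eqs. 25-26).
    `feats` is an (I, L) array of per-frame feature values; texture
    windows of length W overlap by 50%."""
    I = feats.shape[0]
    hop = W // 2
    mags = []
    for start in range(0, I - W + 1, hop):
        seg = feats[start:start + W]                  # one texture window
        mags.append(np.abs(np.fft.fft(seg, axis=0)))  # |M_t(m, l)|
    return np.mean(mags, axis=0)                      # time average over T windows
```

Taking the FFT along the frame axis (rather than within a frame) is what turns each feature trajectory into a modulation-frequency representation.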

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$\mathrm{MSP}^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) \tag{27}$$

$$\mathrm{MSV}^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) \tag{28}$$

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0 \le j < J$.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$$\mathrm{MSC}^{MFCC}(j, l) = \mathrm{MSP}^{MFCC}(j, l) - \mathrm{MSV}^{MFCC}(j, l) \tag{29}$$

As a result, all the MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
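The subband peak/valley search can be sketched as follows; `msc_msv` is a hypothetical helper, and the octave-like index ranges follow Table 2.4 (subband 0 taken as bins [0, 2)).

```python
import numpy as np

def msc_msv(M_avg, J=8):
    """Illustrative contrast/valley determination (Eqs. 27-29).
    `M_avg` is the (W, L) averaged modulation spectrogram; modulation
    bins are grouped into the J octave subbands of Table 2.4."""
    L = M_avg.shape[1]
    msc = np.empty((J, L))
    msv = np.empty((J, L))
    lo = 0
    for j in range(J):
        hi = 2 ** (j + 1)                 # Phi_j^h: [0,2), [2,4), ..., [128,256)
        band = M_avg[lo:hi]
        msp = band.max(axis=0)            # modulation spectral peak (Eq. 27)
        msv[j] = band.min(axis=0)         # modulation spectral valley (Eq. 28)
        msc[j] = msp - msv[j]             # modulation spectral contrast (Eq. 29)
        lo = hi
    return msc, msv
```

Because the peak is a maximum and the valley a minimum over the same bins, every contrast value is non-negative.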

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC; the detailed steps are described below.


Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let $\mathrm{OSC}_i[d]$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along its time trajectory within a texture window of length W:

$$M_t(m, d) = \left| \sum_{n=0}^{W-1} \mathrm{OSC}_{t \frac{W}{2} + n}[d]\, e^{-j \frac{2\pi n m}{W}} \right|, \qquad 0 \le m < W,\ 0 \le d < D \tag{30}$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \qquad 0 \le m < W,\ 0 \le d < D \tag{31}$$

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:


$$\mathrm{MSP}^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) \tag{32}$$

$$\mathrm{MSV}^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) \tag{33}$$

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0 \le j < J$.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$$\mathrm{MSC}^{OSC}(j, d) = \mathrm{MSP}^{OSC}(j, d) - \mathrm{MSV}^{OSC}(j, d) \tag{34}$$

As a result, all the MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE; the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let $\mathrm{NASE}_i[d]$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along its time trajectory within a texture window of length W:

$$M_t(m, d) = \left| \sum_{n=0}^{W-1} \mathrm{NASE}_{t \frac{W}{2} + n}[d]\, e^{-j \frac{2\pi n m}{W}} \right|, \qquad 0 \le m < W,\ 0 \le d < D \tag{35}$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \qquad 0 \le m < W,\ 0 \le d < D \tag{36}$$

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$\mathrm{MSP}^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d) \tag{37}$$

$$\mathrm{MSV}^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d) \tag{38}$$

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0 \le j < J$.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$$\mathrm{MSC}^{NASE}(j, d) = \mathrm{MSP}^{NASE}(j, d) - \mathrm{MSV}^{NASE}(j, d) \tag{39}$$

As a result, all the MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE (Music signal → Framing → NASE extraction → Windowing → DFT → Modulation Spectrum Averaging → Contrast/Valley Determination → MASE)

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th ($0 \le l < L$) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$u_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{MFCC}(j, l) \tag{40}$$

$$\sigma_{MSC\text{-}row}^{MFCC}(l) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSC}^{MFCC}(j, l) - u_{MSC\text{-}row}^{MFCC}(l) \right)^2 } \tag{41}$$

$$u_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{MFCC}(j, l) \tag{42}$$

$$\sigma_{MSV\text{-}row}^{MFCC}(l) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSV}^{MFCC}(j, l) - u_{MSV\text{-}row}^{MFCC}(l) \right)^2 } \tag{43}$$

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$\mathbf{f}_{row}^{MFCC} = \left[ u_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), u_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), \dots, u_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), u_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1) \right]^T \tag{44}$$

Similarly, the modulation spectral feature values derived from the j-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$u_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} \mathrm{MSC}^{MFCC}(j, l) \tag{45}$$

$$\sigma_{MSC\text{-}col}^{MFCC}(j) = \sqrt{ \frac{1}{L} \sum_{l=0}^{L-1} \left( \mathrm{MSC}^{MFCC}(j, l) - u_{MSC\text{-}col}^{MFCC}(j) \right)^2 } \tag{46}$$

$$u_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} \mathrm{MSV}^{MFCC}(j, l) \tag{47}$$

$$\sigma_{MSV\text{-}col}^{MFCC}(j) = \sqrt{ \frac{1}{L} \sum_{l=0}^{L-1} \left( \mathrm{MSV}^{MFCC}(j, l) - u_{MSV\text{-}col}^{MFCC}(j) \right)^2 } \tag{48}$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{MFCC} = \left[ u_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), u_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), \dots, u_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), u_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1) \right]^T \tag{49}$$

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L + 4J) can be obtained:

$$\mathbf{f}^{MFCC} = \left[ (\mathbf{f}_{row}^{MFCC})^T,\ (\mathbf{f}_{col}^{MFCC})^T \right]^T \tag{50}$$

In summary, the row-based MSC and MSV statistics are of size 4L = 4×20 = 80 and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L + 4J; that is, the overall feature dimension of SMMFCC is 80 + 32 = 112.
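The aggregation can be sketched as follows; `aggregate` is a hypothetical helper, and the ordering of the concatenated statistics is illustrative (the equations above interleave the per-index values).

```python
import numpy as np

def aggregate(msc, msv):
    """Illustrative statistical aggregation (Eqs. 40-50): means and
    standard deviations along the rows and columns of the J x L MSC and
    MSV matrices, giving a 4L + 4J dimensional vector."""
    def stats(mat, axis):
        return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])
    f_row = np.concatenate([stats(msc, 0), stats(msv, 0)])  # 4L values
    f_col = np.concatenate([stats(msc, 1), stats(msv, 1)])  # 4J values
    return np.concatenate([f_row, f_col])                   # Eq. 50
```

With L = 20 and J = 8 this produces the 112-dimensional SMMFCC vector described above.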

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th ($0 \le d < D$) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$u_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{OSC}(j, d) \tag{51}$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSC}^{OSC}(j, d) - u_{MSC\text{-}row}^{OSC}(d) \right)^2 } \tag{52}$$

$$u_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{OSC}(j, d) \tag{53}$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSV}^{OSC}(j, d) - u_{MSV\text{-}row}^{OSC}(d) \right)^2 } \tag{54}$$


Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{OSC} = \left[ u_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), u_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), \dots, u_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), u_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1) \right]^T \tag{55}$$

Similarly, the modulation spectral feature values derived from the j-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$u_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSC}^{OSC}(j, d) \tag{56}$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \sqrt{ \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSC}^{OSC}(j, d) - u_{MSC\text{-}col}^{OSC}(j) \right)^2 } \tag{57}$$

$$u_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSV}^{OSC}(j, d) \tag{58}$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \sqrt{ \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSV}^{OSC}(j, d) - u_{MSV\text{-}col}^{OSC}(j) \right)^2 } \tag{59}$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = \left[ u_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), u_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), \dots, u_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), u_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1) \right]^T \tag{60}$$

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) can be obtained:

$$\mathbf{f}^{OSC} = \left[ (\mathbf{f}_{row}^{OSC})^T,\ (\mathbf{f}_{col}^{OSC})^T \right]^T \tag{61}$$

In summary, the row-based MSC and MSV statistics are of size 4D = 4×20 = 80 and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th ($0 \le d < D$) row of the MSC and MSV matrices of MASE can be computed as follows:

$$u_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{NASE}(j, d) \tag{62}$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSC}^{NASE}(j, d) - u_{MSC\text{-}row}^{NASE}(d) \right)^2 } \tag{63}$$

$$u_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{NASE}(j, d) \tag{64}$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \sqrt{ \frac{1}{J} \sum_{j=0}^{J-1} \left( \mathrm{MSV}^{NASE}(j, d) - u_{MSV\text{-}row}^{NASE}(d) \right)^2 } \tag{65}$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{NASE} = \left[ u_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), u_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), \dots, u_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), u_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1) \right]^T \tag{66}$$

Similarly, the modulation spectral feature values derived from the j-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$u_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSC}^{NASE}(j, d) \tag{67}$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \sqrt{ \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSC}^{NASE}(j, d) - u_{MSC\text{-}col}^{NASE}(j) \right)^2 } \tag{68}$$

$$u_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSV}^{NASE}(j, d) \tag{69}$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \sqrt{ \frac{1}{D} \sum_{d=0}^{D-1} \left( \mathrm{MSV}^{NASE}(j, d) - u_{MSV\text{-}col}^{NASE}(j) \right)^2 } \tag{70}$$


Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = \left[ u_{MSC\text{-}col}^{NASE}(0), \sigma_{MSC\text{-}col}^{NASE}(0), u_{MSV\text{-}col}^{NASE}(0), \sigma_{MSV\text{-}col}^{NASE}(0), \dots, u_{MSC\text{-}col}^{NASE}(J-1), \sigma_{MSC\text{-}col}^{NASE}(J-1), u_{MSV\text{-}col}^{NASE}(J-1), \sigma_{MSV\text{-}col}^{NASE}(J-1) \right]^T \tag{71}$$

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) can be obtained:

$$\mathbf{f}^{NASE} = \left[ (\mathbf{f}_{row}^{NASE})^T,\ (\mathbf{f}_{col}^{NASE})^T \right]^T \tag{72}$$

In summary, the row-based MSC and MSV statistics are of size 4D = 4×19 = 76 and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMASE is 76 + 32 = 108.


Fig. 2.8 The row-based modulation spectral features: the mean μ and standard deviation σ are computed along each row of the MSC/MSV matrices, over the modulation frequency axis

Fig. 2.9 The column-based modulation spectral features: the mean μ and standard deviation σ are computed along each column of the MSC/MSV matrices, over the feature dimension axis


2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$$\bar{\mathbf{f}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{f}_{c,n} \tag{73}$$

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector $\hat{\mathbf{f}}_c$:

$$\hat{f}_c(m) = \frac{f_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \qquad 1 \le c \le C \tag{74}$$

where C is the number of classes, $f_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{\max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m) \tag{75}$$

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let $\mathbf{S}_W$ and $\mathbf{S}_B$ denote the within-class and between-class scatter matrices, respectively. The within-class scatter matrix is defined as

$$\mathbf{S}_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T \tag{76}$$

where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T \tag{77}$$

where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(\mathbf{A}) = \mathrm{tr}\!\left( (\mathbf{A}^T \mathbf{S}_W \mathbf{A})^{-1} (\mathbf{A}^T \mathbf{S}_B \mathbf{A}) \right) \tag{78}$$

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space. In this study, a whitening procedure is integrated with the LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $\mathbf{S}_W$ are calculated. Let $\boldsymbol{\Phi}$ denote the matrix whose columns are the


orthonormal eigenvectors of $\mathbf{S}_W$, and $\boldsymbol{\Lambda}$ the diagonal matrix formed by the corresponding eigenvalues; thus $\mathbf{S}_W \boldsymbol{\Phi} = \boldsymbol{\Phi}\boldsymbol{\Lambda}$. Each training vector x is then whitening-transformed by $\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}$:

$$\mathbf{x}_w = (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2})^T \mathbf{x} \tag{79}$$

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_W^w = (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2})^T \mathbf{S}_W (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix $\mathbf{S}_B^w = (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2})^T \mathbf{S}_B (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix $\boldsymbol{\Psi}$ can be determined by finding the eigenvectors of $\mathbf{S}_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix $\boldsymbol{\Psi}$. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$$\mathbf{A}_{WLDA} = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2} \boldsymbol{\Psi} \tag{80}$$

$\mathbf{A}_{WLDA}$ is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced h-dimensional feature vector is computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^T \mathbf{x} \tag{81}$$
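The whitened LDA transform described above can be sketched with NumPy as follows; `whitened_lda` is a hypothetical helper, and the small floor on the eigenvalues (guarding the inverse square root) is an added numerical safeguard.

```python
import numpy as np

def whitened_lda(X, y, n_components):
    """Illustrative whitened LDA (Eqs. 76-81). X is (N, H), y holds the
    class labels; returns A_WLDA with n_components <= C - 1 columns."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                           # Eq. 76
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)  # Eq. 77
    lam, Phi = np.linalg.eigh(Sw)                      # Sw = Phi Lambda Phi^T
    W = Phi @ np.diag(np.maximum(lam, 1e-10) ** -0.5)  # whitening Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                                # whitened between-class scatter
    _, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, ::-1][:, :n_components]               # largest-eigenvalue eigenvectors
    return W @ Psi                                     # A_WLDA (Eq. 80)
```

Projecting a feature vector with `X @ whitened_lda(...)` then corresponds to Eq. 81.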

2.3 Music Genre Classification Phase

In the classification phase, the row-based and column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let y denote the whitened LDA


transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$$\bar{\mathbf{y}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{y}_{c,n} \tag{82}$$

where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector with the minimum Euclidean distance to y:

$$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \tag{83}$$

Chapter 3

Experimental Results

The music database employed in the ISMIR 2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.

Since the music tracks are not equally distributed among the classes, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \tag{84}$$

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
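Eq. 84 amounts to a prior-weighted average of the per-class accuracies, as the short sketch below illustrates; `overall_accuracy` is a hypothetical helper.

```python
import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    """Illustrative weighted overall accuracy (Eq. 84): per-class
    accuracies weighted by the class priors P_c."""
    p = np.asarray(class_counts, dtype=float)
    p /= p.sum()                                  # P_c
    return float(np.sum(p * np.asarray(per_class_acc)))
```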

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC1                        77.50
SMOSC1                         79.15
SMASE1                         77.78
SMMFCC1+SMOSC1+SMASE1          84.64


Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix gives counts and the second gives percentages; columns correspond to the true genres.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic       275        0        2        0          1        19
Electronic      0       91        0        1          7         6
Jazz            6        0       18        0          0         4
Metal/Punk      2        3        0       36         20         4
Pop/Rock        4       12        5        8         70        14
World          33        8        1        0          4        75
Total         320      114       26       45        102       122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic      85.94     0.00      7.69     0.00       0.98     15.57
Electronic    0.00    79.82      0.00     2.22       6.86      4.92
Jazz          1.88     0.00     69.23     0.00       0.00      3.28
Metal/Punk    0.63     2.63      0.00    80.00      19.61      3.28
Pop/Rock      1.25    10.53     19.23    17.78      68.63     11.48
World        10.31     7.02      3.85     0.00       3.92     61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic       292        1        1        0          2        10
Electronic      1       89        1        2         11        11
Jazz            4        0       19        1          1         6
Metal/Punk      0        5        0       32         21         3
Pop/Rock        0       13        3       10         61         8
World          23        6        2        0          6        84
Total         320      114       26       45        102       122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic      91.25     0.88      3.85     0.00       1.96      8.20
Electronic    0.31    78.07      3.85     4.44      10.78      9.02
Jazz          1.25     0.00     73.08     2.22       0.98      4.92
Metal/Punk    0.00     4.39      0.00    71.11      20.59      2.46
Pop/Rock      0.00    11.40     11.54    22.22      59.80      6.56
World         7.19     5.26      7.69     0.00       5.88     68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic       286        3        1        0          3        18
Electronic      0       87        1        1          9         5
Jazz            5        4       17        0          0         9
Metal/Punk      0        4        1       36         18         4
Pop/Rock        1       10        3        7         68        13
World          28        6        3        1          4        73
Total         320      114       26       45        102       122

(c) SMASE1 (%)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic      89.38     2.63      3.85     0.00       2.94     14.75
Electronic    0.00    76.32      3.85     2.22       8.82      4.10
Jazz          1.56     3.51     65.38     0.00       0.00      7.38
Metal/Punk    0.00     3.51      3.85    80.00      17.65      3.28
Pop/Rock      0.31     8.77     11.54    15.56      66.67     10.66
World         8.75     5.26     11.54     2.22       3.92     59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic       300        0        1        0          0         9
Electronic      0       96        1        1          9         9
Jazz            2        1       21        0          0         1
Metal/Punk      0        1        0       34          8         1
Pop/Rock        1        9        2        9         80        16
World          17        7        1        1          5        86
Total         320      114       26       45        102       122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic      93.75     0.00      3.85     0.00       0.00      7.38
Electronic    0.00    84.21      3.85     2.22       8.82      7.38
Jazz          0.63     0.88     80.77     0.00       0.00      0.82
Metal/Punk    0.00     0.88      0.00    75.56       7.84      0.82
Pop/Rock      0.31     7.89      7.69    20.00      78.43     13.11
World         5.31     6.14      3.85     2.22       4.90     70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector obtains the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                        70.64
SMOSC2                         68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2          78.60

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4

Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19

World 33 10 3 0 9 54 Total 320 114 26 45 102 122

46

(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803

Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557

MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557

World 1031 877 1154 000 882 4426

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6

Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10

World 40 6 2 1 12 51 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492

Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820

World 1250 526 769 222 1176 4180

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         277           0     0          0        2     29
Electronic        0          83     0          1        5      2
Jazz              9           3    17          1        2     15
MetalPunk         1           5     1         35       24      7
PopRock           2          13     1          8       57     15
World            31          10     7          0       12     54
Total           320         114    26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         289           5     0          0        3     18
Electronic        0          89     0          2        4      4
Jazz              2           3    19          0        1     10
MetalPunk         2           2     0         38       21      2
PopRock           0          12     5          4       61     11
World            27           3     2          1       12     77
Total           320         114    26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA%) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (rows: classified genre; columns: actual genre): (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57


(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39


(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
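As a rough illustration of the MSC/MSV idea, the sketch below computes one contrast and one valley value per modulation subband from a modulation spectrogram. The subband bin ranges, array shapes, and log1p compression are illustrative assumptions, not the thesis's exact configuration.

```python
import numpy as np

def msc_msv(mod_spectrogram, subband_edges):
    """Per modulation subband: log-compressed peak and valley of the
    modulation magnitudes of each feature dimension; MSC = peak - valley."""
    mscs, msvs = [], []
    for lo, hi in subband_edges:               # bin range of one modulation subband
        band = mod_spectrogram[:, lo:hi]
        peak = np.log1p(band.max(axis=1))      # strongest modulation component
        valley = np.log1p(band.min(axis=1))    # weakest modulation component
        mscs.append(peak - valley)             # modulation spectral contrast
        msvs.append(valley)                    # modulation spectral valley
    return np.array(mscs), np.array(msvs)

rng = np.random.default_rng(0)
spec = rng.random((4, 16))                     # 4 feature dims x 16 modulation bins
edges = [(0, 2), (2, 4), (4, 8), (8, 16)]      # log-spaced subbands (illustrative)
msc, msv = msc_msv(spec, edges)
```

Statistical aggregations of these per-subband values over all feature dimensions would then form the final feature vector.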


Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the subband energy (MSE) for each feature value

Feature Set                    MSCs & MSVs    MSE
SMMFCC1                        77.50          72.02
SMMFCC2                        70.64          69.82
SMMFCC3                        80.38          79.15
SMOSC1                        79.15          77.50
SMOSC2                        68.59          70.51
SMOSC3                        81.34          80.11
SMASE1                        77.78          76.41
SMASE2                        71.74          71.06
SMASE3                        81.21          79.15
SMMFCC1+SMOSC1+SMASE1          84.64          85.08
SMMFCC2+SMOSC2+SMASE2          78.60          79.01
SMMFCC3+SMOSC3+SMASE3          85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. Int. Conf. on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağcı and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Trans. on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th Int. Conf. on Digital Audio Effects, Sep. 2003, pp. 8-11.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, Jun. 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, Mar. 2005, pp. 197-200.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, Nov. 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, Sep. 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Trans. on Signal Processing, vol. 52, no. 10, pp. 3023-3035, Oct. 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), Jul. 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification. New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, May 2004, pp. V-665-668.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system


1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentations or sound sources, such as music, speech, or environment signals. The features used to represent timbral texture are described as follows:

(1) Low-Energy Feature: it is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\operatorname{sign}(x_t[n]) - \operatorname{sign}(x_t[n-1])\right|$

where the sign function returns 1 for positive input and 0 for negative input, and $x_t[n]$ is the time-domain signal of frame t.
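A minimal sketch of the ZCR formula above, mapping sign to 1 for non-negative and 0 for negative samples as in the text:

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR: half the summed absolute differences of the binarized sign
    sequence (sign -> 1 for x >= 0, 0 for x < 0, as in the text)."""
    s = (np.asarray(frame) >= 0).astype(int)
    return 0.5 * np.abs(np.diff(s)).sum()

frame = np.array([1.0, -1.0, 1.0, -1.0])   # alternates sign at every sample
zcr = zero_crossing_rate(frame)            # three sign changes -> 1.5
```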

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

$C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$

where N is the length of the short-time Fourier transform (STFT) and $M_t[n]$ is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth determines the frequency bandwidth of the signal:

$SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency $R_t$ below which 85% of the magnitude distribution is concentrated:

$\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]$

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

$SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2$

where $N_t[k]$ and $N_{t-1}[k]$ are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
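The four spectral-shape definitions above can be sketched together. The index convention (n starting at 1) and the 0.85 roll-off threshold follow the text; the helper name and return layout are ours:

```python
import numpy as np

def spectral_shape(mag, prev_norm=None, rolloff=0.85):
    """Centroid, bandwidth, roll-off and flux of one magnitude spectrum.

    mag: magnitude spectrum M_t[n]; array index 0 corresponds to n = 1.
    prev_norm: normalized magnitude spectrum of the previous frame (for flux).
    """
    n = np.arange(1, len(mag) + 1)
    centroid = (n * mag).sum() / mag.sum()
    bandwidth = (((n - centroid) ** 2) * mag).sum() / mag.sum()
    # roll-off: smallest bin whose cumulative energy reaches 85% of the total
    cum = np.cumsum(mag)
    r = int(np.searchsorted(cum, rolloff * cum[-1])) + 1
    cur_norm = mag / mag.sum()
    flux = None if prev_norm is None else ((cur_norm - prev_norm) ** 2).sum()
    return centroid, bandwidth, r, flux, cur_norm

mag = np.array([0.0, 0.0, 1.0, 0.0])            # all energy in bin n = 3
c, bw, r, _, norm1 = spectral_shape(mag)
```

With all the energy in one bin, the centroid sits on that bin (3), the bandwidth is 0, and the roll-off frequency is the same bin.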

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone. The mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each sub-band separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the Root Mean Square (RMS) energy, yielding a normalized version of the ASE, called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find a representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, autoregressive models [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most widely used method to integrate short-term features. Let $\mathbf{x}_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T$ denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1,$

$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d] - \mu[d]\right)^2\right]^{1/2}, \quad 0 \le d \le D-1,$

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features, nor about the time-varying behavior of music signals.
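The integration above amounts to concatenating the per-dimension means and population standard deviations into one 2D-dimensional long-term vector:

```python
import numpy as np

def mean_std_integration(frames):
    """Collapse a (T x D) matrix of short-term feature vectors into one
    2D-dimensional vector [mean; std], using the population (1/T) std
    that matches the formula above."""
    frames = np.asarray(frames, dtype=float)
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)          # ddof=0 matches the 1/T definition
    return np.concatenate([mu, sigma])

x = np.array([[0.0, 2.0],
              [2.0, 2.0]])             # T = 2 frames, D = 2 dimensions
v = mean_std_integration(x)            # [mean_0, mean_1, std_0, std_1]
```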

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used AR models to analyze the time-varying texture of music signals. They proposed the diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analyses to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
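A hedged sketch of DAR-style integration: each feature dimension gets its own least-squares AR fit, and the AR coefficients are collected together with per-dimension statistics. The order and the exact set of collected statistics are illustrative, not Meng et al.'s precise recipe:

```python
import numpy as np

def dar_features(frames, order=2):
    """Fit an order-p AR model to each feature dimension independently by
    least squares; return [mean, var, a_1..a_p] per dimension, concatenated."""
    frames = np.asarray(frames, dtype=float)
    T, D = frames.shape
    feats = []
    for d in range(D):
        x = frames[:, d]
        # design matrix of lagged values: x[t] ~ sum_k a_k * x[t-k]
        X = np.column_stack([x[order - k - 1:T - k - 1] for k in range(order)])
        y = x[order:]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        feats.extend([x.mean(), x.var()])
        feats.extend(coef.tolist())
    return np.array(feats)

frames = np.column_stack([np.linspace(0.0, 1.0, 12), np.ones(12)])
feats = dar_features(frames, order=2)          # D*(2+order) = 8 values
```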

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.
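The core operation, an FFT along the time axis of each short-term feature trajectory, can be sketched as follows; the frame rate and the test trajectory are made-up values:

```python
import numpy as np

def modulation_spectrogram(feature_traj, frame_rate):
    """Modulation spectrum of short-term feature trajectories.

    feature_traj: (T x D) matrix, one row per analysis frame.
    frame_rate: number of short-term frames per second.
    Returns (D x T//2+1) magnitudes and the modulation-frequency axis in Hz.
    """
    traj = np.asarray(feature_traj, dtype=float)
    traj = traj - traj.mean(axis=0)               # remove DC of each trajectory
    mod = np.abs(np.fft.rfft(traj, axis=0)).T     # FFT along the time axis
    freqs = np.fft.rfftfreq(traj.shape[0], d=1.0 / frame_rate)
    return mod, freqs

# a feature trajectory oscillating at 4 Hz (the modulation rate human
# audition is most sensitive to) peaks at the 4 Hz modulation bin
frame_rate = 100.0                                # 100 short-term frames/s (assumed)
t = np.arange(200) / frame_rate
traj = np.sin(2 * np.pi * 4.0 * t)[:, None]
mod, freqs = modulation_spectrogram(traj, frame_rate)
peak_hz = freqs[np.argmax(mod[0])]
```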

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, which is complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes. The optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not take class-wise differences into account.
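The scatter-matrix formulation of LDA described above can be sketched as follows. The small ridge term added to the within-class scatter is our own safeguard against singular matrices, not part of the standard derivation:

```python
import numpy as np

def lda_transform(X, y, d, reg=1e-6):
    """Fisher LDA sketch: project onto the d leading eigenvectors of
    inv(Sw) @ Sb (within-/between-class scatter)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mu = X.mean(axis=0)
    n = X.shape[1]
    Sw = np.zeros((n, n))                          # within-class scatter
    Sb = np.zeros((n, n))                          # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw + reg * np.eye(n)) @ Sb)
    order = np.argsort(-evals.real)
    return X @ evecs[:, order[:d]].real            # keep d best directions

# two classes separated along the first axis; LDA keeps that direction
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])
z = lda_transform(X, y, 1)
```

In the projected space the two samples of each class collapse onto the same point, while the classes stay 5 units apart (up to the eigenvector's sign).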

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral, rhythmic, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres contain Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres contain BigBand, Cool, Fusion, Piano, Quartet, and Swing. The experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance with and without a decision tree classifier for a Gaussian classifier, a GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and the decision tree classifier has the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system. In their classification system, some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which are unable to be correctly classified, and the GMM model of each music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy can reach 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from these high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced. In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, a conclusion will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

$\hat{s}[n] = s[n] - a \times s[n-1]$  (1)

where s[n] is the current sample, s[n−1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

$\tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1,$  (2)

where the Hamming window function w[n] is defined as

$w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1.$  (3)

Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1,$  (4)

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1,$  (5)

where B is the total number of filters (B = 25 in this study), and $I_{b,l}$ and $I_{b,h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b,l}$ and $I_{b,h}$ are given as

$I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N},$  (6)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT to the logarithm of E(b):

$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\left(1 + E_i(b)\right) \cos\!\left(\frac{\pi \, l \, (b + 0.5)}{B}\right), \quad 0 \le l < L,$  (7)

where L is the length of the MFCC feature vector (L = 20 in this study).

Therefore, the MFCC feature vector can be represented as follows:

$\mathbf{x}_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T.$  (8)
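Steps 1-6 can be strung together as a sketch. The frame size, hop, and the four example bands (taken from the first rows of Table 2.1) are illustrative choices, not the thesis's exact configuration:

```python
import numpy as np

def mfcc(signal, band_edges, fs=22050, frame_len=512, hop=256, a=0.95, L=20):
    """Steps 1-6 above. band_edges: list of (f_low, f_high) pairs in Hz
    (Table 2.1 holds the thesis's 25 bands; any layout works here)."""
    s = np.asarray(signal, dtype=float)
    s = np.concatenate(([s[0]], s[1:] - a * s[:-1]))           # Step 1: pre-emphasis
    n = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # Hamming window
    B = len(band_edges)
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    dct_basis = np.cos(np.pi * l * (b + 0.5) / B)              # Step 6 cosine terms
    coeffs = []
    for start in range(0, len(s) - frame_len + 1, hop):        # Step 2: framing
        frame = s[start:start + frame_len] * w                 # Step 3: windowing
        A = np.abs(np.fft.fft(frame)) ** 2                     # Step 4: |X[k]|^2
        E = np.array([A[int(fl / (fs / frame_len)):int(fh / (fs / frame_len)) + 1].sum()
                      for fl, fh in band_edges])               # Step 5: band energies
        coeffs.append(dct_basis @ np.log10(1.0 + E))           # Step 6: DCT
    return np.array(coeffs)                                    # one L-dim vector per frame

sig = np.sin(2 * np.pi * 440 * np.arange(2048) / 22050)        # 440 Hz test tone
bands = [(0, 200), (100, 300), (200, 400), (300, 500)]         # first rows of Table 2.1
c = mfcc(sig, bands)
```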

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1,$  (9)

where B is the number of subbands, and $I_{b,l}$ and $I_{b,h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b,l}$ and $I_{b,h}$ are given as

$I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N},$  (10)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$Peak(b) = \log\!\left(1 + \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i}\right),$  (11)

$Valley(b) = \log\!\left(1 + \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b - i + 1}\right),$  (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$SC(b) = Peak(b) - Valley(b).$  (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$\mathbf{x}_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T.$  (14)

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)
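Eqs. (11)-(14) for a single frame can be sketched as follows; the subband bin ranges are illustrative rather than the Table 2.2 mapping:

```python
import numpy as np

def osc_frame(mag, band_bins, alpha=0.2):
    """Per-subband log peak/valley over the alpha*Nb largest/smallest
    magnitudes; contrast = peak - valley."""
    valleys, contrasts = [], []
    for lo, hi in band_bins:                   # FFT-bin range of each subband
        band = np.sort(mag[lo:hi + 1])[::-1]   # sorted in decreasing order
        nb = len(band)
        k = max(1, int(round(alpha * nb)))     # neighborhood size alpha*Nb
        peak = np.log1p(band[:k].mean())       # Eq. (11)
        valley = np.log1p(band[-k:].mean())    # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)        # Eq. (13): SC(b)
    return np.array(valleys + contrasts)       # Eq. (14): [valleys, contrasts]

mag = np.abs(np.fft.fft(np.sin(2 * np.pi * 8 * np.arange(256) / 256)))[:128]
band_bins = [(0, 15), (16, 63), (64, 127)]     # illustrative bin ranges
x_osc = osc_frame(mag, band_bins)
```

Because the peak averages the largest magnitudes and the valley the smallest, every contrast value is non-negative.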


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame, where each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames. Each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, denoted X(k), 1 \le k \le N, where N is the FFT size. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = \begin{cases} \dfrac{1}{E_w N}\,|X(k)|^2, & k = 0,\ k = N/2 \\ \dfrac{2}{E_w N}\,|X(k)|^2, & 0 < k < N/2 \end{cases} \tag{15}

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2 \tag{16}

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge"), covering 8 octaves (see Fig. 2.4). The subband filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 \tag{17}

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, with r the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves}, \quad -4 \le j \le 3 \tag{18}

I_{b,l} and I_{b,h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_{b,l} = \frac{f_{b,l}}{f_s} N, \quad I_{b,h} = \frac{f_{b,h}}{f_s} N \tag{19}

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1 \tag{20}

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10 \log_{10}(ASE(b)), \quad 0 \le b \le B+1 \tag{21}

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 \tag{22}

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1} \left( ASE_{dB}(b) \right)^2} \tag{23}

In MPEG-7, the ASE coefficients consist of one coefficient representing the power between 0 Hz and loEdge, a series of coefficients representing the power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing the power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame can be represented as follows:

x^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T \tag{24}
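A minimal NumPy sketch of Steps 1-3, assuming the band edges are generated from loEdge/hiEdge with resolution r octaves; the function name `nase_frame` and the dB floor used to keep the logarithm finite are choices of this illustration, not the MPEG-7 reference implementation:

```python
import numpy as np

def nase_frame(frame, fs=44100, lo_edge=62.5, hi_edge=16000.0, r=0.5):
    """NASE vector [R, NASE(0), ..., NASE(B+1)] of one audio frame."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                          # Eq. (16), window energy
    X = np.fft.rfft(frame * w)
    P = (2.0 / (Ew * N)) * np.abs(X) ** 2        # Eq. (15), 0 < k < N/2
    P[0] /= 2.0                                  # endpoints use factor 1/(Ew N)
    P[-1] /= 2.0
    freqs = np.arange(len(P)) * fs / N
    B = int(round(8 / r))                        # in-band coefficients, Eq. (18)
    edges = lo_edge * 2.0 ** (r * np.arange(B + 1))   # log-spaced band edges
    edges = np.concatenate(([0.0], edges, [fs / 2]))  # below-loEdge / above-hiEdge
    ase = np.array([P[(freqs >= lo) & (freqs < hi)].sum()
                    for lo, hi in zip(edges[:-1], edges[1:])])   # Eq. (20)
    ase_db = 10.0 * np.log10(np.maximum(ase, 1e-12))  # Eq. (21), floored for log
    R = np.sqrt(np.sum(ase_db ** 2))                  # Eq. (23), RMS-norm gain
    return np.concatenate(([R], ase_db / R))          # Eqs. (22), (24)
```

With B = 16 this returns the B+3 = 19 values per frame stated in the text.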


Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 coefficients in the logarithmically spaced bands between the edges (with edges at 62.5, 88.4, 125, 176.8, 250, 353.6, 500, 707.1, 1000, 1414.2, 2000, 2828.4, 4000, 5656.9, 8000, 11313.7, and 16000 Hz), and one coefficient above hiEdge


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we apply modulation spectral analysis to the MFCC, OSC, and NASE trajectories to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W/2 + n}[l]\, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \; 0 \le l < L \tag{25}

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \; 0 \le l < L \tag{26}

where T is the total number of texture windows in the music track.
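The texture-window analysis of Eqs. (25) and (26) can be sketched as follows; `modulation_spectrogram` is a hypothetical helper that operates on any per-frame feature trajectory (MFCC here, OSC and NASE identically):

```python
import numpy as np

def modulation_spectrogram(feat, W=512):
    """Average magnitude modulation spectrogram of a (num_frames, L) feature
    trajectory, using texture windows of length W with 50% overlap."""
    num_frames, L = feat.shape
    hop = W // 2
    T = (num_frames - W) // hop + 1              # number of texture windows
    avg = np.zeros((W // 2 + 1, L))
    for t in range(T):
        seg = feat[t * hop : t * hop + W]        # one texture window
        avg += np.abs(np.fft.rfft(seg, axis=0))  # Eq. (25), FFT along time
    return avg / T                               # Eq. (26), time average
```

A feature trajectory oscillating with a period of 64 frames produces a peak at modulation bin W/64 = 8, which is how rhythmic periodicities show up in the modulation spectrum.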

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \tag{27}

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \tag{28}

where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \tag{29}

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
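A sketch of the subband peak/valley/contrast computation of Eqs. (27)-(29), using the modulation frequency index ranges of Table 2.4; `msc_msv` is an illustrative name, not the thesis code:

```python
import numpy as np

# Modulation frequency index ranges of the J = 8 subbands (Table 2.4).
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32),
            (32, 64), (64, 128), (128, 256)]

def msc_msv(avg_mod_spec):
    """Per-subband modulation spectral contrast (MSC) and valley (MSV)
    from an averaged modulation spectrogram of shape (num_bins, L)."""
    L = avg_mod_spec.shape[1]
    J = len(SUBBANDS)
    msc = np.zeros((J, L))
    msv = np.zeros((J, L))
    for j, (lo, hi) in enumerate(SUBBANDS):
        band = avg_mod_spec[lo:hi]           # bins with lo <= m < hi
        msp = band.max(axis=0)               # Eq. (27), rhythmic peak
        msv[j] = band.min(axis=0)            # Eq. (28), non-rhythmic valley
        msc[j] = msp - msv[j]                # Eq. (29), contrast
    return msc, msv
```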

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 \le d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W/2 + n}[d]\, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \; 0 \le d < D \tag{30}

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D \tag{31}

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \tag{32}

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \tag{33}

where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \tag{34}

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 \le d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W/2 + n}[d]\, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \; 0 \le d < D \tag{35}

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D \tag{36}

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \tag{37}

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \tag{38}

where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \tag{39}

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowing/averaging of the modulation spectrum → contrast/valley determination → MASE)

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 \le l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \tag{40}

\sigma_{MSC\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - u_{MSC\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2} \tag{41}

u_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \tag{42}

\sigma_{MSV\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - u_{MSV\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2} \tag{43}

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [u_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), u_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), \ldots, u_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), u_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T \tag{44}

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \tag{45}

\sigma_{MSC\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - u_{MSC\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2} \tag{46}

u_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \tag{47}

\sigma_{MSV\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - u_{MSV\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2} \tag{48}

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [u_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), u_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), \ldots, u_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), u_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T \tag{49}

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T \tag{50}

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
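The row- and column-wise aggregation can be sketched as below. The hypothetical helper `aggregate` groups all means before the standard deviations rather than interleaving them per index as in Eqs. (44) and (49), but the resulting vector has the same length (4L + 4J = 112 for MMFCC with L = 20, J = 8):

```python
import numpy as np

def aggregate(msc, msv):
    """Mean/std along each row and column of the (J, L) MSC and MSV
    matrices, concatenated into one vector of length 4L + 4J."""
    parts = []
    for mat in (msc, msv):
        parts += [mat.mean(axis=0), mat.std(axis=0)]   # row-based, one per l
    for mat in (msc, msv):
        parts += [mat.mean(axis=1), mat.std(axis=1)]   # column-based, one per j
    return np.concatenate(parts)
```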

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d) \tag{51}

\sigma_{MSC\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - u_{MSC\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2} \tag{52}

u_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d) \tag{53}

\sigma_{MSV\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - u_{MSV\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2} \tag{54}

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [u_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), u_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), \ldots, u_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), u_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T \tag{55}

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d) \tag{56}

\sigma_{MSC\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - u_{MSC\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2} \tag{57}

u_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d) \tag{58}

\sigma_{MSV\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - u_{MSV\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2} \tag{59}

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [u_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), u_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), \ldots, u_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), u_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T \tag{60}

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T \tag{61}

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d) \tag{62}

\sigma_{MSC\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - u_{MSC\text{-}row}^{NASE}(d) \right)^2 \right)^{1/2} \tag{63}

u_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d) \tag{64}

\sigma_{MSV\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - u_{MSV\text{-}row}^{NASE}(d) \right)^2 \right)^{1/2} \tag{65}

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [u_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), u_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), \ldots, u_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), u_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T \tag{66}

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d) \tag{67}

\sigma_{MSC\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - u_{MSC\text{-}col}^{NASE}(j) \right)^2 \right)^{1/2} \tag{68}

u_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d) \tag{69}

\sigma_{MSV\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - u_{MSV\text{-}col}^{NASE}(j) \right)^2 \right)^{1/2} \tag{70}

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [u_{MSC\text{-}col}^{NASE}(0), \sigma_{MSC\text{-}col}^{NASE}(0), u_{MSV\text{-}col}^{NASE}(0), \sigma_{MSV\text{-}col}^{NASE}(0), \ldots, u_{MSC\text{-}col}^{NASE}(J-1), \sigma_{MSC\text{-}col}^{NASE}(J-1), u_{MSV\text{-}col}^{NASE}(J-1), \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T \tag{71}

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T \tag{72}

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


Fig. 2.8 The row-based modulation spectral feature values: for each feature dimension, the mean and standard deviation (u_row, σ_row) are computed along the modulation frequency axis of the MSC/MSV matrix

Fig. 2.9 The column-based modulation spectral feature values: for each modulation subband, the mean and standard deviation (u_col, σ_col) are computed along the feature dimension axis of the MSC/MSV matrix


2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n} \tag{73}

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \tag{74}

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th normalized representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m), \quad f_{min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m) \tag{75}

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
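A sketch of the min-max normalization of Eqs. (74)-(75), fit on the training set only; the function name `linear_normalize`, the returned closure, and the guard for constant feature dimensions are devices of this illustration:

```python
import numpy as np

def linear_normalize(train_feats):
    """Fit Eq. (75) on the rows of `train_feats` (one feature vector per
    training signal) and return (f_min, f_max, normalize), where
    `normalize` maps a vector into [0, 1] per Eq. (74)."""
    f_min = train_feats.min(axis=0)                     # Eq. (75)
    f_max = train_feats.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # avoid divide-by-zero
    normalize = lambda f: (f - f_min) / span            # Eq. (74)
    return f_min, f_max, normalize
```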

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h \le H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T \tag{76}

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T \tag{77}

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = \mathrm{tr}\left( (A^T S_W A)^{-1} (A^T S_B A) \right) \tag{78}

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the corresponding eigenvalues; thus S_W \Phi = \Phi \Lambda. Each training vector x is then whitening transformed by \Phi \Lambda^{-1/2}:

x_w = (\Phi \Lambda^{-1/2})^T x \tag{79}

It can be shown that the whitened within-class scatter matrix S_{W_w} = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_{B_w} = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_{B_w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues form the column vectors of the transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi \tag{80}

A_{WLDA} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x \tag{81}
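The whitening-plus-LDA procedure of Eqs. (76)-(80) can be sketched as follows; the function name `whitened_lda` and the eigenvalue floor (`1e-10`, added purely to keep the inverse square root finite) are choices of this sketch:

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """Whitened LDA transformation matrix A_WLDA for data X of shape
    (num_samples, H) with integer labels y in 0..num_classes-1.
    Returns a matrix with h = num_classes - 1 columns."""
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        Sw += d.T @ d                              # Eq. (76)
        m = (mc - mean_all)[:, None]
        Sb += len(Xc) * (m @ m.T)                  # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                  # Sw Phi = Phi Lambda
    white = Phi @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-10)))  # Phi Lambda^(-1/2)
    Sb_w = white.T @ Sb @ white                    # whitened between-class scatter
    vals, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(vals)[::-1][:num_classes - 1]]  # top C-1 eigenvectors
    return white @ Psi                             # Eq. (80)
```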

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 \le c \le C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n} \tag{82}

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) \tag{83}
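The nearest-centroid rule of Eqs. (82)-(83) in sketch form (`fit_centroids` and `classify` are illustrative names):

```python
import numpy as np

def fit_centroids(Y, labels, num_classes):
    """Eq. (82): per-genre centroids of the transformed training vectors."""
    return np.array([Y[labels == c].mean(axis=0) for c in range(num_classes)])

def classify(y, centroids):
    """Eq. (83): index of the centroid nearest to y in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(centroids - y, axis=1)))
```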

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c \tag{84}

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
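Eq. (84) is simply a class-prior-weighted average of per-class accuracies, equivalent to total correct over total tracks; a one-line sketch:

```python
def overall_accuracy(per_class_accuracy, class_counts):
    """Eq. (84): overall accuracy as the class-probability-weighted sum of
    per-class accuracies (P_c = class count / total count)."""
    total = sum(class_counts)
    return sum((n / total) * ca
               for ca, n in zip(per_class_accuracy, class_counts))
```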

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and that the combined feature vector performs best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64


Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each part, the first table gives track counts (columns: actual genre; rows: classified genre) and the second gives the corresponding percentages of each actual genre.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Rock/Pop          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Rock/Pop  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Rock/Pop       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Rock/Pop          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Rock/Pop  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Rock/Pop       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Rock/Pop          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Rock/Pop  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Rock/Pop       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Rock/Pop          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Rock/Pop  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Rock/Pop       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4

Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19

World 33 10 3 0 9 54 Total 320 114 26 45 102 122

46

(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803

Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557

MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557

World 1031 877 1154 000 882 4426

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6

Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10

World 40 6 2 1 12 51 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80

(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122


(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26

(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11

33 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set CA (%)
SMMFCC3 80.38
SMOSC3 81.34
SMASE3 81.21
SMMFCC3+SMOSC3+SMASE3 85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57


(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39


(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
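To illustrate the difference between the two approaches, the sketch below contrasts plain subband energy with contrast/valley features computed from a modulation spectrum. The band edges are illustrative, and using the max/min of the log magnitudes as peak/valley is a simplifying assumption; the thesis computes MSC and MSV per logarithmically spaced modulation subband.

```python
import numpy as np

def subband_energy(mod_spectrum, band_edges):
    # conventional approach: one energy value per modulation subband
    return np.array([mod_spectrum[lo:hi].sum() for lo, hi in band_edges])

def msc_msv(mod_spectrum, band_edges):
    # contrast/valley approach: valley and peak-minus-valley per subband
    # (max/min of the log magnitudes stand in for the peak/valley estimates)
    msc, msv = [], []
    for lo, hi in band_edges:
        band = np.log(1.0 + mod_spectrum[lo:hi])
        msv.append(band.min())
        msc.append(band.max() - band.min())
    return np.array(msc), np.array(msv)

# two modulation bands over an 8-bin modulation spectrum
edges = [(0, 4), (4, 8)]
spectrum = np.array([0.2, 5.0, 0.1, 0.3, 1.0, 1.1, 0.9, 1.2])
print(subband_energy(spectrum, edges))
print(msc_msv(spectrum, edges))
```

The energy features collapse each band to a single magnitude, whereas the contrast/valley pair also keeps the spread between strong and weak modulation components within the band.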


Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) for each feature value

Feature Set MSCs & MSVs MSE
SMMFCC1 77.50 72.02
SMMFCC2 70.64 69.82
SMMFCC3 80.38 79.15
SMOSC1 79.15 77.50
SMOSC2 68.59 70.51
SMOSC3 81.34 80.11
SMASE1 77.78 76.41
SMASE2 71.74 71.06
SMASE3 81.21 79.15
SMMFCC1+SMOSC1+SMASE1 84.64 85.08
SMMFCC2+SMOSC2+SMASE2 78.60 79.01
SMMFCC3+SMOSC3+SMASE3 85.32 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. Int. Conf. on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Transactions on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th Int. Conf. on Digital Audio Effects, pp. 8-11, Sept. 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, Nov. 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, Sept. 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, March 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, Oct. 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification. New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop on Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


121 Feature Extraction

1211 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

12111 Timbral features

Timbral features are generally characterized by properties related to instrumentation or sound sources such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows.

(1) Low-Energy Feature: it is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.

(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

ZCR_t = \frac{1}{2} \sum_{n=1}^{N-1} \left| \mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1]) \right|

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
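A minimal numerical sketch of this definition (the function and variable names are illustrative, not from the thesis):

```python
import numpy as np

def zero_crossing_rate(x):
    # sign: 1 for non-negative samples, 0 for negative ones, as defined above
    s = (x >= 0).astype(int)
    return 0.5 * np.sum(np.abs(np.diff(s)))

# a noisy frame changes sign far more often than a low-frequency tone
t = np.arange(256) / 8000.0
tone = np.sin(2 * np.pi * 200.0 * t + 0.1)
noise = np.random.default_rng(0).standard_normal(256)
print(zero_crossing_rate(tone), zero_crossing_rate(noise))
```

This is why ZCR is commonly read as a noisiness indicator: broadband noise produces many more sign changes per frame than a tonal signal.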

(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

C_t = \frac{\sum_{n=1}^{N} n \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: the spectral bandwidth measures the frequency bandwidth of the signal:

SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}

(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]

where S[k] is the magnitude of the k-th frequency bin.

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
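The four spectral-shape definitions above can be sketched directly (a simplified illustration; bin indexing and normalization follow the equations above, and the names are illustrative):

```python
import numpy as np

def spectral_centroid(M):
    n = np.arange(1, len(M) + 1)                  # frequency-bin index n = 1..N
    return np.sum(n * M) / np.sum(M)

def spectral_bandwidth(M):
    n = np.arange(1, len(M) + 1)
    return np.sum(((n - spectral_centroid(M)) ** 2) * M) / np.sum(M)

def spectral_rolloff(S, ratio=0.85):
    # smallest bin R such that the cumulative magnitude reaches 85% of the total
    cum = np.cumsum(S)
    return int(np.searchsorted(cum, ratio * cum[-1]))

def spectral_flux(N_cur, N_prev):
    # squared difference between successive normalized magnitude spectra
    return np.sum((N_cur - N_prev) ** 2)
```

For a spectrum whose energy sits in a single bin, the centroid lands on that bin, the bandwidth is zero, and the roll-off frequency coincides with the bin index, which is a quick sanity check for any implementation.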

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone. The mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the root mean square (RMS) energy, yielding a normalized version of the ASE called NASE.

12112 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the periods of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

12113 Pitch features

Tzanetakis et al [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1212 Long-term Features

To find a representative feature vector for a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, autoregressive models [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

12121 Mean and standard deviation

Taking the mean and standard deviation is the most common method to integrate short-term features. Let x_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

\mu[d] = \frac{1}{T} \sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1

\sigma[d] = \left[ \frac{1}{T} \sum_{i=0}^{T-1} \left( x_i[d] - \mu[d] \right)^2 \right]^{1/2}, \quad 0 \le d \le D-1

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
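A sketch of this aggregation, assuming the short-term features are stacked as a T x D matrix (one row per frame):

```python
import numpy as np

def aggregate_mean_std(X):
    """X: (T, D) matrix of T short-term frame vectors.
    Returns a 2D-dimensional long-term vector [mu, sigma]."""
    mu = X.mean(axis=0)        # mean over the T frames, per dimension
    sigma = X.std(axis=0)      # population std, matching the 1/T definition above
    return np.concatenate([mu, sigma])

frames = np.array([[0.0, 2.0],
                   [2.0, 4.0],
                   [4.0, 6.0]])
print(aggregate_mean_std(frames))   # [mean(d=0), mean(d=1), std(d=0), std(d=1)]
```

Note how the frame ordering is discarded entirely: shuffling the rows of the matrix leaves the output unchanged, which is precisely the loss of time-varying information the text points out.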

12122 Autoregressive model (AR model)

Meng et al [9] used AR models to analyze the time-varying texture of music signals. They proposed the diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analyses to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the dimension of the coefficients is p × D × D, where D is the feature dimension of a short-term feature vector.

12123 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.
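The core operation, a Fourier transform along the time axis of each feature trajectory, can be sketched as follows (the frame rate and DC removal are illustrative choices, not taken from the cited papers):

```python
import numpy as np

def modulation_spectrogram(F, frame_rate):
    """F: (T, D) matrix of short-term feature vectors, one row per frame.
    Returns the magnitude modulation spectrum of each of the D feature
    trajectories, plus the modulation-frequency axis in Hz."""
    T = F.shape[0]
    traj = F - F.mean(axis=0)                  # drop DC so slow offsets do not dominate
    mod = np.abs(np.fft.rfft(traj, axis=0))    # FFT along the time (frame) axis
    freqs = np.fft.rfftfreq(T, d=1.0 / frame_rate)
    return mod, freqs

# a feature value oscillating 4 times per second shows a peak at 4 Hz,
# the modulation frequency to which human audition is most sensitive
frame_rate = 100.0                             # 100 feature frames per second
t = np.arange(200) / frame_rate
F = np.sin(2 * np.pi * 4.0 * t)[:, None]
mod, freqs = modulation_spectrogram(F, frame_rate)
print(freqs[np.argmax(mod[:, 0])])
```

The modulation-frequency axis is set by the frame rate, not the audio sampling rate, which is why this analysis captures slow, rhythm-scale variation that a single short-term spectrum cannot.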

12124 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

122 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes. The optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not take the class-wise differences into account.
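The transformation described above can be sketched with scatter matrices (a minimal multi-class Fisher-style implementation; the exact solver used in Chapter 2 may differ):

```python
import numpy as np

def lda_matrix(X, y, d):
    """Return the n x d transformation maximizing between-class over
    within-class scatter, via the top-d eigenvectors of pinv(Sw) @ Sb."""
    n = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)            # within-class scatter
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)          # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:d]]

# two toy classes in 2-D projected onto a single discriminative axis
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
              [5.0, 5.0], [5.2, 5.1], [5.1, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
z = X @ lda_matrix(X, y, 1)
print(abs(z[y == 1].mean() - z[y == 0].mean()))
```

After projection, the class means are far apart relative to the within-class spread, which is exactly the separability criterion the text describes.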

123 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres contain Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres contain BigBand, Cool, Fusion, Piano, Quartet, and Swing. The experimental results show that GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system. In their system, a majority vote is taken to decide the final classification. The genres adopted are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and decision tree classifier achieves the best accuracy, 82.79%.

Xu et al [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system. In their system, some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames that cannot be correctly classified, and the GMM model of a music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from these high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then, two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile, human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. WPT is a variant of DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

13 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced. In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, a conclusion will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig 12. A detailed description of each module is given below.


21 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

211 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig 21 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

\hat{s}[n] = s[n] - a \times s[n-1] \quad (1)

where s[n] is the current sample, s[n-1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames is overlapped by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

\tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1 \quad (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N-1 \quad (3)


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi nk/N}, \quad 0 \le k \le N-1 \quad (4)

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 \quad (5)

where B is the total number of filters (B is 25 in this study), and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

I_{b,l} = \frac{f_{b,l}}{f_s} N, \quad I_{b,h} = \frac{f_{b,h}}{f_s} N \quad (6)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 21.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT on the logarithm of E(b):

MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\left( 1 + E_i(b) \right) \cos\left( \frac{\pi \, l \, (b + 0.5)}{B} \right), \quad 0 \le l < L \quad (7)

where L is the length of the MFCC feature vector (L is 20 in this study).


Therefore, the MFCC feature vector can be represented as follows:

x_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T \quad (8)
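Steps 1-6 can be sketched for a single frame as follows (a simplified illustration: rectangular summation over each band interval per Eq (5), with a short list of example bands standing in for the 25 filters of Table 21):

```python
import numpy as np

def mfcc_frame(s, fs, bands, L=20, a=0.95):
    N = len(s)
    s_hat = np.append(s[0], s[1:] - a * s[:-1])              # (1) pre-emphasis
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    A = np.abs(np.fft.fft(s_hat * w)) ** 2                   # (2)-(4) window + |FFT|^2
    E = np.array([A[int(fl / fs * N):int(fh / fs * N) + 1].sum()
                  for fl, fh in bands])                      # (5) subband energies
    B = len(bands)
    b = np.arange(B)
    return np.array([np.sum(np.log10(1.0 + E) * np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])                     # (6)-(7) DCT of log energies

# illustrative octave-like bands instead of the full Table 21 filter bank
bands = [(0, 200), (200, 400), (400, 800), (800, 1600), (1600, 3200)]
frame = np.sin(2 * np.pi * 440.0 * np.arange(512) / 22050.0)
coeffs = mfcc_frame(frame, 22050.0, bands)
print(coeffs.shape)
```

The band boundaries and frame length here are illustrative only; in the thesis the 25 triangular band intervals of Table 21 and the framing parameters of Step 2 would be used.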

Fig 21 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 21 The range of each triangular band-pass filter

Filter number Frequency interval (Hz)
0 (0, 200]
1 (100, 300]
2 (200, 400]
3 (300, 500]
4 (400, 600]
5 (500, 700]
6 (600, 800]
7 (700, 900]
8 (800, 1000]
9 (900, 1149]
10 (1000, 1320]
11 (1149, 1516]
12 (1320, 1741]
13 (1516, 2000]
14 (1741, 2297]
15 (2000, 2639]
16 (2297, 3031]
17 (2639, 3482]
18 (3031, 4000]
19 (3482, 4595]
20 (4000, 5278]
21 (4595, 6063]
22 (5278, 6964]
23 (6063, 8000]
24 (6964, 9190]

212 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig 22 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 22. The octave-scale filtering operation can be described as follows:

E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 \quad (9)

where B is the number of subbands, and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

I_{b,l} = \frac{f_{b,l}}{f_s} N, \quad I_{b,h} = \frac{f_{b,h}}{f_s} N \quad (10)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i} \right) \quad (11)

Valley(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1} \right) \quad (12)

where \alpha is a neighborhood factor (\alpha is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b) \quad (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T \quad (14)
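Equations (11)-(13) for one subband can be sketched as follows (an illustrative helper; alpha times N_b is rounded to at least one bin):

```python
import numpy as np

def osc_subband(M, alpha=0.2):
    """Peak, valley, and spectral contrast of one subband, per Eqs. (11)-(13)."""
    M_sorted = np.sort(np.asarray(M, dtype=float))[::-1]   # M_b,1 >= ... >= M_b,Nb
    k = max(1, int(round(alpha * len(M_sorted))))          # number of bins averaged
    peak = np.log(M_sorted[:k].mean())                     # (11) strongest alpha*Nb bins
    valley = np.log(M_sorted[-k:].mean())                  # (12) weakest alpha*Nb bins
    return peak, valley, peak - valley                     # (13) contrast

# a subband with two strong harmonic bins over a weak noise floor
magnitudes = [10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
peak, valley, contrast = osc_subband(magnitudes)
print(round(contrast, 4))   # → 2.3026
```

Averaging the alpha*N_b strongest and weakest bins, rather than taking a single maximum and minimum, makes the peak and valley estimates robust to isolated spectral spikes.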

Fig 22 The flowchart for computing OSC (Input signal → Framing → FFT → Octave scale filtering → Peak/Valley selection → Spectral contrast → OSC)


Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification The NASE descriptor

provides a representation of the power spectrum of each audio frame Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband Fig 23 shows the block diagram for extracting the

NASE feature For a given music piece the main steps for computing NASE are

described as follows

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames and each audio frame is multiplied by a Hamming window function

and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N,

where N is the size of FFT The power spectrum is defined as the normalized

squared magnitude of the DFT spectrum X(k)

P(k) = \begin{cases} \dfrac{1}{N E_w} |X(k)|^2, & k = 0 \text{ or } k = N/2 \\ \dfrac{2}{N E_w} |X(k)|^2, & 0 < k < N/2 \end{cases}    (15)


where Ew is the energy of the Hamming window function w(n) of size Nw

E_w = \sum_{n=0}^{N_w-1} |w(n)|^2    (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig 24) The NASE scale filtering operation can be described as follows (see Table 23):

ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B    (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves}, \quad -4 \le j \le 3    (18)

Ibl and Ibh are the low-frequency index and high-frequency index of the b-th

band-pass filter given as

I_{b,l} = \frac{f_{b,l}}{f_s} N, \quad I_{b,h} = \frac{f_{b,h}}{f_s} N    (19)

where fs is the sampling frequency fbl and fbh are the low frequency and

high frequency of the b-th band-pass filter

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power


spectrum coefficients within this subband

ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1    (20)

Each ASE coefficient is then converted to the decibel scale

ASE_{dB}(b) = 10 \log_{10}(ASE(b)), \quad 0 \le b \le B+1    (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1    (22)

where the RMS-norm gain value R is defined as

R = \sqrt{ \sum_{b=0}^{B+1} \left( ASE_{dB}(b) \right)^2 }    (23)

In MPEG-7 the ASE coefficients consist of one coefficient representing power

between 0 Hz and loEdge a series of coefficients representing power in

logarithmically spaced bands between loEdge and hiEdge a coefficient representing

power above hiEdge, and the RMS-norm gain value R Therefore the feature dimension

of NASE is B+3 Thus the NASE feature vector of an audio frame will be

represented as follows

x_{NASE} = [R, NASE(0), NASE(1), …, NASE(B+1)]^T    (24)
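A minimal sketch of Steps 1-3 for one frame, assuming a Hamming window, log-spaced band edges between loEdge = 62.5 Hz and hiEdge = 16 kHz, plus one band below loEdge and one above hiEdge as in Table 23; the function name and the epsilon inside the logarithm are illustrative assumptions:

```python
import numpy as np

def nase_frame(frame, fs=44100, lo_edge=62.5, hi_edge=16000.0, B=16):
    """Sketch of NASE (eqs. 15-24): windowed power spectrum, B log-spaced
    bands plus below-loEdge and above-hiEdge bands, dB scale, RMS gain."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                              # eq. (16)
    X = np.fft.rfft(frame * w)
    P = 2.0 * np.abs(X) ** 2 / (N * Ew)              # eq. (15), one-sided
    P[0] /= 2.0                                      # DC counted once
    if N % 2 == 0:
        P[-1] /= 2.0                                 # Nyquist counted once
    # band edges: 0, loEdge, B log-spaced points up to hiEdge, then fs/2
    mid = lo_edge * (hi_edge / lo_edge) ** (np.arange(B + 1) / B)
    edges = [0.0] + list(mid) + [fs / 2]
    ase = np.empty(B + 2)
    for b in range(B + 2):
        lo = int(edges[b] / fs * N)
        hi = max(lo + 1, int(edges[b + 1] / fs * N))
        ase[b] = P[lo:hi].sum()                      # eqs. (17)/(20)
    ase_db = 10.0 * np.log10(ase + 1e-12)            # eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                 # eq. (23)
    return np.concatenate(([R], ase_db / R))         # eqs. (22)/(24), B+3 values
```

By construction the normalized part of the returned vector has unit Euclidean norm.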


Fig 23 The flowchart for computing NASE (Input signal → Framing → Windowing → FFT → Subband decomposition → Normalized audio spectral envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (half-octave band edges from loEdge = 62.5 Hz to hiEdge = 16 kHz; 1 coefficient below loEdge, 16 coefficients in between, and 1 coefficient above hiEdge)


Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

214 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals To capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the long-term variations of the sound

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis


Let MFCC_i[l], 0 ≤ l < L, denote the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W/2 + n}[l] \, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \ 0 \le l < L    (25)

where Mt(m l) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and l is the MFCC coefficient index In

the study W is 512 which is about 6 seconds with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \ 0 \le l < L    (26)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (28)

where Φjl and Φjh are respectively the low modulation frequency index and


high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)    (29)

As a result all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information Therefore the feature dimension of MMFCC is 2×20×8 = 320
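Steps 2 and 3 can be sketched as follows for an (I × L) matrix of per-frame feature values; the function is illustrative, averaging magnitude FFTs over 50%-overlapped texture windows and using the modulation-frequency bin edges of Table 24:

```python
import numpy as np

def modulation_contrast(feat, W=512, J=8):
    """Sketch of eqs. (25)-(29): feat is (num_frames, L); returns the J x L
    MSC and MSV matrices."""
    I, L = feat.shape
    hop = W // 2                                      # 50% texture-window overlap
    starts = range(0, I - W + 1, hop)
    mags = [np.abs(np.fft.rfft(feat[s:s + W], axis=0)) for s in starts]
    M = np.mean(mags, axis=0)                         # eq. (26): time average
    edges = [0] + [2 ** (j + 1) for j in range(J)]    # [0,2,4,...,256], Table 2.4
    MSC = np.empty((J, L))
    MSV = np.empty((J, L))
    for j in range(J):
        band = M[edges[j]:edges[j + 1]]               # j-th modulation subband
        MSP = band.max(axis=0)                        # eq. (27)
        MSV[j] = band.min(axis=0)                     # eq. (28)
        MSC[j] = MSP - MSV[j]                         # eq. (29)
    return MSC, MSV
```

With L = 20 MFCC values and J = 8 subbands, stacking MSC and MSV gives the 2×20×8 = 320 MMFCC values quoted above.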

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, denote the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W/2 + n}[d] \, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \ 0 \le d < D    (30)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the OSC coefficient index In the

study W is 512 which is about 6 seconds with 50% overlap between two

successive texture windows The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \ 0 \le d < D    (31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated


MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (33)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)    (34)

As a result all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information Therefore the feature dimension of MOSC is 2×20×8 = 320

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, denote the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W/2 + n}[d] \, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \ 0 \le d < D    (35)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the NASE coefficient index In

the study W is 512 which is about 6 seconds with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \ 0 \le d < D    (36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands (see Table 24) In the study the number of modulation subbands is 8 (J = 8) The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (38)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)    (39)

As a result all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information Therefore the feature dimension of MASE is 2×19×8 = 304


Fig 27 The flowchart for extracting MASE (Music signal → Framing → NASE extraction → DFT of each feature trajectory M_t,d[m] over texture windows → Windowing average of the modulation spectra → Contrast/Valley determination → MASE)

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
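For reference, a modulation bin index m maps to a modulation frequency of m × (frame rate) / W Hz. The frame rate of about 84.5 frames per second used below is not stated directly in the text; it is inferred from the table (bin 2 ↔ 0.33 Hz with W = 512):

```python
# Map a modulation frequency bin index to Hz (eq. form: f_mod = m * frame_rate / W).
frame_rate = 84.48   # frames/s implied by Table 2.4 (assumption, not stated)
W = 512              # texture window length in frames

def mod_bin_to_hz(m):
    return m * frame_rate / W
```

For example, bin 2 maps to 0.33 Hz and the top bin 256 to 42.24 Hz, matching the table.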

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig 28) Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29)

To reduce the dimension of the feature space the mean and standard deviation along


each row (and each column) of the MSC and MSV matrices will be computed as the

feature values

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l lt L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

u_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - u_{MSC-row}^{MFCC}(l) \right)^2 \right)^{1/2}    (41)

u_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - u_{MSV-row}^{MFCC}(l) \right)^2 \right)^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L

and can be represented as

f_{row}^{MFCC} = [u_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), u_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), …, u_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), u_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T    (44)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

u_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - u_{MSC-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

u_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - u_{MSV-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_{col}^{MFCC} = [u_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), u_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), …, u_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), u_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)

In summary the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J That is the overall feature dimension of SMMFCC is 80+32 = 112
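The aggregation of eqs (40)-(50) reduces to row-wise and column-wise mean/standard-deviation statistics of the J×L MSC and MSV matrices. A sketch with illustrative names:

```python
import numpy as np

def aggregate(MSC, MSV):
    """Sketch of eqs. (40)-(50): MSC and MSV have shape (J, L); the result
    concatenates per-row stats (4L values) and per-column stats (4J values)."""
    feats = []
    for M in (MSC, MSV):
        feats += [M.mean(axis=0), M.std(axis=0)]   # over subbands j: eqs. (40)-(43)
    for M in (MSC, MSV):
        feats += [M.mean(axis=1), M.std(axis=1)]   # over features l: eqs. (45)-(48)
    return np.concatenate(feats)                   # length 4L + 4J, eq. (50)
```

With J = 8 and L = 20 this yields the 112-dimensional SMMFCC vector (note the exact interleaving of the entries in eqs. (44)/(49) is immaterial for a fixed-order classifier).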

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - u_{MSC-row}^{OSC}(d) \right)^2 \right)^{1/2}    (52)

u_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - u_{MSV-row}^{OSC}(d) \right)^2 \right)^{1/2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [u_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), u_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), …, u_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), u_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T    (55)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - u_{MSC-col}^{OSC}(j) \right)^2 \right)^{1/2}    (57)

u_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - u_{MSV-col}^{OSC}(j) \right)^2 \right)^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [u_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), …, u_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), u_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J That is the overall feature dimension of SMOSC is 80+32 = 112

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - u_{MSC-row}^{NASE}(d) \right)^2 \right)^{1/2}    (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - u_{MSV-row}^{NASE}(d) \right)^2 \right)^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [u_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), …, u_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), u_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T    (66)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - u_{MSC-col}^{NASE}(j) \right)^2 \right)^{1/2}    (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - u_{MSV-col}^{NASE}(j) \right)^2 \right)^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [u_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), …, u_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), u_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T    (72)

In summary the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J That is the overall feature dimension of SMASE is 76+32 = 108

Fig 28 The row-based modulation spectral feature values: the mean μ_row and standard deviation σ_row are computed along each row of the MSC and MSV matrices (across the modulation frequency axis)

Fig 29 The column-based modulation spectral feature values: the mean μ_col and standard deviation σ_col are computed along each column of the MSC and MSV matrices (across the feature dimension axis)

216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th normalized representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m), \quad f_{min}(m) = \min_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre
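A sketch of the min-max normalization of eqs (73)-(75); the small epsilon guarding division by zero for constant feature dimensions is an added safeguard, not part of the thesis:

```python
import numpy as np

def fit_minmax(train):
    """Eq. (75): per-dimension min/max over all training tracks.
    train has shape (num_tracks, num_features)."""
    return train.min(axis=0), train.max(axis=0)

def normalize(f, fmin, fmax):
    """Eq. (74): linear normalization of a feature vector (or a batch)."""
    return (f - fmin) / (fmax - fmin + 1e-12)
```

After fitting on the training set, each normalized feature dimension spans [0, 1] on that set.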

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification


accuracy at a lower dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class

c C is the total number of music classes and Nc is the number of training vectors

labeled as class c The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)

where \bar{x} is the mean vector of all training vectors The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion JF defined as the ratio of between-class scatter to within-class scatter

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right)    (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space In this study a whitening procedure is integrated with LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the


orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the

corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then

whitening transformed by ΦΛ^{-1/2}:

x_w = (ΦΛ^{-1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix Ψ Finally the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = ΦΛ^{-1/2} Ψ    (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x    (81)
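The whitening-plus-LDA procedure of eqs (76)-(81) can be sketched with NumPy eigendecompositions. This is an illustrative implementation; the small regularizer added to the eigenvalues is a numerical guard, not part of the derivation:

```python
import numpy as np

def whitened_lda(X, y):
    """Sketch of eqs. (76)-(81): X is (n_samples, H), y holds class labels.
    Returns A_WLDA mapping H dims down to C-1 dims."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)               # eq. (76)
        d = (mc - mean_all)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)               # eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                   # Sw = Phi Lam Phi^T
    Wh = Phi @ np.diag(1.0 / np.sqrt(lam + 1e-10))  # whitening: Phi Lam^(-1/2)
    Sb_w = Wh.T @ Sb @ Wh                           # whitened between-class scatter
    evals, evecs = np.linalg.eigh(Sb_w)
    Psi = evecs[:, np.argsort(evals)[::-1][: len(classes) - 1]]  # top C-1
    return Wh @ Psi                                 # eq. (80): A_WLDA
```

Projecting with `X @ whitened_lda(X, y)` then realizes eq. (81) for a batch of feature vectors.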

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_{WLDA} Let y denote the whitened LDA


transformed feature vector In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the

c-th music genre and Nc is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
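The nearest-centroid decision of eqs (82)-(83) takes only a few lines; function names here are illustrative:

```python
import numpy as np

def fit_centroids(Y, labels):
    """Eq. (82): per-genre centroids of the transformed training vectors."""
    classes = np.unique(labels)
    return classes, np.array([Y[labels == c].mean(axis=0) for c in classes])

def classify(y_vec, classes, centroids):
    """Eq. (83): pick the genre whose centroid is nearest in Euclidean distance."""
    d = np.linalg.norm(centroids - y_vec, axis=1)
    return classes[np.argmin(d)]
```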

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this

study each MP3 audio file is first converted into raw digital audio before

classification These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop and World In summary the


music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop and 122/122 tracks of World music genre

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre
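As a check of eq (84), plugging the per-class accuracies of the combined row-based feature vector (the diagonal of Table 32(d)) together with the test-set class counts reproduces the 84.64% figure reported in Table 31:

```python
import numpy as np

# Eq. (84): overall accuracy as the appearance-probability-weighted sum of
# per-class accuracies; counts and per-class CAs come from Chapter 3.
counts = np.array([320, 114, 26, 45, 102, 122])     # test tracks per genre
ca_c = np.array([0.9375, 0.8421, 0.8077, 0.7556, 0.7843, 0.7049])
P_c = counts / counts.sum()                          # appearance probabilities
CA = float(np.sum(P_c * ca_c))                       # weighted overall accuracy
```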

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1

denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE From Table 31 we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1

and the combined feature vector performs the best Table 32 shows the corresponding

confusion matrices

Table 31 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector

Feature Set               CA (%)
SMMFCC1                   77.50
SMOSC1                    79.15
SMASE1                    77.78
SMMFCC1+SMOSC1+SMASE1     84.64


Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     275      0           2     0          1        19
Electronic  0        91          0     1          7        6
Jazz        6        0           18    0          0        4
MetalPunk   2        3           0     36         20       4
PopRock     4        12          5     8          70       14
World       33       8           1     0          4        75
Total       320      114         26    45         102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     85.94    0.00        7.69   0.00       0.98     15.57
Electronic  0.00     79.82       0.00   2.22       6.86     4.92
Jazz        1.88     0.00        69.23  0.00       0.00     3.28
MetalPunk   0.63     2.63        0.00   80.00      19.61    3.28
PopRock     1.25     10.53       19.23  17.78      68.63    11.48
World       10.31    7.02        3.85   0.00       3.92     61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     292      1           1     0          2        10
Electronic  1        89          1     2          11       11
Jazz        4        0           19    1          1        6
MetalPunk   0        5           0     32         21       3
PopRock     0        13          3     10         61       8
World       23       6           2     0          6        84
Total       320      114         26    45         102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     91.25    0.88        3.85   0.00       1.96     8.20
Electronic  0.31     78.07       3.85   4.44       10.78    9.02
Jazz        1.25     0.00        73.08  2.22       0.98     4.92
MetalPunk   0.00     4.39        0.00   71.11      20.59    2.46
PopRock     0.00     11.40       11.54  22.22      59.80    6.56
World       7.19     5.26        7.69   0.00       5.88     68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     286      3           1     0          3        18
Electronic  0        87          1     1          9        5
Jazz        5        4           17    0          0        9
MetalPunk   0        4           1     36         18       4
PopRock     1        10          3     7          68       13
World       28       6           3     1          4        73
Total       320      114         26    45         102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     89.38    2.63        3.85   0.00       2.94     14.75
Electronic  0.00     76.32       3.85   2.22       8.82     4.10
Jazz        1.56     3.51        65.38  0.00       0.00     7.38
MetalPunk   0.00     3.51        3.85   80.00      17.65    3.28
PopRock     0.31     8.77        11.54  15.56      66.67    10.66
World       8.75     5.26        11.54  2.22       3.92     59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     300      0           1     0          0        9
Electronic  0        96          1     1          9        9
Jazz        2        1           21    0          0        1
MetalPunk   0        1           0     34         8        1
PopRock     1        9           2     9          80       16
World       17       7           1     1          5        86
Total       320      114         26    45         102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     93.75    0.00        3.85   0.00       0.00     7.38
Electronic  0.00     84.21       3.85   2.22       8.82     7.38
Jazz        0.63     0.88        80.77  0.00       0.00     0.82
MetalPunk   0.00     0.88        0.00   75.56      7.84     0.82
PopRock     0.31     7.89        7.69   20.00      78.43    13.11
World       5.31     6.14        3.85   2.22       4.90     70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2

denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case As in the row-based case, however, the combined feature vector again achieves the best performance Table 34 shows the corresponding confusion matrices

Table 33 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector

Feature Set               CA (%)
SMMFCC2                   70.64
SMOSC2                    68.59
SMASE2                    71.74
SMMFCC2+SMOSC2+SMASE2     78.60

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4

Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19

World 33 10 3 0 9 54 Total 320 114 26 45 102 122

46

(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803

Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557

MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557

World 1031 877 1154 000 882 4426

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6

Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10

World 40 6 2 1 12 51 Total 320 114 26 45 102 122

(b) (%)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        81.88        1.75   0.00       0.00     2.94  27.05
Electronic      0.00       72.81   0.00       2.22     8.82   4.92
Jazz            5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk       0.31        4.39   0.00      73.33    20.59   1.64
PopRock         0.00       14.91  15.38      22.22    50.00   8.20
World          12.50        5.26   7.69       2.22    11.76  41.80

(c)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          277           0      0          0        2     29
Electronic         0          83      0          1        5      2
Jazz               9           3     17          1        2     15
MetalPunk          1           5      1         35       24      7
PopRock            2          13      1          8       57     15
World             31          10      7          0       12     54
Total            320         114     26         45      102    122


(c) (%)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        86.56        0.00   0.00       0.00     1.96  23.77
Electronic      0.00       72.81   0.00       2.22     4.90   1.64
Jazz            2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk       0.31        4.39   3.85      77.78    23.53   5.74
PopRock         0.63       11.40   3.85      17.78    55.88  12.30
World           9.69        8.77  26.92       0.00    11.76  44.26

(d)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          289           5      0          0        3     18
Electronic         0          89      0          2        4      4
Jazz               2           3     19          0        1     10
MetalPunk          2           2      0         38       21      2
PopRock            0          12      5          4       61     11
World             27           3      2          1       12     77
Total            320         114     26         45      102    122

(d) (%)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        90.31        4.39   0.00       0.00     2.94  14.75
Electronic      0.00       78.07   0.00       4.44     3.92   3.28
Jazz            0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk       0.63        1.75   0.00      84.44    20.59   1.64
PopRock         0.00       10.53  19.23       8.89    59.80   9.02
World           8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. In each part, the matrix of track counts is followed by the corresponding percentages.

(a)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          300           2      1          0        3     19
Electronic         0          86      0          1        7      5
Jazz               2           0     18          0        0      3
MetalPunk          1           4      0         35       18      2
PopRock            1          16      4          8       67     13
World             16           6      3          1        7     80
Total            320         114     26         45      102    122

(a) (%)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        1.75   3.85       0.00     2.94  15.57
Electronic      0.00       75.44   0.00       2.22     6.86   4.10
Jazz            0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk       0.31        3.51   0.00      77.78    17.65   1.64
PopRock         0.31       14.04  15.38      17.78    65.69  10.66
World           5.00        5.26  11.54       2.22     6.86  65.57


(b)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          300           0      0          0        1     13
Electronic         0          90      1          2        9      6
Jazz               0           0     21          0        0      4
MetalPunk          0           2      0         31       21      2
PopRock            0          11      3         10       64     10
World             20          11      1          2        7     87
Total            320         114     26         45      102    122

(b) (%)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        0.00   0.00       0.00     0.98  10.66
Electronic      0.00       78.95   3.85       4.44     8.82   4.92
Jazz            0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk       0.00        1.75   0.00      68.89    20.59   1.64
PopRock         0.00        9.65  11.54      22.22    62.75   8.20
World           6.25        9.65   3.85       4.44     6.86  71.31

(c)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          296           2      1          0        0     17
Electronic         1          91      0          1        4      3
Jazz               0           2     19          0        0      5
MetalPunk          0           2      1         34       20      8
PopRock            2          13      4          8       71      8
World             21           4      1          2        7     81
Total            320         114     26         45      102    122

(c) (%)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        92.50        1.75   3.85       0.00     0.00  13.93
Electronic      0.31       79.82   0.00       2.22     3.92   2.46
Jazz            0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk       0.00        1.75   3.85      75.56    19.61   6.56
PopRock         0.63       11.40  15.38      17.78    69.61   6.56
World           6.56        3.51   3.85       4.44     6.86  66.39


(d)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          300           2      0          0        0      8
Electronic         2          95      0          2        7      9
Jazz               1           1     20          0        0      0
MetalPunk          0           0      0         35       10      1
PopRock            1          10      3          7       79     11
World             16           6      3          1        6     93
Total            320         114     26         45      102    122

(d) (%)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        1.75   0.00       0.00     0.00   6.56
Electronic      0.63       83.33   0.00       4.44     6.86   7.38
Jazz            0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk       0.00        0.00   0.00      77.78     9.80   0.82
PopRock         0.31        8.77  11.54      15.56    77.45   9.02
World           5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value, whereas we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) of MSCs & MSVs versus the modulation subband energy (MSE) for each feature set

Feature Set                    MSCs & MSVs     MSE
SMMFCC1                             77.50     72.02
SMMFCC2                             70.64     69.82
SMMFCC3                             80.38     79.15
SMOSC1                              79.15     77.50
SMOSC2                              68.59     70.51
SMOSC3                              81.34     80.11
SMASE1                              77.78     76.41
SMASE2                              71.74     71.06
SMASE3                              81.21     79.15
SMMFCC1+SMOSC1+SMASE1               84.64     85.08
SMMFCC2+SMOSC2+SMASE2               78.60     79.01
SMMFCC3+SMOSC3+SMASE3               85.32     85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.
[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, 2006.
[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, 2005, pp. 197-200.
[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performances using low-level audio features, IEEE Trans. on Speech and Audio Processing 13 (2) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histograms in audio and symbolic music information retrieval, Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech and Language Processing 15 (4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.


where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the magnitude of the n-th frequency bin of the t-th frame.

(4) Spectral Bandwidth: spectral bandwidth determines the frequency bandwidth of the signal:

\[ SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]} \]

(5) Spectral Roll-off: spectral roll-off is a measure of spectral shape. It is defined as the frequency Rt below which 85% of the magnitude distribution is concentrated:

\[ \sum_{k=0}^{R_t} S[k] = 0.85 \times \sum_{k=0}^{N-1} S[k] \]

(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:

\[ SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2 \]

where Nt[k] and Nt-1[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
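As an illustration only (not the author's code), the three measures above can be computed from a frame's magnitude spectrum as follows; the spectral centroid Ct is recomputed here because the bandwidth depends on it, and the function name and NumPy implementation are assumptions:

```python
import numpy as np

def spectral_features(M_prev, M):
    """Bandwidth, roll-off, and flux for one frame.

    M is the magnitude spectrum M_t[n] of the current frame,
    M_prev that of the previous frame (both of length N).
    """
    N = len(M)
    n = np.arange(N)
    # Spectral centroid C_t: magnitude-weighted mean frequency bin.
    C = np.sum(n * M) / np.sum(M)
    # Spectral bandwidth: magnitude-weighted squared spread around C_t.
    SB = np.sum(((n - C) ** 2) * M) / np.sum(M)
    # Spectral roll-off: smallest R_t with 85% of the magnitude below it.
    cumulative = np.cumsum(M)
    R = int(np.searchsorted(cumulative, 0.85 * cumulative[-1]))
    # Spectral flux: squared difference of normalized magnitude spectra.
    norm_prev = M_prev / np.sum(M_prev)
    norm_cur = M / np.sum(M)
    SF = np.sum((norm_cur - norm_prev) ** 2)
    return C, SB, R, SF
```

For a spectrum concentrated in a single bin, the bandwidth is zero and the roll-off equals that bin, which is a quick sanity check on the definitions.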

(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone. The mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.


(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the periods of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term features

To find the representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, the autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most widely used method to integrate the short-term features. Let xi = [xi[0], xi[1], ..., xi[D-1]]T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

\[ \mu[d] = \frac{1}{T} \sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1 \]

\[ \sigma[d] = \left[ \frac{1}{T} \sum_{i=0}^{T-1} \left( x_i[d] - \mu[d] \right)^2 \right]^{1/2}, \quad 0 \le d \le D-1 \]

where T is the number of frames of the input signal. This statistical method conveys no information about the relationship between features or about the time-varying behavior of music signals.
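The aggregation above can be sketched as follows (a hypothetical helper, not from the thesis), using the 1/T (population) form of the standard deviation from the equations:

```python
import numpy as np

def mean_std_vector(X):
    """X: T x D array whose rows are the short-term feature vectors x_i.
    Returns the 2D-dimensional long-term vector [mu; sigma]."""
    mu = X.mean(axis=0)                            # mu[d]
    sigma = np.sqrt(((X - mu) ** 2).mean(axis=0))  # sigma[d], 1/T normalization
    return np.concatenate([mu, sigma])
```

Note that the result has fixed dimension 2D regardless of the track length T, which is what makes this a long-term (track-level) representation.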

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used an AR model to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the coefficient dimension is p × D × D, where D is the feature dimension of a short-term feature vector.

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.
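The general technique can be sketched as follows: collect each feature value's trajectory across frames and take an FFT along the time axis, so that energy at low modulation frequencies (for example, around 4 Hz) reflects slow temporal variation. This NumPy sketch illustrates the idea and is not the thesis implementation:

```python
import numpy as np

def modulation_spectrogram(F, frame_rate, n_fft=256):
    """F: D x T array; row d is the trajectory of feature value d over
    T analysis frames.  An FFT along the time axis yields, per feature
    dimension, the magnitude at each modulation frequency in the range
    0 .. frame_rate/2 Hz (frame_rate = feature frames per second)."""
    spec = np.abs(np.fft.rfft(F, n=n_fft, axis=1))        # D x (n_fft/2 + 1)
    mod_freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate)
    return spec, mod_freqs
```

A feature trajectory oscillating at 4 Hz produces a peak in the modulation spectrum near the 4 Hz modulation-frequency bin, which is how periodic temporal structure becomes visible in this representation.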

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, which is complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes. The optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not take the class-wise differences into account.
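A minimal numerical sketch of the transformation described above, assuming the standard Fisher scatter-matrix formulation (the thesis's exact procedure is detailed in Chapter 2):

```python
import numpy as np

def lda_transform(X, y, d):
    """Fisher LDA sketch: maximize between-class over within-class scatter.
    X: n_samples x n_features, y: integer class labels, d: target dimension."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    n_feat = X.shape[1]
    Sw = np.zeros((n_feat, n_feat))   # within-class scatter
    Sb = np.zeros((n_feat, n_feat))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Columns of W are the leading eigenvectors of pinv(Sw) @ Sb.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-eigvals.real)
    W = eigvecs[:, order[:d]].real
    return X @ W
```

Projecting two well-separated classes onto the leading discriminant direction keeps their means far apart while keeping each class compact, which is exactly the minimize-within/maximize-between criterion stated above.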

1.2.3 Feature classifiers

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. The experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree classifier, of a Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and decision tree classifier achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bağci and Erzin [8] constructed a novel frame-based music genre classification system. In their classification system, some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which are unable to be correctly classified, and the GMM model of each music genre is updated with each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-Hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and to extract features from the high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, some experiments are presented to show the effectiveness of the proposed method. Finally, conclusions are given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

\[ \hat{s}[n] = s[n] - a \times s[n-1] \tag{1} \]

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

\[ \tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1 \tag{2} \]

where the Hamming window function w[n] is defined as

\[ w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{3} \]


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

\[ X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1 \tag{4} \]

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

\[ E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 \tag{5} \]

where B is the total number of filters (B is 25 in this study), and I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

\[ I_b^l = \frac{f_b^l}{f_s} \times N, \quad I_b^h = \frac{f_b^h}{f_s} \times N \tag{6} \]

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT to the logarithm of E(b):

\[ MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\left(1 + E_i(b)\right) \cos\!\left(\frac{\pi l (b + 0.5)}{B}\right), \quad 0 \le l < L \tag{7} \]

where L is the length of the MFCC feature vector (L is 20 in this study).


Therefore, the MFCC feature vector can be represented as follows:

\[ \mathbf{x}_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T \tag{8} \]
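Steps 1-6 above can be sketched for a single frame as follows. This is an illustration, not the thesis code: the band edges here are generated from the standard Mel-scale formula rather than taken from Table 2.1, and the function names are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(s, fs, B=25, L=20, a=0.95):
    """MFCC of one frame s: pre-emphasis, Hamming window, FFT,
    Mel-band energies E(b), then DCT of log10(1 + E(b))."""
    # Step 1: pre-emphasis  s^[n] = s[n] - a * s[n-1]
    s = np.append(s[0], s[1:] - a * s[:-1])
    N = len(s)
    # Step 3: Hamming window  w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1))
    s = s * (0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1)))
    # Step 4: spectrum and squared amplitude A[k] = |X[k]|^2
    A = np.abs(np.fft.fft(s)) ** 2
    # Step 5: band energies over B Mel-spaced bands up to fs/2
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), B + 1))
    idx = np.floor(edges_hz / fs * N).astype(int)
    E = np.array([A[idx[b]:max(idx[b + 1], idx[b] + 1)].sum() for b in range(B)])
    # Step 6: DCT of log10(1 + E(b)), eq (7)
    b = np.arange(B)
    return np.array([np.sum(np.log10(1.0 + E) * np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])
```

Applying this to each frame of a track yields the per-frame MFCC trajectories whose modulation spectra are analyzed later in the thesis.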

Fig. 2.1 The flowchart for computing MFCC: Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC.


Table 2.1 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

\[ E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1 \tag{9} \]

where B is the number of subbands, and I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

\[ I_b^l = \frac{f_b^l}{f_s} \times N, \quad I_b^h = \frac{f_b^h}{f_s} \times N \tag{10} \]

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, ..., M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ ... ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:


\[ Peak(b) = \log\!\left(1 + \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i}\right) \tag{11} \]

\[ Valley(b) = \log\!\left(1 + \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b - i + 1}\right) \tag{12} \]

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

\[ SC(b) = Peak(b) - Valley(b) \tag{13} \]

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

\[ \mathbf{x}_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T \tag{14} \]
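Steps 2-3 above can be sketched per frame as follows (an illustrative helper, not the thesis code; the band-edge list is assumed to be derived from Table 2.2):

```python
import numpy as np

def osc_frame(A, band_edges, alpha=0.2):
    """OSC sketch for one frame.  A: squared-amplitude spectrum A_i[k];
    band_edges: list of (I_b_low, I_b_high) FFT-bin index pairs.
    Returns [Valley(0..B-1), SC(0..B-1)] as in eq (14)."""
    valleys, contrasts = [], []
    for lo, hi in band_edges:
        M = np.sort(A[lo:hi + 1])[::-1]        # magnitudes in decreasing order
        Nb = len(M)
        k = max(1, int(round(alpha * Nb)))     # alpha * N_b neighborhood
        peak = np.log(1.0 + M[:k].mean())      # eq (11): strongest alpha*N_b bins
        valley = np.log(1.0 + M[-k:].mean())   # eq (12): weakest alpha*N_b bins
        valleys.append(valley)
        contrasts.append(peak - valley)        # eq (13)
    return np.array(valleys + contrasts)
```

Since the peak averages the largest bins and the valley the smallest, each contrast SC(b) is non-negative, and it grows with the spread between harmonic peaks and the noise floor within the subband.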

Fig. 2.2 The flowchart for computing OSC: Input Signal → Framing → FFT → Octave-scale filtering → Peak/Valley Selection → Spectral Contrast → OSC.


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE is defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

\[ P(k) = \begin{cases} \dfrac{1}{N E_w} \, |X(k)|^2, & k = 0, \; k = N/2 \\[2ex] \dfrac{2}{N E_w} \, |X(k)|^2, & 0 < k < N/2 \end{cases} \tag{15} \]

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|²    (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands

spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a
spectrum of 8 octaves (see Fig 24). The subband decomposition can be
described as follows (see Table 23):

ASE_i(b) = Σ_{k=I_{b,l}}^{I_{b,h}} P_i(k),   0 ≤ b < B, 0 ≤ k ≤ N/2 − 1    (17)

where B is the number of logarithmic subbands within the frequency range

[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of

the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16
and r = 1/2 in this study):

r = 2^j octaves,   −4 ≤ j ≤ 3    (18)

Ibl and Ibh are the low-frequency index and high-frequency index of the b-th

band-pass filter given as

I_{b,l} = ⌊f_{b,l}·N/f_s⌋,   I_{b,h} = ⌊f_{b,h}·N/f_s⌋    (19)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and
high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power
spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_{b,l}}^{I_{b,h}} P(k),   0 ≤ b ≤ B + 1    (20)

Each ASE coefficient is then converted to the decibel scale

ASE_dB(b) = 10·log₁₀(ASE(b)),   0 ≤ b ≤ B + 1    (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R

NASE(b) = ASE_dB(b) / R,   0 ≤ b ≤ B + 1    (22)

where the RMS-norm gain value R is defined as

R = √( Σ_{b=0}^{B+1} (ASE_dB(b))² )    (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge, a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge, a coefficient representing
power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension

of NASE is B+3 Thus the NASE feature vector of an audio frame will be

represented as follows

xNASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T    (24)
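The decibel conversion and RMS normalization of Eqs. (21)-(23) can be sketched as below. This is an illustrative sketch under the assumption that the subband power sums (ASE values) have already been computed; the function name is hypothetical.

```python
import numpy as np

def nase_from_ase(ase):
    """Convert subband power sums (ASE) to NASE coefficients.

    ase: positive subband power values (one below-loEdge band,
         B log-spaced bands, one above-hiEdge band).
    Returns (R, nase): the RMS-norm gain value and the normalized vector.
    """
    ase_db = 10.0 * np.log10(np.asarray(ase, dtype=float))  # decibel scale, Eq. (21)
    r = np.sqrt(np.sum(ase_db ** 2))                        # RMS-norm gain, Eq. (23)
    return r, ase_db / r                                    # normalization, Eq. (22)
```

Note that the returned NASE vector has unit Euclidean norm by construction, which is why R must be carried along as an extra feature.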

Fig 23 The flowchart for computing NASE (Input Signal → Framing → Windowing →
FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 1/2 (1 coefficient below loEdge = 62.5 Hz, 16 coefficients between loEdge and
hiEdge = 16 kHz, and 1 coefficient above hiEdge)

Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

214 Modulation Spectral Analysis

MFCC OSC and NASE capture only short-term frame-based characteristics of

audio signals. In order to capture the time-varying behavior of the music signals, we
employ modulation spectral analysis on MFCC, OSC, and NASE to observe the

variations of the sound

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W:

M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{(t×W/2)+n}[l] · e^{−j2πnm/W},   0 ≤ m < W, 0 ≤ l < L    (25)

where Mt(m l) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and l is the MFCC coefficient index In

this study W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

M̄^MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|,   0 ≤ m < W, 0 ≤ l < L    (26)

where T is the total number of texture windows in the music track
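The per-trajectory FFT and time averaging of Eqs. (25)-(26) can be sketched as follows. This is an illustrative sketch, not the author's code; it assumes the frame-level features (MFCC here, equally OSC or NASE) are already arranged in a frames-by-dimensions matrix, and `modulation_spectrogram` is a hypothetical name.

```python
import numpy as np

def modulation_spectrogram(feat, w=512):
    """Averaged modulation spectrum of frame-level features (Eqs. 25-26).

    feat: (num_frames, L) matrix, one feature vector per analysis frame.
    w: texture-window length in frames (50% overlap between windows).
    Returns a (w, L) matrix: |FFT| along time, averaged over texture windows.
    """
    hop = w // 2                                       # 50% overlap
    num_frames, num_feat = feat.shape
    starts = range(0, num_frames - w + 1, hop)
    spec = np.zeros((w, num_feat))
    for s in starts:                                   # FFT along time, per feature dim
        spec += np.abs(np.fft.fft(feat[s:s + w], axis=0))
    return spec / len(list(starts))                    # time-average, Eq. (26)
```

Each column of the result is one feature value's modulation spectrum; each row is one modulation frequency bin.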

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^MFCC(j, l) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} ( M̄^MFCC(m, l) )    (27)

MSV^MFCC(j, l) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} ( M̄^MFCC(m, l) )    (28)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^MFCC(j, l) = MSP^MFCC(j, l) − MSV^MFCC(j, l)    (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MMFCC is 2×20×8 = 320.
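The peak/valley/contrast determination of Eqs. (27)-(29) reduces to a max/min per modulation subband. A minimal sketch (hypothetical function name; `band_edges` would hold the index ranges of Table 24):

```python
import numpy as np

def msc_msv(mod_spec, band_edges):
    """Modulation spectral peak/valley/contrast per subband (Eqs. 27-29).

    mod_spec: (W, L) averaged modulation spectrum (modulation freq x feature dim).
    band_edges: list of (lo, hi) modulation-frequency index pairs, one per subband.
    Returns (msc, msv), each of shape (num_subbands, L).
    """
    num_bands, num_feat = len(band_edges), mod_spec.shape[1]
    msc = np.zeros((num_bands, num_feat))
    msv = np.zeros((num_bands, num_feat))
    for j, (lo, hi) in enumerate(band_edges):
        band = mod_spec[lo:hi]             # rows falling in the j-th subband
        msp = band.max(axis=0)             # peak: dominant rhythmic component
        msv[j] = band.min(axis=0)          # valley: non-rhythmic floor
        msc[j] = msp - msv[j]              # contrast = peak - valley, Eq. (29)
    return msc, msv
```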

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} OSC_{(t×W/2)+n}[d] · e^{−j2πnm/W},   0 ≤ m < W, 0 ≤ d < D    (30)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the OSC coefficient index In the

study W is 512, which is about 6 seconds, with 50% overlap between two

successive texture windows The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

M̄^OSC(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,   0 ≤ m < W, 0 ≤ d < D    (31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^OSC(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} ( M̄^OSC(m, d) )    (32)

MSV^OSC(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} ( M̄^OSC(m, d) )    (33)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d)    (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} NASE_{(t×W/2)+n}[d] · e^{−j2πnm/W},   0 ≤ m < W, 0 ≤ d < D    (35)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the NASE coefficient index In

this study W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

M̄^NASE(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,   0 ≤ m < W, 0 ≤ d < D    (36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands (see Table 24).
In the study the number of modulation subbands is 8 (J = 8). The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

MSP^NASE(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} ( M̄^NASE(m, d) )    (37)

MSV^NASE(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} ( M̄^NASE(m, d) )    (38)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d)    (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MASE is 2×19×8 = 304.

Fig 27 The flowchart for extracting MASE (Music signal → Framing → NASE
extraction → DFT along each NASE trajectory → Windowing/Average of the
modulation spectra over texture windows → Contrast/Valley determination → MASE)

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space, the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

u^MFCC_{MSC-row}(l) = (1/J) Σ_{j=0}^{J−1} MSC^MFCC(j, l)    (40)

σ^MFCC_{MSC-row}(l) = ( (1/J) Σ_{j=0}^{J−1} (MSC^MFCC(j, l) − u^MFCC_{MSC-row}(l))² )^{1/2}    (41)

u^MFCC_{MSV-row}(l) = (1/J) Σ_{j=0}^{J−1} MSV^MFCC(j, l)    (42)

σ^MFCC_{MSV-row}(l) = ( (1/J) Σ_{j=0}^{J−1} (MSV^MFCC(j, l) − u^MFCC_{MSV-row}(l))² )^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as

f^MFCC_row = [u^MFCC_{MSC-row}(0), σ^MFCC_{MSC-row}(0), u^MFCC_{MSV-row}(0), σ^MFCC_{MSV-row}(0), …,
              u^MFCC_{MSC-row}(L−1), σ^MFCC_{MSC-row}(L−1), u^MFCC_{MSV-row}(L−1), σ^MFCC_{MSV-row}(L−1)]^T    (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)

column of the MSC and MSV matrices can be computed as follows

u^MFCC_{MSC-col}(j) = (1/L) Σ_{l=0}^{L−1} MSC^MFCC(j, l)    (45)

σ^MFCC_{MSC-col}(j) = ( (1/L) Σ_{l=0}^{L−1} (MSC^MFCC(j, l) − u^MFCC_{MSC-col}(j))² )^{1/2}    (46)

u^MFCC_{MSV-col}(j) = (1/L) Σ_{l=0}^{L−1} MSV^MFCC(j, l)    (47)

σ^MFCC_{MSV-col}(j) = ( (1/L) Σ_{l=0}^{L−1} (MSV^MFCC(j, l) − u^MFCC_{MSV-col}(j))² )^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f^MFCC_col = [u^MFCC_{MSC-col}(0), σ^MFCC_{MSC-col}(0), u^MFCC_{MSV-col}(0), σ^MFCC_{MSV-col}(0), …,
              u^MFCC_{MSC-col}(J−1), σ^MFCC_{MSC-col}(J−1), u^MFCC_{MSV-col}(J−1), σ^MFCC_{MSV-col}(J−1)]^T    (49)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained:

f^MFCC = [(f^MFCC_row)^T, (f^MFCC_col)^T]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is
80+32 = 112.
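The row- and column-wise aggregation of Eqs. (40)-(50) can be sketched compactly with NumPy. This is an illustrative sketch (hypothetical function name); the exact interleaving order of the feature values is immaterial for the classifier as long as it is used consistently.

```python
import numpy as np

def aggregate(msc, msv):
    """Mean/std along rows and columns of the MSC and MSV matrices.

    msc, msv: (J, L) matrices (modulation subband x feature dimension).
    Returns the concatenated feature vector of length 4L + 4J.
    """
    parts = []
    for m in (msc, msv):
        parts += [m.mean(axis=0), m.std(axis=0)]   # row-based: per feature dim (4L)
    for m in (msc, msv):
        parts += [m.mean(axis=1), m.std(axis=1)]   # column-based: per subband (4J)
    return np.concatenate(parts)
```

With J = 8 and L = 20 (MFCC) this yields the stated 80 + 32 = 112 values; with D = 19 (NASE) it yields 76 + 32 = 108.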

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of

the MSC and MSV matrices of MOSC can be computed as follows

u^OSC_{MSC-row}(d) = (1/J) Σ_{j=0}^{J−1} MSC^OSC(j, d)    (51)

σ^OSC_{MSC-row}(d) = ( (1/J) Σ_{j=0}^{J−1} (MSC^OSC(j, d) − u^OSC_{MSC-row}(d))² )^{1/2}    (52)

u^OSC_{MSV-row}(d) = (1/J) Σ_{j=0}^{J−1} MSV^OSC(j, d)    (53)

σ^OSC_{MSV-row}(d) = ( (1/J) Σ_{j=0}^{J−1} (MSV^OSC(j, d) − u^OSC_{MSV-row}(d))² )^{1/2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f^OSC_row = [u^OSC_{MSC-row}(0), σ^OSC_{MSC-row}(0), u^OSC_{MSV-row}(0), σ^OSC_{MSV-row}(0), …,
             u^OSC_{MSC-row}(D−1), σ^OSC_{MSC-row}(D−1), u^OSC_{MSV-row}(D−1), σ^OSC_{MSV-row}(D−1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

u^OSC_{MSC-col}(j) = (1/D) Σ_{d=0}^{D−1} MSC^OSC(j, d)    (56)

σ^OSC_{MSC-col}(j) = ( (1/D) Σ_{d=0}^{D−1} (MSC^OSC(j, d) − u^OSC_{MSC-col}(j))² )^{1/2}    (57)

u^OSC_{MSV-col}(j) = (1/D) Σ_{d=0}^{D−1} MSV^OSC(j, d)    (58)

σ^OSC_{MSV-col}(j) = ( (1/D) Σ_{d=0}^{D−1} (MSV^OSC(j, d) − u^OSC_{MSV-col}(j))² )^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f^OSC_col = [u^OSC_{MSC-col}(0), σ^OSC_{MSC-col}(0), u^OSC_{MSV-col}(0), σ^OSC_{MSV-col}(0), …,
             u^OSC_{MSC-col}(J−1), σ^OSC_{MSC-col}(J−1), u^OSC_{MSV-col}(J−1), σ^OSC_{MSV-col}(J−1)]^T    (60)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

f^OSC = [(f^OSC_row)^T, (f^OSC_col)^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D+4J. That is, the overall feature dimension of SMOSC is
80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows:

u^NASE_{MSC-row}(d) = (1/J) Σ_{j=0}^{J−1} MSC^NASE(j, d)    (62)

σ^NASE_{MSC-row}(d) = ( (1/J) Σ_{j=0}^{J−1} (MSC^NASE(j, d) − u^NASE_{MSC-row}(d))² )^{1/2}    (63)

u^NASE_{MSV-row}(d) = (1/J) Σ_{j=0}^{J−1} MSV^NASE(j, d)    (64)

σ^NASE_{MSV-row}(d) = ( (1/J) Σ_{j=0}^{J−1} (MSV^NASE(j, d) − u^NASE_{MSV-row}(d))² )^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f^NASE_row = [u^NASE_{MSC-row}(0), σ^NASE_{MSC-row}(0), u^NASE_{MSV-row}(0), σ^NASE_{MSV-row}(0), …,
              u^NASE_{MSC-row}(D−1), σ^NASE_{MSC-row}(D−1), u^NASE_{MSV-row}(D−1), σ^NASE_{MSV-row}(D−1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

u^NASE_{MSC-col}(j) = (1/D) Σ_{d=0}^{D−1} MSC^NASE(j, d)    (67)

σ^NASE_{MSC-col}(j) = ( (1/D) Σ_{d=0}^{D−1} (MSC^NASE(j, d) − u^NASE_{MSC-col}(j))² )^{1/2}    (68)

u^NASE_{MSV-col}(j) = (1/D) Σ_{d=0}^{D−1} MSV^NASE(j, d)    (69)

σ^NASE_{MSV-col}(j) = ( (1/D) Σ_{d=0}^{D−1} (MSV^NASE(j, d) − u^NASE_{MSV-col}(j))² )^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f^NASE_col = [u^NASE_{MSC-col}(0), σ^NASE_{MSC-col}(0), u^NASE_{MSV-col}(0), σ^NASE_{MSV-col}(0), …,
              u^NASE_{MSC-col}(J−1), σ^NASE_{MSC-col}(J−1), u^NASE_{MSV-col}(J−1), σ^NASE_{MSV-col}(J−1)]^T    (71)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

f^NASE = [(f^NASE_row)^T, (f^NASE_col)^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D+4J. That is, the overall feature dimension of SMASE is
76+32 = 108.

Fig 28 The row-based modulation spectral feature values: the mean μ_row,d and
standard deviation σ_row,d are computed along each row (feature dimension d) of the
MSC/MSV matrix, i.e., across all modulation subbands.

Fig 29 The column-based modulation spectral feature values: the mean μ_col,j and
standard deviation σ_col,j are computed along each column (modulation subband j) of
the MSC/MSV matrix, i.e., across all feature dimensions.

216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c
is the number of training music signals belonging to the c-th music genre. Since the
dynamic ranges of different feature values may differ, a linear normalization is
applied to get the normalized feature vector f̂_c:

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)),   1 ≤ c ≤ C    (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th
representative feature vector, and f_max(m) and f_min(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre.

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy in a lower-dimensional feature space. LDA deals with the
discrimination between various classes rather than the representation of all classes.
The objective of LDA is to minimize the within-class distance while maximizing the

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class

c C is the total number of music classes and Nc is the number of training vectors

labeled as class c The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T    (77)

where x̄ is the mean vector of all training vectors. The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) )    (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space. In this study, a whitening procedure is integrated with the LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the
corresponding eigenvalues; thus S_W·Φ = Φ·Λ. Each training vector x is then
whitening-transformed by ΦΛ^{−1/2}:

x_w = (ΦΛ^{−1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix
S_{W,w} = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}), derived from all the whitened training
vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix
S_{B,w} = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A
transformation matrix Ψ can be determined by finding the eigenvectors of S_{B,w}.
Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors
corresponding to the (C−1) largest eigenvalues form the column vectors of the
transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix
A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ    (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower
h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

y = A_WLDA^T x    (81)
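The whitened LDA procedure of Eqs. (76)-(81) can be sketched with NumPy's symmetric eigensolver. This is an illustrative sketch of the described steps (hypothetical function name), assuming S_W is non-singular; production code would add regularization for small training sets.

```python
import numpy as np

def whitened_lda(x, labels, h):
    """Whitened LDA transform (Eqs. 76-81).

    x: (n, H) training vectors; labels: (n,) class ids; h: output dimension.
    Returns A of shape (H, h) such that y = A.T @ x reduces the dimension.
    """
    classes = np.unique(labels)
    mean_all = x.mean(axis=0)
    sw = np.zeros((x.shape[1], x.shape[1]))
    sb = np.zeros_like(sw)
    for c in classes:
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc)                            # Eq. (76)
        sb += len(xc) * np.outer(mc - mean_all, mc - mean_all)   # Eq. (77)
    lam, phi = np.linalg.eigh(sw)                 # S_W = Phi Lambda Phi^T
    white = phi @ np.diag(1.0 / np.sqrt(lam))     # whitening: Phi Lambda^{-1/2}
    sb_w = white.T @ sb @ white                   # whitened between-class scatter
    lam_b, psi = np.linalg.eigh(sb_w)
    psi = psi[:, np.argsort(lam_b)[::-1][:h]]     # top-h eigenvectors form Psi
    return white @ psi                            # A_WLDA = Phi Lambda^{-1/2} Psi, Eq. (80)
```

With C = 6 genres, at most C − 1 = 5 discriminative dimensions are available, so h ≤ 5 here.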

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA

transformed feature vector In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}    (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, ȳ_c is the representative feature vector of the

c-th music genre and Nc is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance. Thus the class index s that denotes the identified music genre is
determined by finding the representative feature vector that has the minimum Euclidean
distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c)    (83)

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this

study each MP3 audio file is first converted into raw digital audio before

classification. These music tracks are classified into six classes (that is, C = 6):
Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

CA = Σ_{c=1}^{C} P_c · CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the
classification accuracy for the c-th music genre.
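Since the class priors here are just the test-set proportions, Eq. (84) amounts to a count-weighted average of the per-genre accuracies; a minimal sketch (hypothetical function name):

```python
def overall_accuracy(class_counts, class_accuracies):
    """Overall accuracy as in Eq. (84): per-class accuracies weighted by priors.

    class_counts: number of test tracks per genre (defines P_c).
    class_accuracies: per-genre classification accuracy, as fractions in [0, 1].
    """
    total = sum(class_counts)
    return sum((n / total) * acc            # P_c * CA_c, summed over genres
               for n, acc in zip(class_counts, class_accuracies))
```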

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices.

Table 31 Averaged classification accuracy (CA) for each row-based modulation
spectral feature vector

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64

Table 32 Confusion matrices of the row-based modulation spectral feature vectors:
(a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       275       0         2       0          1       19
Electronic      0      91         0       1          7        6
Jazz            6       0        18       0          0        4
MetalPunk       2       3         0      36         20        4
PopRock         4      12         5       8         70       14
World          33       8         1       0          4       75
Total         320     114        26      45        102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94     0.00      7.69     0.00      0.98    15.57
Electronic    0.00    79.82      0.00     2.22      6.86     4.92
Jazz          1.88     0.00     69.23     0.00      0.00     3.28
MetalPunk     0.63     2.63      0.00    80.00     19.61     3.28
PopRock       1.25    10.53     19.23    17.78     68.63    11.48
World        10.31     7.02      3.85     0.00      3.92    61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       292       1         1       0          2       10
Electronic      1      89         1       2         11       11
Jazz            4       0        19       1          1        6
MetalPunk       0       5         0      32         21        3
PopRock         0      13         3      10         61        8
World          23       6         2       0          6       84
Total         320     114        26      45        102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25     0.88      3.85     0.00      1.96     8.20
Electronic    0.31    78.07      3.85     4.44     10.78     9.02
Jazz          1.25     0.00     73.08     2.22      0.98     4.92
MetalPunk     0.00     4.39      0.00    71.11     20.59     2.46
PopRock       0.00    11.40     11.54    22.22     59.80     6.56
World         7.19     5.26      7.69     0.00      5.88    68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       286       3         1       0          3       18
Electronic      0      87         1       1          9        5
Jazz            5       4        17       0          0        9
MetalPunk       0       4         1      36         18        4
PopRock         1      10         3       7         68       13
World          28       6         3       1          4       73
Total         320     114        26      45        102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38     2.63      3.85     0.00      2.94    14.75
Electronic    0.00    76.32      3.85     2.22      8.82     4.10
Jazz          1.56     3.51     65.38     0.00      0.00     7.38
MetalPunk     0.00     3.51      3.85    80.00     17.65     3.28
PopRock       0.31     8.77     11.54    15.56     66.67    10.66
World         8.75     5.26     11.54     2.22      3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300       0         1       0          0        9
Electronic      0      96         1       1          9        9
Jazz            2       1        21       0          0        1
MetalPunk       0       1         0      34          8        1
PopRock         1       9         2       9         80       16
World          17       7         1       1          5       86
Total         320     114        26      45        102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      3.85     0.00      0.00     7.38
Electronic    0.00    84.21      3.85     2.22      8.82     7.38
Jazz          0.63     0.88     80.77     0.00      0.00     0.82
MetalPunk     0.00     0.88      0.00    75.56      7.84     0.82
PopRock       0.31     7.89      7.69    20.00     78.43    13.11
World         5.31     6.14      3.85     2.22      4.90    70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which differs from the row-based case. As in the row-based case, however, the
combined feature vector achieves the best performance. Table 34 shows the
corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA) for each column-based modulation
spectral feature vector

Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 34 Confusion matrices of the column-based modulation spectral feature vectors:
(a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       272       1         1       0          6       22
Electronic      0      84         0       2          8        4
Jazz           13       1        19       1          2       19
MetalPunk       2       7         0      39         30        4
PopRock         0      11         3       3         47       19
World          33      10         3       0          9       54
Total         320     114        26      45        102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.00     0.88      3.85     0.00      5.88    18.03
Electronic    0.00    73.68      0.00     4.44      7.84     3.28
Jazz          4.06     0.88     73.08     2.22      1.96    15.57
MetalPunk     0.63     6.14      0.00    86.67     29.41     3.28
PopRock       0.00     9.65     11.54     6.67     46.08    15.57
World        10.31     8.77     11.54     0.00      8.82    44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       262       2         0       0          3       33
Electronic      0      83         0       1          9        6
Jazz           17       1        20       0          6       20
MetalPunk       1       5         0      33         21        2
PopRock         0      17         4      10         51       10
World          40       6         2       1         12       51
Total         320     114        26      45        102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      81.88     1.75      0.00     0.00      2.94    27.05
Electronic    0.00    72.81      0.00     2.22      8.82     4.92
Jazz          5.31     0.88     76.92     0.00      5.88    16.39
MetalPunk     0.31     4.39      0.00    73.33     20.59     1.64
PopRock       0.00    14.91     15.38    22.22     50.00     8.20
World        12.50     5.26      7.69     2.22     11.76    41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       277       0         0       0          2       29
Electronic      0      83         0       1          5        2
Jazz            9       3        17       1          2       15
MetalPunk       1       5         1      35         24        7
PopRock         2      13         1       8         57       15
World          31      10         7       0         12       54
Total         320     114        26      45        102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      86.56     0.00      0.00     0.00      1.96    23.77
Electronic    0.00    72.81      0.00     2.22      4.90     1.64
Jazz          2.81     2.63     65.38     2.22      1.96    12.30
MetalPunk     0.31     4.39      3.85    77.78     23.53     5.74
PopRock       0.63    11.40      3.85    17.78     55.88    12.30
World         9.69     8.77     26.92     0.00     11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       289       5         0       0          3       18
Electronic      0      89         0       2          4        4
Jazz            2       3        19       0          1       10
MetalPunk       2       2         0      38         21        2
PopRock         0      12         5       4         61       11
World          27       3         2       1         12       77
Total         320     114        26      45        102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      90.31     4.39      0.00     0.00      2.94    14.75
Electronic    0.00    78.07      0.00     4.44      3.92     3.28
Jazz          0.63     2.63     73.08     0.00      0.98     8.20
MetalPunk     0.63     1.75      0.00    84.44     20.59     1.64
PopRock       0.00    10.53     19.23     8.89     59.80     9.02
World         8.44     2.63      7.69     2.22     11.76    63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC3                      80.38
SMOSC3                       81.34
SMASE3                       81.21
SMMFCC3+SMOSC3+SMASE3        85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of tracks)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(a) SMMFCC3 (classification accuracy, %)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75  3.85       0.00     2.94  15.57
Electronic     0.00       75.44  0.00       2.22     6.86   4.10
Jazz           0.63        0.00 69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51  0.00      77.78    17.65   1.64
PopRock        0.31       14.04 15.38      17.78    65.69  10.66
World          5.00        5.26 11.54       2.22     6.86  65.57

(b) SMOSC3 (number of tracks)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(b) SMOSC3 (classification accuracy, %)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00  0.00       0.00     0.98  10.66
Electronic     0.00       78.95  3.85       4.44     8.82   4.92
Jazz           0.00        0.00 80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75  0.00      68.89    20.59   1.64
PopRock        0.00        9.65 11.54      22.22    62.75   8.20
World          6.25        9.65  3.85       4.44     6.86  71.31

(c) SMASE3 (number of tracks)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(c) SMASE3 (classification accuracy, %)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75  3.85       0.00     0.00  13.93
Electronic     0.31       79.82  0.00       2.22     3.92   2.46
Jazz           0.00        1.75 73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75  3.85      75.56    19.61   6.56
PopRock        0.63       11.40 15.38      17.78    69.61   6.56
World          6.56        3.51  3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (classification accuracy, %)

            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75  0.00       0.00     0.00   6.56
Electronic     0.63       83.33  0.00       4.44     6.86   7.38
Jazz           0.31        0.88 76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00  0.00      77.78     9.80   0.82
PopRock        0.31        8.77 11.54      15.56    77.45   9.02
World          5.00        5.26 11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. However, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of MSC & MSV features and of the subband energy (MSE) features

Feature Set                  MSCs & MSVs   MSE
SMMFCC1                      77.50         72.02
SMMFCC2                      70.64         69.82
SMMFCC3                      80.38         79.15
SMOSC1                       79.15         77.50
SMOSC2                       68.59         70.51
SMOSC3                       81.34         80.11
SMASE1                       77.78         76.41
SMASE2                       71.74         71.06
SMASE3                       81.21         79.15
SMMFCC1+SMOSC1+SMASE1        84.64         85.08
SMMFCC2+SMOSC2+SMASE2        78.60         79.01
SMMFCC3+SMOSC3+SMASE3        85.32         85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR 2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR 2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, issue 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, issue 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, issue 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.


(8) Octave-based spectral contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized audio spectral envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE, called NASE.

1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the periods of the main beat and subbeats, and the relative strength of subbeats to the main beat. Many beat-tracking algorithms [18, 19] providing an estimate of the main beat and the corresponding strength have been proposed.

1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.

1.2.1.2 Long-term Features

To find the representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, the autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most commonly used method to integrate the short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

μ[d] = (1/T) Σ_{i=0}^{T-1} x_i[d],  0 ≤ d ≤ D-1,

σ[d] = [ (1/T) Σ_{i=0}^{T-1} (x_i[d] - μ[d])^2 ]^{1/2},  0 ≤ d ≤ D-1,

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features, nor about the time-varying behavior of music signals.
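The mean/standard-deviation integration above can be sketched in a few lines; a minimal illustration with a toy feature matrix (names and values are illustrative, not from the thesis):

```python
import numpy as np

# Collapse T short-term D-dimensional frame vectors into one
# 2D-dimensional long-term vector by mean and standard deviation.
def mean_std_integration(X):
    mu = X.mean(axis=0)                  # mu[d] = (1/T) * sum_i x_i[d]
    sigma = X.std(axis=0)                # population std (divisor T)
    return np.concatenate([mu, sigma])   # long-term feature, length 2D

X = np.array([[1.0, 2.0],                # T = 2 frames, D = 2 features
              [3.0, 4.0]])
v = mean_std_integration(X)              # -> [2.0, 3.0, 1.0, 1.0]
```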

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used an AR model to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
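The DAR idea can be sketched by fitting a p-order AR model to a single feature trajectory with least squares; this is an illustrative reconstruction under simplifying assumptions (the exact estimator of [9] may differ):

```python
import numpy as np

# Fit x[t] ~ a_1*x[t-1] + ... + a_p*x[t-p] by least squares and use
# the coefficients a_k as long-term features (one model per feature
# dimension, as in DAR).  Illustrative sketch only.
def dar_coefficients(x, p):
    T = len(x)
    # Column k holds the lag-(k+1) predictor x[t-1-k] for t = p..T-1
    A = np.column_stack([x[p - k - 1: T - k - 1] for k in range(p)])
    b = x[p:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])  # x[t] = 2 * x[t-1]
a = dar_coefficients(x, p=1)                    # -> approx [2.0]
```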

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Non-linear analysis of time series offers an alternative way to describe temporal structure, which is complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes. The optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all classes, which does not take the class-wise differences into account.
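The two-class special case of LDA can be sketched with plain numpy; this is an illustrative Fisher-discriminant reconstruction with synthetic data (the thesis derives the general multi-class transformation matrix in Chapter 2):

```python
import numpy as np

# Fisher LDA for two classes: find the direction w that maximizes
# between-class scatter relative to within-class scatter.
def lda_direction(X0, X1):
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix S_w (sum of both class scatters)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)   # Fisher's solution S_w^{-1}(m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, (200, 3))    # class 0 samples
X1 = rng.normal(2.0, 1.0, (200, 3))    # class 1 samples, shifted mean
w = lda_direction(X0, X1)              # projections X @ w separate the classes
```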

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. The experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system. In their classification system, a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance with/without a decision-tree classifier for a Gaussian classifier, a GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and the decision-tree classifier achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] use some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system. In their classification system, some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which are unable to be correctly classified, and each GMM model of a music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy can reach up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extracted features from these high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales, both in the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced. In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, a conclusion will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

ŝ[n] = s[n] - a × s[n-1]  (1)

where s[n] is the current sample, s[n-1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames is overlapped by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

s̃_i[n] = ŝ_i[n] · w[n],  0 ≤ n ≤ N-1  (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46 cos(2πn / (N-1)),  0 ≤ n ≤ N-1  (3)


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = Σ_{n=0}^{N-1} s̃_i[n] e^{-j2πkn/N},  0 ≤ k ≤ N-1  (4)

where k is the frequency index.

Step 5: Mel-scale Band-pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N-1  (5)

where B is the total number of filters (B is 25 in this study), and I_b^l and I_b^h denote respectively the low-frequency index and the high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|². I_b^l and I_b^h are given as

I_b^l = f_b^l / (f_s / N),  I_b^h = f_b^h / (f_s / N)  (6)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT to the logarithm of E(b):

MFCC_i(l) = Σ_{b=0}^{B-1} log10(1 + E_i(b)) cos(πl(b + 0.5) / B),  0 ≤ l < L  (7)

where L is the length of the MFCC feature vector (L is 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

x^MFCC = [MFCC(0), MFCC(1), ..., MFCC(L-1)]^T  (8)
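Steps 1-6 can be sketched for a single frame as below. Only the first four band edges of Table 2.1 are used here (the thesis uses B = 25 filters and L = 20 coefficients), the in-band sum follows the rectangular form of Eq. (5), and the test tone is illustrative:

```python
import numpy as np

fs, N = 22050, 512
band_edges_hz = [(0, 200), (100, 300), (200, 400), (300, 500)]  # Table 2.1 subset

def mfcc_frame(s, prev_sample=0.0, a=0.95, L=4):
    s_hat = s - a * np.concatenate(([prev_sample], s[:-1]))   # Eq. (1)
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))         # Eq. (3)
    A = np.abs(np.fft.fft(s_hat * w, N)) ** 2                 # Eq. (4), A_i[k]
    E = np.array([A[int(lo / (fs / N)): int(hi / (fs / N)) + 1].sum()
                  for lo, hi in band_edges_hz])               # Eqs. (5)-(6)
    B = len(E)
    b = np.arange(B)
    return np.array([np.sum(np.log10(1.0 + E) *
                            np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])                      # Eq. (7), DCT

frame = np.sin(2 * np.pi * 150.0 * np.arange(N) / fs)  # 150 Hz test tone
c = mfcc_frame(frame)                                  # 4-dim MFCC sketch
```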

Fig. 2.1 The flowchart for computing MFCC (pre-emphasis, framing, windowing, FFT, Mel-scale band-pass filtering, DCT)


Table 2.1 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N-1  (9)

where B is the number of subbands, and I_b^l and I_b^h denote respectively the low-frequency index and the high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|². I_b^l and I_b^h are given as

I_b^l = f_b^l / (f_s / N),  I_b^h = f_b^h / (f_s / N)  (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, ..., M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ ... ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = log( (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,i} )  (11)

Valley(b) = log( (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b-i+1} )  (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)  (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x^OSC = [Valley(0), ..., Valley(B-1), SC(0), ..., SC(B-1)]^T  (14)
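The peak/valley estimation of Eqs. (11)-(13) for a single subband can be sketched as follows; the toy magnitude values are illustrative, not from the thesis:

```python
import numpy as np

# Given the magnitude-spectrum bins of one octave subband, estimate
# the spectral peak, valley, and contrast per Eqs. (11)-(13).
def osc_subband(mags, alpha=0.2):
    m = np.sort(mags)[::-1]                 # M_b,1 >= M_b,2 >= ... >= M_b,Nb
    k = max(1, int(round(alpha * len(m))))  # alpha * N_b neighborhood bins
    peak = np.log(m[:k].mean())             # Eq. (11): log of top-k average
    valley = np.log(m[-k:].mean())          # Eq. (12): log of bottom-k average
    return peak - valley, valley            # Eq. (13): SC(b), plus Valley(b)

mags = np.array([10.0, 9.0, 1.0, 0.5, 0.2, 8.0, 0.1, 7.0, 0.3, 6.0])
sc, valley = osc_subband(mags)              # a peaky subband gives a large SC
```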

Fig. 2.2 The flowchart for computing OSC (framing, FFT, octave-scale filtering, peak/valley selection, spectral contrast)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = (1 / (N·E_w)) |X(k)|²,  k = 0, N/2
P(k) = (2 / (N·E_w)) |X(k)|²,  0 < k < N/2  (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w-1} |w(n)|²  (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The subband filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = Σ_{k=I_b^l}^{I_b^h} P_i(k),  0 ≤ b < B  (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16, r = 1/2 in this study):

r = 2^j octaves,  -4 ≤ j ≤ 3  (18)

I_b^l and I_b^h are the low-frequency index and the high-frequency index of the b-th band-pass filter, given as

I_b^l = f_b^l / (f_s / N),  I_b^h = f_b^h / (f_s / N)  (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k),  0 ≤ b ≤ B+1  (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 log10(ASE(b)),  0 ≤ b ≤ B+1  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R,  0 ≤ b ≤ B+1  (22)

where the RMS-norm gain value R is defined as

R = sqrt( Σ_{b=0}^{B+1} (ASE_dB(b))² )  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing the power between 0 Hz and loEdge, a series of coefficients representing the power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing the power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B + 3. Thus, the NASE feature vector of an audio frame can be represented as follows:

x^NASE = [R, NASE(0), NASE(1), ..., NASE(B+1)]^T  (24)
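The normalization chain of Eqs. (21)-(24) can be sketched as follows, starting from per-subband power sums (the ASE of Eq. (20)); the toy `ase` values (four subband power sums) are made up for illustration:

```python
import numpy as np

# From ASE subband power sums to the RMS-normalized NASE vector,
# prefixed by the gain value R, per Eqs. (21)-(24).
def nase_from_ase(ase):
    ase_db = 10.0 * np.log10(ase)          # Eq. (21), decibel scale
    R = np.sqrt(np.sum(ase_db ** 2))       # Eq. (23), RMS-norm gain value
    nase = ase_db / R                      # Eq. (22), unit-norm envelope
    return np.concatenate(([R], nase))     # Eq. (24): [R, NASE(0), ...]^T

ase = np.array([1.0, 10.0, 100.0, 10.0])   # toy subband powers
v = nase_from_ase(ase)                     # gain R followed by unit-norm NASE
```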


Fig. 2.3 The flowchart for computing NASE (framing, windowing, FFT, subband decomposition, normalization)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (16 logarithmically spaced in-band coefficients between loEdge = 62.5 Hz and hiEdge = 16 kHz, plus one coefficient below loEdge and one above hiEdge)


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

214 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis


Let $MFCC_i(l)$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \left|\sum_{n=0}^{W-1} MFCC_{t\times W+n}(l)\, e^{-j2\pi nm/W}\right|, \qquad 0 \le m < W,\; 0 \le l < L \tag{25}$$

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, l), \qquad 0 \le m < W,\; 0 \le l < L \tag{26}$$

where T is the total number of texture windows in the music track
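Steps 1-2 above can be sketched compactly in Python. This is a minimal sketch assuming NumPy; the function name and the (num_frames × L) array layout are illustrative. It applies an FFT along each feature trajectory per texture window, as in Eq. (25), and time-averages the magnitudes, as in Eq. (26).

```python
import numpy as np

def modulation_spectrogram(features, W=512, hop=None):
    """Sketch of Eqs. (25)-(26). features: (num_frames, L) array of
    per-frame feature values (e.g. MFCC). Returns the (W, L) averaged
    magnitude modulation spectrogram."""
    if hop is None:
        hop = W // 2                        # 50% overlap between texture windows
    num_frames, L = features.shape
    windows = []
    for start in range(0, num_frames - W + 1, hop):
        seg = features[start:start + W]     # one texture window of length W
        # Eq. (25): FFT independently along the time axis of each trajectory.
        windows.append(np.abs(np.fft.fft(seg, axis=0)))
    # Eq. (26): time-average the magnitude modulation spectrograms.
    return np.mean(windows, axis=0)
```

With a frame rate of roughly 86 frames/s, W = 512 indeed corresponds to about 6 seconds, matching the thesis setup.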

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

$$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \tag{27}$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \tag{28}$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \tag{29}$$

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
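The contrast/valley determination of Eqs. (27)-(29) can be sketched as follows. This is a minimal sketch assuming NumPy; the function name and the list-of-ranges representation of the subband boundaries $\Phi_{j,l}$, $\Phi_{j,h}$ are illustrative.

```python
import numpy as np

def msc_msv(mod_spectrum, subband_edges):
    """Sketch of Eqs. (27)-(29). mod_spectrum: (W, L) averaged modulation
    spectrogram; subband_edges: J half-open (lo, hi) modulation-frequency
    index ranges. Returns the J x L MSC and MSV matrices."""
    J = len(subband_edges)
    L = mod_spectrum.shape[1]
    msc = np.empty((J, L))
    msv = np.empty((J, L))
    for j, (lo, hi) in enumerate(subband_edges):
        band = mod_spectrum[lo:hi]            # rows with Phi_jl <= m < Phi_jh
        msv[j] = band.min(axis=0)             # Eq. (28): modulation spectral valley
        msc[j] = band.max(axis=0) - msv[j]    # Eqs. (27) and (29): peak minus valley
    return msc, msv
```

For the thesis setup, subband_edges would be the eight index ranges [0,2), [2,4), ..., [128,256) of Table 24.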

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $OSC_i(d)$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \left|\sum_{n=0}^{W-1} OSC_{t\times W+n}(d)\, e^{-j2\pi nm/W}\right|, \qquad 0 \le m < W,\; 0 \le d < D \tag{30}$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, d), \qquad 0 \le m < W,\; 0 \le d < D \tag{31}$$

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated


$$MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \tag{32}$$

$$MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \tag{33}$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \tag{34}$$

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $NASE_i(d)$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \left|\sum_{n=0}^{W-1} NASE_{t\times W+n}(d)\, e^{-j2\pi nm/W}\right|, \qquad 0 \le m < W,\; 0 \le d < D \tag{35}$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, d), \qquad 0 \le m < W,\; 0 \le d < D \tag{36}$$

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \tag{37}$$

$$MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \tag{38}$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \tag{39}$$

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT of each feature-value trajectory within texture windows → time averaging of the modulation spectrograms → contrast/valley determination)

Table 24 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2) | [0, 0.33)
1 | [2, 4) | [0.33, 0.66)
2 | [4, 8) | [0.66, 1.32)
3 | [8, 16) | [1.32, 2.64)
4 | [16, 32) | [2.64, 5.28)
5 | [32, 64) | [5.28, 10.56)
6 | [64, 128) | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l lt L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

$$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \tag{40}$$

$$\sigma_{MSC\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l)\big)^2\right)^{1/2} \tag{41}$$

$$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \tag{42}$$

$$\sigma_{MSV\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l)\big)^2\right)^{1/2} \tag{43}$$

Thus the row-based modulation spectral feature vector of a music track is of size 4L

and can be represented as

$$\mathbf{f}_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), \mu_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), \ldots, \mu_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), \mu_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T \tag{44}$$

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

$$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \tag{45}$$

$$\sigma_{MSC\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\big(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j)\big)^2\right)^{1/2} \tag{46}$$

$$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \tag{47}$$

$$\sigma_{MSV\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\big(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j)\big)^2\right)^{1/2} \tag{48}$$

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

$$\mathbf{f}_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), \mu_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), \ldots, \mu_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), \mu_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T \tag{49}$$

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

$$\mathbf{f}^{MFCC} = [(\mathbf{f}_{row}^{MFCC})^T, (\mathbf{f}_{col}^{MFCC})^T]^T \tag{50}$$

In summary, the row-based MSC and MSV statistics are of size 4L = 4×20 = 80 and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
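The row-based and column-based aggregation of Eqs. (40)-(50) can be sketched in a few lines. This is a minimal sketch assuming NumPy and the J×L matrix orientation used in the equations above; the function name and the ordering of the concatenated parts (row statistics first, grouped by matrix) are illustrative.

```python
import numpy as np

def aggregate(msc, msv):
    """Sketch of Eqs. (40)-(50): mean and standard deviation along each row
    and each column of the J x L MSC and MSV matrices, concatenated into one
    feature vector of length 4L + 4J."""
    parts = []
    for m in (msc, msv):
        # Row-based statistics: per feature value, over the J subbands.
        parts += [m.mean(axis=0), m.std(axis=0)]
    for m in (msc, msv):
        # Column-based statistics: per modulation subband, over the L features.
        parts += [m.mean(axis=1), m.std(axis=1)]
    return np.concatenate(parts)
```

With L = 20 MFCCs and J = 8 modulation subbands this yields the 4×20 + 4×8 = 112-dimensional SMMFCC vector described above.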

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 le d lt D) row of

the MSC and MSV matrices of MOSC can be computed as follows

$$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \tag{51}$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d)\big)^2\right)^{1/2} \tag{52}$$

$$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \tag{53}$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d)\big)^2\right)^{1/2} \tag{54}$$


Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), \mu_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), \ldots, \mu_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), \mu_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T \tag{55}$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \tag{56}$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j)\big)^2\right)^{1/2} \tag{57}$$

$$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \tag{58}$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j)\big)^2\right)^{1/2} \tag{59}$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), \mu_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), \ldots, \mu_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), \mu_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T \tag{60}$$

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{OSC} = [(\mathbf{f}_{row}^{OSC})^T, (\mathbf{f}_{col}^{OSC})^T]^T \tag{61}$$

In summary, the row-based MSC and MSV statistics are of size 4D = 4×20 = 80 and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \tag{62}$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d)\big)^2\right)^{1/2} \tag{63}$$

$$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \tag{64}$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d)\big)^2\right)^{1/2} \tag{65}$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), \mu_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), \ldots, \mu_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), \mu_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T \tag{66}$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \tag{67}$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j)\big)^2\right)^{1/2} \tag{68}$$

$$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \tag{69}$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j)\big)^2\right)^{1/2} \tag{70}$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = [\mu_{MSC\text{-}col}^{NASE}(0), \sigma_{MSC\text{-}col}^{NASE}(0), \mu_{MSV\text{-}col}^{NASE}(0), \sigma_{MSV\text{-}col}^{NASE}(0), \ldots, \mu_{MSC\text{-}col}^{NASE}(J-1), \sigma_{MSC\text{-}col}^{NASE}(J-1), \mu_{MSV\text{-}col}^{NASE}(J-1), \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T \tag{71}$$

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{NASE} = [(\mathbf{f}_{row}^{NASE})^T, (\mathbf{f}_{col}^{NASE})^T]^T \tag{72}$$

In summary, the row-based MSC and MSV statistics are of size 4D = 4×19 = 76 and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig 28 The row-based modulation spectral features: for each feature dimension, the mean and standard deviation ($\mu_{row}$, $\sigma_{row}$) of the MSC and MSV values are computed across the modulation-frequency subbands

Fig 29 The column-based modulation spectral features: for each modulation subband, the mean and standard deviation ($\mu_{col}$, $\sigma_{col}$) of the MSC and MSV values are computed across the feature dimensions


216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

$$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{f}_{c,n} \tag{73}$$

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{\mathbf{f}}_c$:

$$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \qquad 1 \le c \le C \tag{74}$$

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{max}(m) = \max_{1\le c\le C,\; 1\le j\le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1\le c\le C,\; 1\le j\le N_c} f_{c,j}(m) \tag{75}$$

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
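The min-max normalization of Eqs. (74)-(75) can be sketched as follows. This is a minimal sketch assuming NumPy; the function name, array layout, and the guard against constant features (not discussed in the thesis) are illustrative.

```python
import numpy as np

def linear_normalize(train_features):
    """Sketch of Eqs. (74)-(75): normalize each feature value with the
    extremes observed over all training vectors.
    train_features: (num_tracks, num_features) array."""
    f_min = train_features.min(axis=0)     # Eq. (75): per-feature minimum
    f_max = train_features.max(axis=0)     # Eq. (75): per-feature maximum
    # Guard against zero-range (constant) features; an assumption, not from the thesis.
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    normalized = (train_features - f_min) / span   # Eq. (74)
    return normalized, f_min, f_max
```

The returned f_min and f_max would also be applied to test vectors so that training and testing share the same scaling.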

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

$$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T \tag{76}$$

where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T \tag{77}$$

where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(\mathbf{A}) = \mathrm{tr}\big((\mathbf{A}^T \mathbf{S}_W \mathbf{A})^{-1}(\mathbf{A}^T \mathbf{S}_B \mathbf{A})\big) \tag{78}$$

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $\mathbf{S}_W$ are calculated. Let $\mathbf{\Phi}$ denote the matrix whose columns are the orthonormal eigenvectors of $\mathbf{S}_W$ and $\mathbf{\Lambda}$ the diagonal matrix formed by the corresponding eigenvalues; thus $\mathbf{S}_W\mathbf{\Phi} = \mathbf{\Phi\Lambda}$. Each training vector $\mathbf{x}$ is then whitening transformed by $\mathbf{\Phi\Lambda}^{-1/2}$:

$$\mathbf{x}_w = (\mathbf{\Phi\Lambda}^{-1/2})^T \mathbf{x} \tag{79}$$

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_W^w = (\mathbf{\Phi\Lambda}^{-1/2})^T \mathbf{S}_W (\mathbf{\Phi\Lambda}^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix $\mathbf{I}$. Thus the whitened between-class scatter matrix $\mathbf{S}_B^w = (\mathbf{\Phi\Lambda}^{-1/2})^T \mathbf{S}_B (\mathbf{\Phi\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix $\mathbf{\Psi}$ can be determined by finding the eigenvectors of $\mathbf{S}_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix $\mathbf{\Psi}$. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$$\mathbf{A}_{WLDA} = \mathbf{\Phi\Lambda}^{-1/2}\mathbf{\Psi} \tag{80}$$

$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let $\mathbf{x}$ denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^T \mathbf{x} \tag{81}$$
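The whitened LDA procedure of Eqs. (76)-(80) can be sketched in Python. This is a minimal sketch assuming NumPy, a nonsingular within-class scatter matrix, and integer class labels; the function name and data layout are illustrative.

```python
import numpy as np

def whitened_lda(X, y):
    """Sketch of Eqs. (76)-(80): returns the H x (C-1) matrix A_WLDA.
    X: (N, H) training vectors; y: length-N array of class labels."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        Sw += d.T @ d                                # Eq. (76): within-class scatter
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)              # Eq. (77): between-class scatter
    # Eq. (79): whitening transform Phi Lambda^{-1/2} from the eigendecomposition of Sw.
    lam, Phi = np.linalg.eigh(Sw)
    white = Phi @ np.diag(lam ** -0.5)
    # Whitened between-class scatter; its top C-1 eigenvectors form Psi.
    Sb_w = white.T @ Sb @ white
    lam_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(lam_b)[::-1][:len(classes) - 1]
    return white @ Psi[:, order]                     # Eq. (80): A_WLDA
```

Projecting the data with the returned matrix (Eq. (81)) then gives the reduced (C−1)-dimensional vectors used by the classifier.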

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let $\mathbf{y}$ denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

$$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{y}_{c,n} \tag{82}$$

where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to $\mathbf{y}$:

$$s = \arg\min_{1\le c\le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \tag{83}$$
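The nearest centroid rule of Eqs. (82)-(83) can be sketched as follows. This is a minimal sketch assuming NumPy; the function name and the batched test interface are illustrative.

```python
import numpy as np

def nearest_centroid(train_X, train_y, test_X):
    """Sketch of Eqs. (82)-(83): per-class centroids in the transformed
    feature space, then minimum-Euclidean-distance assignment."""
    classes = np.unique(train_y)
    # Eq. (82): the centroid of each class's training vectors.
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    # Eq. (83): assign each test vector to the class with the nearest centroid.
    dists = np.linalg.norm(test_X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```

In the thesis pipeline, train_X and test_X would be the whitened-LDA-transformed, linearly normalized feature vectors.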

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1\le c\le C} P_c \cdot CA_c \tag{84}$$

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
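Eq. (84) amounts to weighting each per-class accuracy by that class's share of the test set, as the short sketch below shows (the function name is illustrative).

```python
def overall_accuracy(per_class_accuracy, class_counts):
    """Sketch of Eq. (84): overall accuracy CA as the sum of per-class
    accuracies CA_c weighted by each class's probability of appearance P_c."""
    total = sum(class_counts)
    return sum(acc * n / total for acc, n in zip(per_class_accuracy, class_counts))
```

For example, with the 729 test tracks of this database, a class with 320 tracks contributes with weight 320/729 ≈ 0.44.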

31 Comparison of row-based modulation spectral feature vectors

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64

SMMFCC1+SMOSC1+SMASE1 8464


Table 32 Confusion matrices of row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Columns are the true genres and rows the classified genres; for each matrix, track counts are given first and column-wise percentages second.

(a) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 275 | 0 | 2 | 0 | 1 | 19
Electronic | 0 | 91 | 0 | 1 | 7 | 6
Jazz | 6 | 0 | 18 | 0 | 0 | 4
Metal/Punk | 2 | 3 | 0 | 36 | 20 | 4
Pop/Rock | 4 | 12 | 5 | 8 | 70 | 14
World | 33 | 8 | 1 | 0 | 4 | 75
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 85.94 | 0.00 | 7.69 | 0.00 | 0.98 | 15.57
Electronic | 0.00 | 79.82 | 0.00 | 2.22 | 6.86 | 4.92
Jazz | 1.88 | 0.00 | 69.23 | 0.00 | 0.00 | 3.28
Metal/Punk | 0.63 | 2.63 | 0.00 | 80.00 | 19.61 | 3.28
Pop/Rock | 1.25 | 10.53 | 19.23 | 17.78 | 68.63 | 11.48
World | 10.31 | 7.02 | 3.85 | 0.00 | 3.92 | 61.48

(b) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 292 | 1 | 1 | 0 | 2 | 10
Electronic | 1 | 89 | 1 | 2 | 11 | 11
Jazz | 4 | 0 | 19 | 1 | 1 | 6
Metal/Punk | 0 | 5 | 0 | 32 | 21 | 3
Pop/Rock | 0 | 13 | 3 | 10 | 61 | 8
World | 23 | 6 | 2 | 0 | 6 | 84
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 91.25 | 0.88 | 3.85 | 0.00 | 1.96 | 8.20
Electronic | 0.31 | 78.07 | 3.85 | 4.44 | 10.78 | 9.02
Jazz | 1.25 | 0.00 | 73.08 | 2.22 | 0.98 | 4.92
Metal/Punk | 0.00 | 4.39 | 0.00 | 71.11 | 20.59 | 2.46
Pop/Rock | 0.00 | 11.40 | 11.54 | 22.22 | 59.80 | 6.56
World | 7.19 | 5.26 | 7.69 | 0.00 | 5.88 | 68.85

(c) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 286 | 3 | 1 | 0 | 3 | 18
Electronic | 0 | 87 | 1 | 1 | 9 | 5
Jazz | 5 | 4 | 17 | 0 | 0 | 9
Metal/Punk | 0 | 4 | 1 | 36 | 18 | 4
Pop/Rock | 1 | 10 | 3 | 7 | 68 | 13
World | 28 | 6 | 3 | 1 | 4 | 73
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 89.38 | 2.63 | 3.85 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 76.32 | 3.85 | 2.22 | 8.82 | 4.10
Jazz | 1.56 | 3.51 | 65.38 | 0.00 | 0.00 | 7.38
Metal/Punk | 0.00 | 3.51 | 3.85 | 80.00 | 17.65 | 3.28
Pop/Rock | 0.31 | 8.77 | 11.54 | 15.56 | 66.67 | 10.66
World | 8.75 | 5.26 | 11.54 | 2.22 | 3.92 | 59.84

(d) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 0 | 1 | 0 | 0 | 9
Electronic | 0 | 96 | 1 | 1 | 9 | 9
Jazz | 2 | 1 | 21 | 0 | 0 | 1
Metal/Punk | 0 | 1 | 0 | 34 | 8 | 1
Pop/Rock | 1 | 9 | 2 | 9 | 80 | 16
World | 17 | 7 | 1 | 1 | 5 | 86
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 0.00 | 3.85 | 0.00 | 0.00 | 7.38
Electronic | 0.00 | 84.21 | 3.85 | 2.22 | 8.82 | 7.38
Jazz | 0.63 | 0.88 | 80.77 | 0.00 | 0.00 | 0.82
Metal/Punk | 0.00 | 0.88 | 0.00 | 75.56 | 7.84 | 0.82
Pop/Rock | 0.31 | 7.89 | 7.69 | 20.00 | 78.43 | 13.11
World | 5.31 | 6.14 | 3.85 | 2.22 | 4.90 | 70.49


32 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based results. As in the row-based case, the combined feature vector gets the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60

Table 34 Confusion matrices of column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns are the true genres and rows the classified genres; for each matrix, track counts are given first and column-wise percentages second.

(a) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 272 | 1 | 1 | 0 | 6 | 22
Electronic | 0 | 84 | 0 | 2 | 8 | 4
Jazz | 13 | 1 | 19 | 1 | 2 | 19
Metal/Punk | 2 | 7 | 0 | 39 | 30 | 4
Pop/Rock | 0 | 11 | 3 | 3 | 47 | 19
World | 33 | 10 | 3 | 0 | 9 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 85.00 | 0.88 | 3.85 | 0.00 | 5.88 | 18.03
Electronic | 0.00 | 73.68 | 0.00 | 4.44 | 7.84 | 3.28
Jazz | 4.06 | 0.88 | 73.08 | 2.22 | 1.96 | 15.57
Metal/Punk | 0.63 | 6.14 | 0.00 | 86.67 | 29.41 | 3.28
Pop/Rock | 0.00 | 9.65 | 11.54 | 6.67 | 46.08 | 15.57
World | 10.31 | 8.77 | 11.54 | 0.00 | 8.82 | 44.26

(b) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 262 | 2 | 0 | 0 | 3 | 33
Electronic | 0 | 83 | 0 | 1 | 9 | 6
Jazz | 17 | 1 | 20 | 0 | 6 | 20
Metal/Punk | 1 | 5 | 0 | 33 | 21 | 2
Pop/Rock | 0 | 17 | 4 | 10 | 51 | 10
World | 40 | 6 | 2 | 1 | 12 | 51
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 81.88 | 1.75 | 0.00 | 0.00 | 2.94 | 27.05
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 8.82 | 4.92
Jazz | 5.31 | 0.88 | 76.92 | 0.00 | 5.88 | 16.39
Metal/Punk | 0.31 | 4.39 | 0.00 | 73.33 | 20.59 | 1.64
Pop/Rock | 0.00 | 14.91 | 15.38 | 22.22 | 50.00 | 8.20
World | 12.50 | 5.26 | 7.69 | 2.22 | 11.76 | 41.80

(c) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 277 | 0 | 0 | 0 | 2 | 29
Electronic | 0 | 83 | 0 | 1 | 5 | 2
Jazz | 9 | 3 | 17 | 1 | 2 | 15
Metal/Punk | 1 | 5 | 1 | 35 | 24 | 7
Pop/Rock | 2 | 13 | 1 | 8 | 57 | 15
World | 31 | 10 | 7 | 0 | 12 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 86.56 | 0.00 | 0.00 | 0.00 | 1.96 | 23.77
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 4.90 | 1.64
Jazz | 2.81 | 2.63 | 65.38 | 2.22 | 1.96 | 12.30
Metal/Punk | 0.31 | 4.39 | 3.85 | 77.78 | 23.53 | 5.74
Pop/Rock | 0.63 | 11.40 | 3.85 | 17.78 | 55.88 | 12.30
World | 9.69 | 8.77 | 26.92 | 0.00 | 11.76 | 44.26

(d) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 289 | 5 | 0 | 0 | 3 | 18
Electronic | 0 | 89 | 0 | 2 | 4 | 4
Jazz | 2 | 3 | 19 | 0 | 1 | 10
Metal/Punk | 2 | 2 | 0 | 38 | 21 | 2
Pop/Rock | 0 | 12 | 5 | 4 | 61 | 11
World | 27 | 3 | 2 | 1 | 12 | 77
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 90.31 | 4.39 | 0.00 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 78.07 | 0.00 | 4.44 | 3.92 | 3.28
Jazz | 0.63 | 2.63 | 73.08 | 0.00 | 0.98 | 8.20
Metal/Punk | 0.63 | 1.75 | 0.00 | 84.44 | 20.59 | 1.64
Pop/Rock | 0.00 | 10.53 | 19.23 | 8.89 | 59.80 | 9.02
World | 8.44 | 2.63 | 7.69 | 2.22 | 11.76 | 63.11

33 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns are the true genres and rows the classified genres; for each matrix, track counts are given first and column-wise percentages second.

(a) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 2 | 1 | 0 | 3 | 19
Electronic | 0 | 86 | 0 | 1 | 7 | 5
Jazz | 2 | 0 | 18 | 0 | 0 | 3
Metal/Punk | 1 | 4 | 0 | 35 | 18 | 2
Pop/Rock | 1 | 16 | 4 | 8 | 67 | 13
World | 16 | 6 | 3 | 1 | 7 | 80
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 1.75 | 3.85 | 0.00 | 2.94 | 15.57
Electronic | 0.00 | 75.44 | 0.00 | 2.22 | 6.86 | 4.10
Jazz | 0.63 | 0.00 | 69.23 | 0.00 | 0.00 | 2.46
Metal/Punk | 0.31 | 3.51 | 0.00 | 77.78 | 17.65 | 1.64
Pop/Rock | 0.31 | 14.04 | 15.38 | 17.78 | 65.69 | 10.66
World | 5.00 | 5.26 | 11.54 | 2.22 | 6.86 | 65.57

(b) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 0 | 0 | 0 | 1 | 13
Electronic | 0 | 90 | 1 | 2 | 9 | 6
Jazz | 0 | 0 | 21 | 0 | 0 | 4
Metal/Punk | 0 | 2 | 0 | 31 | 21 | 2
Pop/Rock | 0 | 11 | 3 | 10 | 64 | 10
World | 20 | 11 | 1 | 2 | 7 | 87
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 0.00 | 0.00 | 0.00 | 0.98 | 10.66
Electronic | 0.00 | 78.95 | 3.85 | 4.44 | 8.82 | 4.92
Jazz | 0.00 | 0.00 | 80.77 | 0.00 | 0.00 | 3.28
Metal/Punk | 0.00 | 1.75 | 0.00 | 68.89 | 20.59 | 1.64
Pop/Rock | 0.00 | 9.65 | 11.54 | 22.22 | 62.75 | 8.20
World | 6.25 | 9.65 | 3.85 | 4.44 | 6.86 | 71.31

(c) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 296 | 2 | 1 | 0 | 0 | 17
Electronic | 1 | 91 | 0 | 1 | 4 | 3
Jazz | 0 | 2 | 19 | 0 | 0 | 5
Metal/Punk | 0 | 2 | 1 | 34 | 20 | 8
Pop/Rock | 2 | 13 | 4 | 8 | 71 | 8
World | 21 | 4 | 1 | 2 | 7 | 81
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 92.50 | 1.75 | 3.85 | 0.00 | 0.00 | 13.93
Electronic | 0.31 | 79.82 | 0.00 | 2.22 | 3.92 | 2.46
Jazz | 0.00 | 1.75 | 73.08 | 0.00 | 0.00 | 4.10
Metal/Punk | 0.00 | 1.75 | 3.85 | 75.56 | 19.61 | 6.56
Pop/Rock | 0.63 | 11.40 | 15.38 | 17.78 | 69.61 | 6.56
World | 6.56 | 3.51 | 3.85 | 4.44 | 6.86 | 66.39

(d) counts | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 2 | 0 | 0 | 0 | 8
Electronic | 2 | 95 | 0 | 2 | 7 | 9
Jazz | 1 | 1 | 20 | 0 | 0 | 0
Metal/Punk | 0 | 0 | 0 | 35 | 10 | 1
Pop/Rock | 1 | 10 | 3 | 7 | 79 | 11
World | 16 | 6 | 3 | 1 | 6 | 93
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) % | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 1.75 | 0.00 | 0.00 | 0.00 | 6.56
Electronic | 0.63 | 83.33 | 0.00 | 4.44 | 6.86 | 7.38
Jazz | 0.31 | 0.88 | 76.92 | 0.00 | 0.00 | 0.00
Metal/Punk | 0.00 | 0.00 | 0.00 | 77.78 | 9.80 | 0.82
Pop/Rock | 0.31 | 8.77 | 11.54 | 15.56 | 77.45 | 9.02
World | 5.00 | 5.26 | 11.54 | 2.22 | 5.88 | 76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 compares the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) using MSC & MSV features versus modulation subband energy (MSE) features

Feature Set                  MSCs & MSVs     MSE
SMMFCC1                          77.50      72.02
SMMFCC2                          70.64      69.82
SMMFCC3                          80.38      79.15
SMOSC1                           79.15      77.50
SMOSC2                           68.59      70.51
SMOSC3                           81.34      80.11
SMASE1                           77.78      76.41
SMASE2                           71.74      71.06
SMASE3                           81.21      79.15
SMMFCC1+SMOSC1+SMASE1            84.64      85.08
SMMFCC2+SMOSC2+SMASE2            78.60      79.01
SMMFCC3+SMOSC3+SMASE3            85.32      85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proc. IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proc. Int. Conf. on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Trans. on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. 6th Int. Conf. on Digital Audio Effects, pp. 8-11, Sep. 2003.
[14] J. G. A. Barbedo, A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, Jun. 2006.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, Mar. 2005.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, Nov. 2000.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, Sep. 1997.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Trans. on Signal Processing, vol. 52, no. 10, pp. 3023-3035, Oct. 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), pp. 1085-1088, Jul. 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proc. Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


chord, key, or other high-level features have to be determined in advance.

1.2.1.2 Long-term Features

To find a representative feature vector for a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, the autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.

1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most commonly used method to integrate short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d < D$$

$$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d]-\mu[d]\right)^2\right]^{1/2}, \quad 0 \le d < D$$

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
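In NumPy, this mean/standard-deviation integration amounts to one reduction per statistic. The sketch below is generic illustration, not code from this thesis:

```python
import numpy as np

def aggregate_mean_std(frames):
    """Integrate short-term feature vectors into one long-term vector.

    frames: array of shape (T, D) -- T frames, D-dimensional features.
    Returns the concatenated per-dimension mean and standard deviation,
    matching mu[d] = (1/T) * sum_i x_i[d] and the analogous sigma[d].
    """
    mu = frames.mean(axis=0)            # mu[d], one value per dimension
    sigma = frames.std(axis=0)          # sigma[d], population std (1/T)
    return np.concatenate([mu, sigma])  # 2D-dimensional long-term feature

# Example: 100 frames of 20-dimensional MFCC-like features
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))
v = aggregate_mean_std(x)
print(v.shape)  # (40,)
```

Note that the resulting vector doubles the short-term dimension, which is exactly why this integration discards all temporal ordering of the frames.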

1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used an AR model to analyze the time-varying texture of music signals. They proposed the diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analyses to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the coefficient dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
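The DAR idea can be sketched with a per-dimension least-squares fit. This is an illustration under simplifying assumptions, not the exact estimator of Meng et al. [9], and the helper name `dar_features` is hypothetical:

```python
import numpy as np

def dar_features(frames, p=3):
    """Diagonal AR (DAR) sketch: model each feature trajectory with an
    independent order-p AR model (coefficients fit by least squares) and
    append the mean and variance of the short-term vectors."""
    T, D = frames.shape
    coeffs = np.empty((D, p))
    for d in range(D):
        x = frames[:, d]
        # Regress x[t] on its p previous samples x[t-1], ..., x[t-p]
        A = np.column_stack([x[p - k - 1: T - k - 1] for k in range(p)])
        coeffs[d] = np.linalg.lstsq(A, x[p:], rcond=None)[0]
    # Long-term vector: mean (D) + variance (D) + AR coefficients (D*p)
    return np.concatenate([frames.mean(axis=0), frames.var(axis=0),
                           coeffs.ravel()])
```

A MAR variant would instead regress the full D-dimensional vector on its p predecessors, giving the p × D × D coefficient count mentioned above.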

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency most sensitive to human audition is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear Discriminant Analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes. The optimal transformation matrix can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not consider the class-wise differences.
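The scatter-matrix formulation of LDA described above can be sketched as follows. This is a generic implementation via the eigenvectors of Sw^-1 Sb, with a pseudo-inverse guarding against a singular within-class scatter; it is not the exact procedure detailed later in Chapter 2:

```python
import numpy as np

def lda_transform(X, y, d):
    """Map n-dimensional rows of X to d dimensions (d <= n) by maximizing
    between-class scatter relative to within-class scatter."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    n = X.shape[1]
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)        # within-class scatter
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)      # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)[:d]       # keep d largest eigenvalues
    W = vecs[:, order].real                  # n x d transformation matrix
    return X @ W, W
```

For C classes, at most C - 1 eigenvalues are nonzero, which bounds the useful output dimension d.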

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system, in which a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance with/without a decision tree classifier for a Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and decision tree classifier has the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bağci and Erzin [8] constructed a novel frame-based music genre classification system, in which some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which cannot be correctly classified, and the GMM model of each music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-Hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from the high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then, two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. WPT is a variant of DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced. In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, a conclusion will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

$$\hat{s}[n] = s[n] - a \times s[n-1] \quad (1)$$

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames is overlapped by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

$$\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n < N \quad (2)$$

where the Hamming window function w[n] is defined as

$$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n < N \quad (3)$$

Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j2\pi nk/N}, \quad 0 \le k < N \quad (4)$$

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (5)$$

where B is the total number of filters (B is 25 in this study), and I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b_l} and I_{b_h} are given as

$$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \quad I_{b_h} = \frac{f_{b_h}}{f_s/N} \quad (6)$$

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT on the logarithm of E(b):

$$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\left(1 + E_i(b)\right)\cos\!\left(\frac{\pi l (b+0.5)}{B}\right), \quad 0 \le l < L \quad (7)$$

where L is the length of the MFCC feature vector (L is 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

$$\mathbf{x}^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T \quad (8)$$

Fig. 2.1 The flowchart for computing MFCC: Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC


Table 2.1 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
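Steps 1 through 6 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the Mel band edges are regenerated from the standard Mel formula rather than copied from Table 2.1, and the frame parameters (fs = 22050 Hz, N = 512, M = 256) are assumed values:

```python
import numpy as np

def mfcc(signal, fs=22050, N=512, M=256, B=25, L=20, a=0.95):
    """Sketch of the MFCC pipeline: pre-emphasis, framing, Hamming
    window, FFT, Mel-scale band energies, and DCT-style projection."""
    s = np.append(signal[0], signal[1:] - a * signal[:-1])  # Eq. (1)
    hop = N - M
    nframes = 1 + (len(s) - N) // hop
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Eq. (3)
    # Illustrative Mel-spaced band edges between 0 Hz and fs/2
    mel = lambda f: 2595.0 * np.log10(1 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1)
    edges = imel(np.linspace(0, mel(fs / 2), B + 2))
    idx = np.floor(edges / (fs / N)).astype(int)            # Eq. (6) style
    feats = np.empty((nframes, L))
    for i in range(nframes):
        frame = s[i * hop:i * hop + N] * w                  # Eq. (2)
        A = np.abs(np.fft.fft(frame)) ** 2                  # squared magnitude
        # Band b spans two edge steps, mimicking overlapping triangular bands
        E = np.array([A[idx[b]:idx[b + 2] + 1].sum() for b in range(B)])
        logE = np.log10(1 + E)
        l = np.arange(L)[:, None]
        b = np.arange(B)[None, :]
        feats[i] = (logE * np.cos(np.pi * l * (b + 0.5) / B)).sum(axis=1)  # Eq. (7)
    return feats
```

For a one-second 440 Hz sine at fs = 22050, this yields an (85, 20) matrix of frame-level MFCC-like coefficients.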

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (9)$$

where B is the number of subbands, and I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b_l} and I_{b_h} are given as

$$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \quad I_{b_h} = \frac{f_{b_h}}{f_s/N} \quad (10)$$

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, ..., M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ ... ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$$Peak(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right) \quad (11)$$

$$Valley(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right) \quad (12)$$

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$$SC(b) = Peak(b) - Valley(b) \quad (13)$$

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$$\mathbf{x}^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T \quad (14)$$

Fig. 2.2 The flowchart for computing OSC: Input Signal → Framing → FFT → Octave-scale filtering → Peak/Valley Selection → Spectral Contrast → OSC


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
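The peak/valley selection of Step 3 can be sketched for a single frame as follows. This is a simplified illustration: the degenerate [0, 0] band of Table 2.2 is merged into the first octave, and a small constant guards the logarithm:

```python
import numpy as np

# Octave-scale band edges in Hz, following Table 2.2 (fs = 44.1 kHz);
# the DC-only band [0, 0] is folded into the first interval here.
OCT_EDGES = [0, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 22050]

def osc_frame(frame, fs=44100, alpha=0.2):
    """OSC sketch for one frame: per octave subband, sort the magnitude
    spectrum, average the top/bottom alpha fraction, and take logs to get
    Peak(b), Valley(b), and their contrast SC(b) as in Eqs. (11)-(13)."""
    N = len(frame)
    mag = np.abs(np.fft.fft(frame))[: N // 2 + 1]
    freqs = np.arange(N // 2 + 1) * fs / N
    valleys, contrasts = [], []
    for lo, hi in zip(OCT_EDGES[:-1], OCT_EDGES[1:]):
        band = np.sort(mag[(freqs > lo) & (freqs <= hi)])[::-1]  # descending
        nb = max(1, int(np.ceil(alpha * len(band))))
        peak = np.log(1e-12 + band[:nb].mean())     # strongest alpha*Nb bins
        valley = np.log(1e-12 + band[-nb:].mean())  # weakest alpha*Nb bins
        valleys.append(valley)
        contrasts.append(peak - valley)
    return np.array(valleys + contrasts)            # Eq. (14) layout
```

Because each band is sorted before averaging, every contrast value is nonnegative by construction.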

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames. Each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$$P(k) = \begin{cases} \dfrac{1}{E_w N}\,|X(k)|^2, & k = 0,\; k = N/2 \\ \dfrac{2}{E_w N}\,|X(k)|^2, & 0 < k < N/2 \end{cases} \quad (15)$$

where E_w is the energy of the Hamming window function w(n) of size N_w:

$$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2 \quad (16)$$

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE-scale filtering operation can be described as follows (see Table 2.3):

$$ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (17)$$

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$$r = 2^j \text{ octaves}, \quad -4 \le j \le 3 \quad (18)$$

I_{b_l} and I_{b_h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \quad I_{b_h} = \frac{f_{b_h}}{f_s/N} \quad (19)$$

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$$ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1 \quad (20)$$

Each ASE coefficient is then converted to the decibel scale:

$$ASE_{dB}(b) = 10\log_{10}(ASE(b)), \quad 0 \le b \le B+1 \quad (21)$$

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 \quad (22)$$

where the RMS-norm gain value R is defined as

$$R = \sqrt{\sum_{b=0}^{B+1} \left(ASE_{dB}(b)\right)^2} \quad (23)$$

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B + 3. Thus, the NASE feature vector of an audio frame can be represented as follows:

$$\mathbf{x}^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T \quad (24)$$


Fig. 2.3 The flowchart for computing NASE: Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: 16 logarithmically spaced in-band coefficients between loEdge = 62.5 Hz and hiEdge = 16 kHz (band edges at 62.5, 88.4, 125, 176.8, 250, 353.6, 500, 707.1, 1000, 1414.2, 2000, 2828.4, 4000, 5656.9, 8000, 11313.7, and 16000 Hz), plus one coefficient below loEdge and one above hiEdge


Table 2.3 The range of each Normalized Audio Spectral Envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
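A minimal sketch of the three NASE steps is given below, assuming fs = 44.1 kHz and an MPEG-7-style power normalization; the band edges are regenerated from loEdge = 62.5 Hz with r = 1/2 rather than copied from Table 2.3:

```python
import numpy as np

def nase_frame(frame, fs=44100, lo=62.5, hi=16000.0, r=0.5):
    """NASE sketch for one frame: window-energy-normalized power spectrum,
    log-spaced subbands between loEdge and hiEdge (plus the two edge
    bands), dB conversion, then RMS-norm gain normalization."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                            # window energy, Eq. (16)
    X = np.fft.fft(frame * w)
    P = (np.abs(X[: N // 2 + 1]) ** 2) / (Ew * N)  # Eq. (15)-style scaling
    P[1:-1] *= 2.0                                 # fold in negative freqs
    freqs = np.arange(N // 2 + 1) * fs / N
    B = int(8 / r)                                 # 16 in-band coefficients
    edges = lo * 2.0 ** (r * np.arange(B + 1))     # log-spaced, lo .. hi
    bands = [(0.0, lo)] + list(zip(edges[:-1], edges[1:])) + [(hi, fs / 2)]
    ase = np.array([P[(freqs > a) & (freqs <= b)].sum() + 1e-12
                    for a, b in bands])            # Eq. (20), guarded
    ase_db = 10.0 * np.log10(ase)                  # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))               # RMS-norm gain, Eq. (23)
    return np.concatenate([[R], ase_db / R])       # Eq. (24) layout
```

The returned vector has 1 + (B + 2) = 19 components, matching the B + 3 feature dimension stated above for B = 16.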

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only the short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC; the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{(t \times W/2)+n}[l]\, e^{-j2\pi mn/W}, \quad 0 \le m < W,\; 0 \le l < L \quad (25)$$

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\; 0 \le l < L \quad (26)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \quad (27)$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \quad (28)$$

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \quad (29)$$

As a result, all MSCs (or MSVs) form an L × J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2 × 20 × 8 = 320.

Fig. 2.5 The flowchart for extracting MMFCC
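The texture-window analysis of Steps 2 and 3 can be sketched as follows. This is a generic NumPy sketch, not the thesis implementation; the modulation subband edges are generated as powers of two over the modulation bins rather than taken from Table 2.4:

```python
import numpy as np

def modulation_msc_msv(feats, W=512, J=8):
    """Modulation spectral analysis sketch: FFT each feature trajectory
    over 50%-overlapped texture windows of length W, average the
    magnitudes over windows (Eqs. (25)-(26)), then take the max (MSP) and
    min (MSV) in J log-spaced modulation subbands (Eqs. (27)-(29)).
    Returns (MSC, MSV), each of shape J x L."""
    T_frames, L = feats.shape
    hop = W // 2                                      # 50% overlap
    nwin = (T_frames - W) // hop + 1
    M = np.zeros((W // 2 + 1, L))
    for t in range(nwin):
        seg = feats[t * hop: t * hop + W]             # W x L trajectory slice
        M += np.abs(np.fft.rfft(seg, axis=0))         # per-coefficient FFT
    M /= nwin                                         # time-averaged magnitude
    # J logarithmically spaced modulation subbands over bins 1 .. W/2
    edges = np.unique(np.round(np.geomspace(1, W // 2, J + 1)).astype(int))
    msc = np.zeros((J, L))
    msv = np.zeros((J, L))
    for j in range(J):
        band = M[edges[j]: edges[j + 1] + 1]
        msv[j] = band.min(axis=0)                     # MSV: non-rhythmic floor
        msc[j] = band.max(axis=0) - msv[j]            # MSC = MSP - MSV
    return msc, msv
```

Flattening both J × L matrices and concatenating them reproduces the 2 × L × J feature count stated above (320 for L = 20, J = 8).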

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC; the detailed steps are described below.


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let be the d-th OSC of the i-th frame The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W

][dOSCi Dd ltle0

0 0 )()(1

0

2

)2( DdWmedOSCdmMW

n

mWnj

nWtt ltleltle= summinus

=

minus

+times

π (30)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the OSC coefficient index In the

study W is 512 which is about 6 seconds with 50 overlap between two

successive texture windows The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

0 0 )(1)(1

DdWmdmMT

dmMT

tt

OSC ltleltle= sum=

(31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated


MSP^OSC(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^OSC(m, d)    (32)

MSV^OSC(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^OSC(m, d)    (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d)    (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.
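Step 3 can be sketched as follows (hypothetical helper names; the dyadic subband edges reproduce the index ranges of Table 24 when J = 8):

```python
def subband_edges(J):
    """Dyadic modulation-subband index ranges [low, high), as in
    Table 24: [0, 2), [2, 4), [4, 8), ... (J bands in total)."""
    edges = [(0, 2)]
    lo = 2
    for _ in range(J - 1):
        edges.append((lo, 2 * lo))
        lo *= 2
    return edges

def msc_msv(avg_spec, J):
    """Per-subband modulation spectral peak (MSP), valley (MSV) and
    contrast (MSC = MSP - MSV) for one feature value's averaged
    modulation spectrum."""
    msp, msv = [], []
    for lo, hi in subband_edges(J):
        band = avg_spec[lo:hi]
        msp.append(max(band))
        msv.append(min(band))
    msc = [p - v for p, v in zip(msp, msv)]
    return msp, msv, msc

# With J = 8 the edges match Table 24: [0,2), [2,4), ..., [128,256)
print(subband_edges(8))
msp, msv, msc = msc_msv(list(range(256)), 8)
print(msc[7])  # 255 - 128 = 127
```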

Fig 26 the flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = | Σ_{n=0}^{W−1} NASE_{t·W/2+n}[d] · e^{−j2πmn/W} |,  0 ≤ m < W, 0 ≤ d < D    (35)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the NASE coefficient index In

the study W is 512 which is about 6 seconds with 50 overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

M̄^NASE(m, d) = (1/T) Σ_{t=1}^{T} M_t(m, d),  0 ≤ m < W, 0 ≤ d < D    (36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands (see Table 24).


In the study the number of modulation subbands is 8 (J = 8) The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

MSP^NASE(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^NASE(m, d)    (37)

MSV^NASE(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^NASE(m, d)    (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d)    (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.


Fig 27 the flowchart for extracting MASE

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l lt L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

μ_MSC-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSC^MFCC(j, l)    (40)

σ_MSC-row^MFCC(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^MFCC(j, l) − μ_MSC-row^MFCC(l) )² ]^{1/2}    (41)

μ_MSV-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSV^MFCC(j, l)    (42)

σ_MSV-row^MFCC(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^MFCC(j, l) − μ_MSV-row^MFCC(l) )² ]^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_row^MFCC = [ μ_MSC-row^MFCC(0), σ_MSC-row^MFCC(0), μ_MSV-row^MFCC(0), σ_MSV-row^MFCC(0), …, μ_MSC-row^MFCC(L−1), σ_MSC-row^MFCC(L−1), μ_MSV-row^MFCC(L−1), σ_MSV-row^MFCC(L−1) ]^T    (44)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

μ_MSC-col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSC^MFCC(j, l)    (45)

σ_MSC-col^MFCC(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSC^MFCC(j, l) − μ_MSC-col^MFCC(j) )² ]^{1/2}    (46)

μ_MSV-col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSV^MFCC(j, l)    (47)

σ_MSV-col^MFCC(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSV^MFCC(j, l) − μ_MSV-col^MFCC(j) )² ]^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^MFCC = [ μ_MSC-col^MFCC(0), σ_MSC-col^MFCC(0), μ_MSV-col^MFCC(0), σ_MSV-col^MFCC(0), …, μ_MSC-col^MFCC(J−1), σ_MSC-col^MFCC(J−1), μ_MSV-col^MFCC(J−1), σ_MSV-col^MFCC(J−1) ]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^MFCC = [ (f_row^MFCC)^T, (f_col^MFCC)^T ]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
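The row- and column-wise statistics above can be sketched in pure Python (an illustrative sketch with hypothetical names; the MSC and MSV matrices are given as L rows of J subband values each):

```python
import math

def mean_std(values):
    """Population mean and standard deviation of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def aggregate(msc, msv):
    """Concatenate row-wise and column-wise mean/std statistics of the
    MSC and MSV matrices (each L rows x J columns) into one vector of
    length 4L + 4J."""
    L, J = len(msc), len(msc[0])
    feat = []
    for l in range(L):                   # row-based statistics: 4L values
        for mat in (msc, msv):
            feat.extend(mean_std(mat[l]))
    for j in range(J):                   # column-based statistics: 4J values
        for mat in (msc, msv):
            feat.extend(mean_std([mat[l][j] for l in range(L)]))
    return feat

# With L = 20 MFCCs and J = 8 subbands: 4*20 + 4*8 = 112 values (SMMFCC)
msc = [[1.0] * 8 for _ in range(20)]
msv = [[0.5] * 8 for _ in range(20)]
feat = aggregate(msc, msv)
print(len(feat))  # 112
```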

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 le d lt D) row of

the MSC and MSV matrices of MOSC can be computed as follows

μ_MSC-row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSC^OSC(j, d)    (51)

σ_MSC-row^OSC(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^OSC(j, d) − μ_MSC-row^OSC(d) )² ]^{1/2}    (52)

μ_MSV-row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSV^OSC(j, d)    (53)

σ_MSV-row^OSC(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^OSC(j, d) − μ_MSV-row^OSC(d) )² ]^{1/2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_row^OSC = [ μ_MSC-row^OSC(0), σ_MSC-row^OSC(0), μ_MSV-row^OSC(0), σ_MSV-row^OSC(0), …, μ_MSC-row^OSC(D−1), σ_MSC-row^OSC(D−1), μ_MSV-row^OSC(D−1), σ_MSV-row^OSC(D−1) ]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

μ_MSC-col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSC^OSC(j, d)    (56)

σ_MSC-col^OSC(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSC^OSC(j, d) − μ_MSC-col^OSC(j) )² ]^{1/2}    (57)

μ_MSV-col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSV^OSC(j, d)    (58)

σ_MSV-col^OSC(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSV^OSC(j, d) − μ_MSV-col^OSC(j) )² ]^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^OSC = [ μ_MSC-col^OSC(0), σ_MSC-col^OSC(0), μ_MSV-col^OSC(0), σ_MSV-col^OSC(0), …, μ_MSC-col^OSC(J−1), σ_MSC-col^OSC(J−1), μ_MSV-col^OSC(J−1), σ_MSV-col^OSC(J−1) ]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^OSC = [ (f_row^OSC)^T, (f_col^OSC)^T ]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

μ_MSC-row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSC^NASE(j, d)    (62)

σ_MSC-row^NASE(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^NASE(j, d) − μ_MSC-row^NASE(d) )² ]^{1/2}    (63)

μ_MSV-row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSV^NASE(j, d)    (64)

σ_MSV-row^NASE(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^NASE(j, d) − μ_MSV-row^NASE(d) )² ]^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_row^NASE = [ μ_MSC-row^NASE(0), σ_MSC-row^NASE(0), μ_MSV-row^NASE(0), σ_MSV-row^NASE(0), …, μ_MSC-row^NASE(D−1), σ_MSC-row^NASE(D−1), μ_MSV-row^NASE(D−1), σ_MSV-row^NASE(D−1) ]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

μ_MSC-col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSC^NASE(j, d)    (67)

σ_MSC-col^NASE(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSC^NASE(j, d) − μ_MSC-col^NASE(j) )² ]^{1/2}    (68)

μ_MSV-col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSV^NASE(j, d)    (69)

σ_MSV-col^NASE(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSV^NASE(j, d) − μ_MSV-col^NASE(j) )² ]^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^NASE = [ μ_MSC-col^NASE(0), σ_MSC-col^NASE(0), μ_MSV-col^NASE(0), σ_MSV-col^NASE(0), …, μ_MSC-col^NASE(J−1), σ_MSC-col^NASE(J−1), μ_MSV-col^NASE(J−1), σ_MSV-col^NASE(J−1) ]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^NASE = [ (f_row^NASE)^T, (f_col^NASE)^T ]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


Fig 28 the row-based modulation spectral feature values

Fig 29 the column-based modulation spectral feature values

216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = ( f̄_c(m) − f_min(m) ) / ( f_max(m) − f_min(m) ),  1 ≤ c ≤ C    (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th normalized representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature value over all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m),
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T    (77)

where x̄ is the mean vector of all training vectors. The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) )    (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space. In this study a whitening procedure is integrated with LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the

orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{−1/2}:

w = (ΦΛ^{−1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ    (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x    (81)
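The scatter matrices of Eqs (76)-(77) can be sketched in pure Python (illustrative only; the eigendecompositions needed for the whitening step and for Ψ would be done with a linear-algebra routine and are not shown):

```python
def scatter_matrices(X, labels):
    """Within-class (S_W) and between-class (S_B) scatter matrices for
    feature vectors X grouped by their class labels."""
    H = len(X[0])
    mean_all = [sum(x[i] for x in X) / len(X) for i in range(H)]
    SW = [[0.0] * H for _ in range(H)]
    SB = [[0.0] * H for _ in range(H)]
    for c in sorted(set(labels)):
        Xc = [x for x, lab in zip(X, labels) if lab == c]
        mc = [sum(x[i] for x in Xc) / len(Xc) for i in range(H)]
        for x in Xc:                       # accumulate (x - mc)(x - mc)^T
            d = [x[i] - mc[i] for i in range(H)]
            for i in range(H):
                for j in range(H):
                    SW[i][j] += d[i] * d[j]
        db = [mc[i] - mean_all[i] for i in range(H)]
        for i in range(H):                 # accumulate N_c (mc - m)(mc - m)^T
            for j in range(H):
                SB[i][j] += len(Xc) * db[i] * db[j]
    return SW, SB

# Two classes that differ only along the second axis: all the
# discriminative scatter ends up in S_B, not S_W.
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]]
labels = [0, 0, 1, 1]
SW, SB = scatter_matrices(X, labels)
print(SW)  # [[4.0, 0.0], [0.0, 0.0]]
print(SB)  # [[0.0, 0.0], [0.0, 16.0]]
```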

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA

transformed feature vector In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}    (82)

where ycn denotes the whitened LDA transformed feature vector of the n-th music

track labeled as the c-th music genre cy is the representative feature vector of the

c-th music genre and Nc is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

s = argmin_{1 ≤ c ≤ C} d(y, ȳ_c)    (83)

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{c=1}^{C} P_c · CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
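Eq (84) is a count-weighted average of the per-class accuracies; a minimal sketch (hypothetical names, using two of the test-set class counts above as sample inputs):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Overall CA as the count-weighted average of per-class accuracies,
    since the classes are not equally distributed."""
    total = sum(class_counts.values())
    return sum((class_counts[c] / total) * per_class_acc[c]
               for c in per_class_acc)

acc = {"classical": 0.9375, "electronic": 0.8421}
counts = {"classical": 320, "electronic": 114}
print(overall_accuracy(acc, counts))
```

The weighting keeps a small class (here, 114 Electronic tracks) from counting as much as a large one (320 Classical tracks) in the reported figure.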

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64


Table 32 Confusion matrices of row-based modulation spectral feature vector
(a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1

(a) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        275        0         2       0         1       19
Electronic       0       91         0       1         7        6
Jazz             6        0        18       0         0        4
MetalPunk        2        3         0      36        20        4
PopRock          4       12         5       8        70       14
World           33        8         1       0         4       75
Total          320      114        26      45       102      122

(a) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       85.94      0.00     7.69     0.00      0.98    15.57
Electronic     0.00     79.82     0.00     2.22      6.86     4.92
Jazz           1.88      0.00    69.23     0.00      0.00     3.28
MetalPunk      0.63      2.63     0.00    80.00     19.61     3.28
PopRock        1.25     10.53    19.23    17.78     68.63    11.48
World         10.31      7.02     3.85     0.00      3.92    61.48

(b) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        292        1         1       0         2       10
Electronic       1       89         1       2        11       11
Jazz             4        0        19       1         1        6
MetalPunk        0        5         0      32        21        3
PopRock          0       13         3      10        61        8
World           23        6         2       0         6       84
Total          320      114        26      45       102      122

(b) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       91.25      0.88     3.85     0.00      1.96     8.20
Electronic     0.31     78.07     3.85     4.44     10.78     9.02
Jazz           1.25      0.00    73.08     2.22      0.98     4.92
MetalPunk      0.00      4.39     0.00    71.11     20.59     2.46
PopRock        0.00     11.40    11.54    22.22     59.80     6.56
World          7.19      5.26     7.69     0.00      5.88    68.85

(c) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        286        3         1       0         3       18
Electronic       0       87         1       1         9        5
Jazz             5        4        17       0         0        9
MetalPunk        0        4         1      36        18        4
PopRock          1       10         3       7        68       13
World           28        6         3       1         4       73
Total          320      114        26      45       102      122

(c) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       89.38      2.63     3.85     0.00      2.94    14.75
Electronic     0.00     76.32     3.85     2.22      8.82     4.10
Jazz           1.56      3.51    65.38     0.00      0.00     7.38
MetalPunk      0.00      3.51     3.85    80.00     17.65     3.28
PopRock        0.31      8.77    11.54    15.56     66.67    10.66
World          8.75      5.26    11.54     2.22      3.92    59.84

(d) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0         1       0         0        9
Electronic       0       96         1       1         9        9
Jazz             2        1        21       0         0        1
MetalPunk        0        1         0      34         8        1
PopRock          1        9         2       9        80       16
World           17        7         1       1         5       86
Total          320      114        26      45       102      122

(d) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      0.00     3.85     0.00      0.00     7.38
Electronic     0.00     84.21     3.85     2.22      8.82     7.38
Jazz           0.63      0.88    80.77     0.00      0.00     0.82
MetalPunk      0.00      0.88     0.00    75.56      7.84     0.82
PopRock        0.31      7.89     7.69    20.00     78.43    13.11
World          5.31      6.14     3.85     2.22      4.90    70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector gets the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1         1       0         6       22
Electronic       0       84         0       2         8        4
Jazz            13        1        19       1         2       19
MetalPunk        2        7         0      39        30        4
PopRock          0       11         3       3        47       19
World           33       10         3       0         9       54
Total          320      114        26      45       102      122

(a) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       85.00      0.88     3.85     0.00      5.88    18.03
Electronic     0.00     73.68     0.00     4.44      7.84     3.28
Jazz           4.06      0.88    73.08     2.22      1.96    15.57
MetalPunk      0.63      6.14     0.00    86.67     29.41     3.28
PopRock        0.00      9.65    11.54     6.67     46.08    15.57
World         10.31      8.77    11.54     0.00      8.82    44.26

(b) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2         0       0         3       33
Electronic       0       83         0       1         9        6
Jazz            17        1        20       0         6       20
MetalPunk        1        5         0      33        21        2
PopRock          0       17         4      10        51       10
World           40        6         2       1        12       51
Total          320      114        26      45       102      122

(b) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       81.88      1.75     0.00     0.00      2.94    27.05
Electronic     0.00     72.81     0.00     2.22      8.82     4.92
Jazz           5.31      0.88    76.92     0.00      5.88    16.39
MetalPunk      0.31      4.39     0.00    73.33     20.59     1.64
PopRock        0.00     14.91    15.38    22.22     50.00     8.20
World         12.50      5.26     7.69     2.22     11.76    41.80

(c) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0         0       0         2       29
Electronic       0       83         0       1         5        2
Jazz             9        3        17       1         2       15
MetalPunk        1        5         1      35        24        7
PopRock          2       13         1       8        57       15
World           31       10         7       0        12       54
Total          320      114        26      45       102      122

(c) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       86.56      0.00     0.00     0.00      1.96    23.77
Electronic     0.00     72.81     0.00     2.22      4.90     1.64
Jazz           2.81      2.63    65.38     2.22      1.96    12.30
MetalPunk      0.31      4.39     3.85    77.78     23.53     5.74
PopRock        0.63     11.40     3.85    17.78     55.88    12.30
World          9.69      8.77    26.92     0.00     11.76    44.26

(d) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5         0       0         3       18
Electronic       0       89         0       2         4        4
Jazz             2        3        19       0         1       10
MetalPunk        2        2         0      38        21        2
PopRock          0       12         5       4        61       11
World           27        3         2       1        12       77
Total          320      114        26      45       102      122

(d) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       90.31      4.39     0.00     0.00      2.94    14.75
Electronic     0.00     78.07     0.00     4.44      3.92     3.28
Jazz           0.63      2.63    73.08     0.00      0.98     8.20
MetalPunk      0.63      1.75     0.00    84.44     20.59     1.64
PopRock        0.00     10.53    19.23     8.89     59.80     9.02
World          8.44      2.63     7.69     2.22     11.76    63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector

(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2         1       0         3       19
Electronic       0       86         0       1         7        5
Jazz             2        0        18       0         0        3
MetalPunk        1        4         0      35        18        2
PopRock          1       16         4       8        67       13
World           16        6         3       1         7       80
Total          320      114        26      45       102      122

(a) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      1.75     3.85     0.00      2.94    15.57
Electronic     0.00     75.44     0.00     2.22      6.86     4.10
Jazz           0.63      0.00    69.23     0.00      0.00     2.46
MetalPunk      0.31      3.51     0.00    77.78     17.65     1.64
PopRock        0.31     14.04    15.38    17.78     65.69    10.66
World          5.00      5.26    11.54     2.22      6.86    65.57

(b) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0         0       0         1       13
Electronic       0       90         1       2         9        6
Jazz             0        0        21       0         0        4
MetalPunk        0        2         0      31        21        2
PopRock          0       11         3      10        64       10
World           20       11         1       2         7       87
Total          320      114        26      45       102      122

(b) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      0.00     0.00     0.00      0.98    10.66
Electronic     0.00     78.95     3.85     4.44      8.82     4.92
Jazz           0.00      0.00    80.77     0.00      0.00     3.28
MetalPunk      0.00      1.75     0.00    68.89     20.59     1.64
PopRock        0.00      9.65    11.54    22.22     62.75     8.20
World          6.25      9.65     3.85     4.44      6.86    71.31

(c) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2         1       0         0       17
Electronic       1       91         0       1         4        3
Jazz             0        2        19       0         0        5
MetalPunk        0        2         1      34        20        8
PopRock          2       13         4       8        71        8
World           21        4         1       2         7       81
Total          320      114        26      45       102      122

(c) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       92.50      1.75     3.85     0.00      0.00    13.93
Electronic     0.31     79.82     0.00     2.22      3.92     2.46
Jazz           0.00      1.75    73.08     0.00      0.00     4.10
MetalPunk      0.00      1.75     3.85    75.56     19.61     6.56
PopRock        0.63     11.40    15.38    17.78     69.61     6.56
World          6.56      3.51     3.85     4.44      6.86    66.39

(d) counts   Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2         0       0         0        8
Electronic       2       95         0       2         7        9
Jazz             1        1        20       0         0        0
MetalPunk        0        0         0      35        10        1
PopRock          1       10         3       7        79       11
World           16        6         3       1         6       93
Total          320      114        26      45       102      122

(d) in %     Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      1.75     0.00     0.00      0.00     6.56
Electronic     0.63     83.33     0.00     4.44      6.86     7.38
Jazz           0.31      0.88    76.92     0.00      0.00     0.00
MetalPunk      0.00      0.00     0.00    77.78      9.80     0.82
PopRock        0.31      8.77    11.54    15.56     77.45     9.02
World          5.00      5.26    11.54     2.22      5.88    76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 37 Comparison of the averaged classification accuracy (%) of the MSCs & MSVs and the energy (MSE) for each feature value

Feature Set                   MSCs & MSVs    MSE
SMMFCC1                          77.50      72.02
SMMFCC2                          70.64      69.82
SMMFCC3                          80.38      79.15
SMOSC1                           79.15      77.50
SMOSC2                           68.59      70.51
SMOSC3                           81.34      80.11
SMASE1                           77.78      76.41
SMASE2                           71.74      71.06
SMASE3                           81.21      79.15
SMMFCC1+SMOSC1+SMASE1            84.64      85.08
SMMFCC2+SMOSC2+SMASE2            78.60      79.01
SMMFCC3+SMOSC3+SMASE3            85.32      85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox, "Features and classifiers for the automatic classification of musical audio signals", Proceedings of International Conference on Music Information Retrieval, 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664


[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical

genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of

IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200

March 2005

[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo

Journal of new musical research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and

Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis

modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139



vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled by one MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector and the covariance matrix of all short-term feature vectors, as well as the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature vector.

1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition. It has been shown that the modulation frequency most sensitive to human audition is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification. They showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract a tempo feature for music emotion classification.

1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure, complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data. The means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.

1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space is determined, where d ≤ n. The transformation should enhance the separability among different classes. The optimal transformation matrix can then be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not consider the class-wise differences.
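As a concrete illustration of the transformation described above, the following sketch computes an LDA projection from the within-class and between-class scatter matrices. This is a minimal example for intuition, not the implementation used in this thesis, and it assumes the within-class scatter matrix is invertible (more samples than feature dimensions per class).

```python
import numpy as np

def lda_transform(X, y, d):
    """Minimal LDA sketch: find the n-by-d transform that maximizes
    between-class scatter relative to within-class scatter, then project.
    Assumes the within-class scatter matrix Sw is invertible."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n = X.shape[1]
    Sw = np.zeros((n, n))   # within-class scatter
    Sb = np.zeros((n, n))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Columns of the optimal transform are the leading eigenvectors of Sw^-1 Sb
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)[:d]
    W = vecs[:, order].real
    return X @ W
```

For C classes, Sb has rank at most C − 1, so at most C − 1 discriminant directions are meaningful.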

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their music classification system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. The experiment results show that a GMM with three components achieves the best classification accuracy.

West and Cox [4] constructed a hierarchical frame-based music genre classification system. In their classification system, a majority vote is taken to decide the final classification. The genres adopted in their music classification system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree classifier, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and decision tree classifier has the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] use some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bağcı and Erzin [8] constructed a novel frame-based music genre classification system. In their classification system, some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which are unable to be correctly classified, and the GMM model of each music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from the high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experiment results show that when the LDB feature vector is combined with MFCC and LDA analysis is applied, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile, human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. WPT is a variant of DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced. In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, conclusions will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module will be given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

ŝ[n] = s[n] − a × s[n−1] (1)

where s[n] is the current sample, s[n−1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames is overlapped by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

s̃_i[n] = ŝ_i[n] w[n], 0 ≤ n ≤ N−1 (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 − 0.46 cos(2πn / (N−1)), 0 ≤ n ≤ N−1 (3)


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using FFT:

X_i[k] = Σ_{n=0}^{N−1} s̃_i[n] e^{−j2πkn/N}, 0 ≤ k ≤ N−1 (4)

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k], 0 ≤ b < B, 0 ≤ k ≤ N−1 (5)

where B is the total number of filters (B is 25 in this study), and I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|². I_b^l and I_b^h are given as

I_b^l = ⌊(f_b^l / f_s) N⌋, I_b^h = ⌊(f_b^h / f_s) N⌋ (6)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying DCT on the logarithm of E(b):

MFCC_i(l) = Σ_{b=0}^{B−1} log₁₀(1 + E_i(b)) cos(πl(b + 0.5) / B), 0 ≤ l < L (7)

where L is the length of the MFCC feature vector (L is 20 in this study).


Therefore, the MFCC feature vector can be represented as follows:

x_MFCC = [MFCC(0), MFCC(1), …, MFCC(L−1)]^T (8)
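The per-frame part of the pipeline above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the thesis implementation: the function name, the `mel_edges` argument (the band edges of Table 2.1), and the parameter names are assumptions for the example, and framing (Step 2) is assumed to have already produced a single frame.

```python
import numpy as np

def mfcc_frame(frame, mel_edges, sr, n_fft, n_mfcc=20, alpha=0.95):
    """Sketch of the per-frame MFCC steps (Eqs. 1-7). `mel_edges` is a
    list of (f_low, f_high) band edges in Hz, as in Table 2.1."""
    # Step 1: pre-emphasis  s^[n] = s[n] - a*s[n-1]
    emph = np.append(frame[0], frame[1:] - alpha * frame[:-1])
    # Step 3: Hamming window
    windowed = emph * np.hamming(len(emph))
    # Step 4: FFT and squared amplitude A[k] = |X[k]|^2
    A = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    # Step 5: sum A[k] over the bins of each Mel-scale band (Eqs. 5-6)
    E = np.array([A[int(fl * n_fft / sr): int(fh * n_fft / sr) + 1].sum()
                  for fl, fh in mel_edges])
    # Step 6: DCT of the log band energies (Eq. 7)
    B = len(E)
    l = np.arange(n_mfcc)[:, None]
    b = np.arange(B)[None, :]
    return (np.log10(1 + E) * np.cos(np.pi * l * (b + 0.5) / B)).sum(axis=1)
```

Applying this to every frame of a track yields the trajectory of MFCC vectors that the modulation spectral analysis in Section 2.1.4 operates on.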

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number: Frequency interval (Hz)
0: (0, 200]      1: (100, 300]     2: (200, 400]     3: (300, 500]
4: (400, 600]    5: (500, 700]     6: (600, 800]     7: (700, 900]
8: (800, 1000]   9: (900, 1149]    10: (1000, 1320]  11: (1149, 1516]
12: (1320, 1741] 13: (1516, 2000]  14: (1741, 2297]  15: (2000, 2639]
16: (2297, 3031] 17: (2639, 3482]  18: (3031, 4000]  19: (3482, 4595]
20: (4000, 5278] 21: (4595, 6063]  22: (5278, 6964]  23: (6063, 8000]
24: (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, while spectral valleys correspond to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k], 0 ≤ b < B, 0 ≤ k ≤ N−1 (9)

where B is the number of subbands, and I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|². I_b^l and I_b^h are given as

I_b^l = ⌊(f_b^l / f_s) N⌋, I_b^h = ⌊(f_b^h / f_s) N⌋ (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = log(1 + (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,i}) (11)

Valley(b) = log(1 + (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b−i+1}) (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) − Valley(b) (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x_OSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T (14)
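The peak/valley selection of Eqs. (11)-(13) can be sketched as follows. This is an illustrative sketch under assumptions: `band_indices` (the FFT-bin ranges of the octave bands, derivable from Table 2.2 and Eq. (10)) is an assumed input, and the top/bottom αN_b bins are averaged as the equations describe.

```python
import numpy as np

def osc_frame(A, band_indices, alpha=0.2):
    """Sketch of OSC peak/valley selection (Eqs. 11-14). `A` is the squared
    magnitude spectrum of one frame; `band_indices` maps each octave band
    to its (low, high) FFT-bin range."""
    valleys, contrasts = [], []
    for lo, hi in band_indices:
        M = np.sort(A[lo:hi + 1])[::-1]      # magnitudes in decreasing order
        Nb = len(M)
        k = max(1, int(round(alpha * Nb)))   # neighborhood of alpha*Nb bins
        peak = np.log10(1 + M[:k].mean())    # Eq. (11): average of largest bins
        valley = np.log10(1 + M[-k:].mean()) # Eq. (12): average of smallest bins
        valleys.append(valley)
        contrasts.append(peak - valley)      # Eq. (13)
    return np.array(valleys + contrasts)     # Eq. (14): valleys then contrasts
```

Averaging a small neighborhood of bins, rather than taking the single largest or smallest bin, makes the peak and valley estimates less sensitive to isolated noisy bins.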

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number: Frequency interval (Hz)
0: [0, 0]        1: (0, 100]       2: (100, 200]     3: (200, 400]
4: (400, 800]    5: (800, 1600]    6: (1600, 3200]   7: (3200, 6400]
8: (6400, 12800] 9: (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = (1 / (E_w N)) |X(k)|², k = 0 or k = N/2
P(k) = (2 / (E_w N)) |X(k)|², 0 < k < N/2 (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|² (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The subband filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = Σ_{k=I_b^l}^{I_b^h} P_i(k), 0 ≤ b < B, 0 ≤ k ≤ N−1 (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16, r = 1/2 in this study):

r = 2^j octaves, −4 ≤ j ≤ 3 (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_b^l = ⌊(f_b^l / f_s) N⌋, I_b^h = ⌊(f_b^h / f_s) N⌋ (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k), 0 ≤ b ≤ B+1 (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 log₁₀(ASE(b)), 0 ≤ b ≤ B+1 (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R, 0 ≤ b ≤ B+1 (22)

where the RMS-norm gain value R is defined as

R = (Σ_{b=0}^{B+1} (ASE_dB(b))²)^{1/2} (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame is represented as follows:

x_NASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T (24)
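Steps 2-3 can be sketched as below. The function name and the `band_indices` argument (the bin ranges of Table 2.3) are illustrative assumptions, and the power spectrum `P` is assumed to have been computed already per Eq. (15). Note that, by construction, the NASE coefficients have unit RMS norm after division by R.

```python
import numpy as np

def nase_frame(P, band_indices):
    """Sketch of NASE (Eqs. 20-24): sum the power spectrum in each
    logarithmic band, convert to dB, then normalize by the RMS gain R.
    `P` is the normalized power spectrum of one frame (Eq. 15)."""
    # Eq. (20): band energies (one entry per band-pass filter)
    ase = np.array([P[lo:hi + 1].sum() for lo, hi in band_indices])
    # Eq. (21): decibel scale; small epsilon guards log10(0)
    ase_db = 10.0 * np.log10(ase + 1e-12)
    # Eq. (23): RMS-norm gain value
    R = np.sqrt((ase_db ** 2).sum())
    # Eq. (24): prepend R to the normalized coefficients (Eq. 22)
    return np.concatenate(([R], ase_db / R))
```

Because every frame's envelope is scaled to unit RMS norm, the NASE shape describes the spectral balance independently of the overall frame loudness, which is carried separately by R.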


Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 band coefficients between loEdge and hiEdge = 16 kHz, and one coefficient above hiEdge)


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number: Frequency interval (Hz)
0: (0, 62]        1: (62, 88]       2: (88, 125]      3: (125, 176]
4: (176, 250]     5: (250, 353]     6: (353, 500]     7: (500, 707]
8: (707, 1000]    9: (1000, 1414]   10: (1414, 2000]  11: (2000, 2828]
12: (2828, 4000]  13: (4000, 5656]  14: (5656, 8000]  15: (8000, 11313]
16: (11313, 16000] 17: (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{t×(W/2)+n}[l] e^{−j2πmn/W}, 0 ≤ m < W, 0 ≤ l < L (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|, 0 ≤ m < W, 0 ≤ l < L (26)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

MSP^MFCC(j, l) = max_{Φ_j^l ≤ m < Φ_j^h} M^MFCC(m, l) (27)

MSV^MFCC(j, l) = min_{Φ_j^l ≤ m < Φ_j^h} M^MFCC(m, l) (28)

where Φ_j^l and Φ_j^h are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^MFCC(j, l) = MSP^MFCC(j, l) − MSV^MFCC(j, l) (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
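Steps 2-3 can be sketched as follows for any per-frame feature matrix (MFCC here; the same code applies to OSC and NASE trajectories in the next subsections). This is an illustrative sketch: the function name is an assumption, and the default subband index ranges follow Table 2.4.

```python
import numpy as np

def modulation_contrast(F, W=512,
                        subbands=((0, 2), (2, 4), (4, 8), (8, 16),
                                  (16, 32), (32, 64), (64, 128), (128, 256))):
    """Sketch of Eqs. (25)-(29): FFT along each feature trajectory within
    50%-overlapped texture windows of length W, average the magnitudes over
    windows, then take max (MSP) and min (MSV) per modulation subband.
    `F` is a (num_frames x L) matrix of per-frame feature vectors."""
    hop = W // 2
    starts = range(0, F.shape[0] - W + 1, hop)
    # Eq. (26): magnitude modulation spectrogram averaged over texture windows
    M = np.mean([np.abs(np.fft.fft(F[s:s + W], axis=0)) for s in starts],
                axis=0)                              # shape (W, L)
    L = F.shape[1]
    MSC = np.empty((L, len(subbands)))
    MSV = np.empty((L, len(subbands)))
    for j, (lo, hi) in enumerate(subbands):
        band = M[lo:hi]                              # modulation bins of subband j
        MSV[:, j] = band.min(axis=0)                 # Eq. (28)
        MSC[:, j] = band.max(axis=0) - MSV[:, j]     # Eqs. (27), (29)
    return MSC, MSV
```

With L = 20 MFCC per frame and J = 8 subbands, the two returned matrices together give the 2×20×8 = 320 MMFCC values stated above.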

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.


Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} OSC_{t×(W/2)+n}[d] e^{−j2πmn/W}, 0 ≤ m < W, 0 ≤ d < D (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^OSC(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|, 0 ≤ m < W, 0 ≤ d < D (31)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

MSP^OSC(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M^OSC(m, d) (32)

MSV^OSC(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M^OSC(m, d) (33)

where Φ_j^l and Φ_j^h are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d) (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} NASE_{t×(W/2)+n}[d] e^{−j2πmn/W}, 0 ≤ m < W, 0 ≤ d < D (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^NASE(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|, 0 ≤ m < W, 0 ≤ d < D (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

MSP^NASE(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M^NASE(m, d) (37)

MSV^NASE(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M^NASE(m, d) (38)

where Φ_j^l and Φ_j^h are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d) (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowing/averaging of the modulation spectrum → contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number: Modulation frequency index range / Modulation frequency interval (Hz)
0: [0, 2) / [0, 0.33)
1: [2, 4) / [0.33, 0.66)
2: [4, 8) / [0.66, 1.32)
3: [8, 16) / [1.32, 2.64)
4: [16, 32) / [2.64, 5.28)
5: [32, 64) / [5.28, 10.56)
6: [64, 128) / [10.56, 21.12)
7: [128, 256) / [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

μ_MSC-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSC^MFCC(j, l) (40)

σ_MSC-row^MFCC(l) = ((1/J) Σ_{j=0}^{J−1} (MSC^MFCC(j, l) − μ_MSC-row^MFCC(l))²)^{1/2} (41)

μ_MSV-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSV^MFCC(j, l) (42)

σ_MSV-row^MFCC(l) = ((1/J) Σ_{j=0}^{J−1} (MSV^MFCC(j, l) − μ_MSV-row^MFCC(l))²)^{1/2} (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_row^MFCC = [μ_MSC-row^MFCC(0), σ_MSC-row^MFCC(0), μ_MSV-row^MFCC(0), σ_MSV-row^MFCC(0), …, μ_MSC-row^MFCC(L−1), σ_MSC-row^MFCC(L−1), μ_MSV-row^MFCC(L−1), σ_MSV-row^MFCC(L−1)]^T (44)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

u_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(l, j)    (45)

\sigma_{MSC-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(l, j) - u_{MSC-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

u_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(l, j)    (47)

\sigma_{MSV-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(l, j) - u_{MSV-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_{col}^{MFCC} = [ u_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), u_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), \ldots, u_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), u_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1) ]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [ (f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T ]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
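The aggregation of Eqs. (40)-(50) amounts to taking means and standard deviations along both axes of the L×J MSC and MSV matrices. A minimal numpy sketch of this step follows (our own illustration; Eq. (44) interleaves the u and σ values per dimension, whereas this sketch concatenates them blockwise, which does not change the information carried by the vector):

```python
import numpy as np

def aggregate(msc, msv):
    """Row- and column-based statistical aggregation of the (L, J) MSC/MSV
    matrices, yielding a feature vector of length 4L + 4J."""
    feats = []
    for m in (msc, msv):
        feats += [m.mean(axis=1), m.std(axis=1)]   # row-based, Eqs. (40)-(43)
    for m in (msc, msv):
        feats += [m.mean(axis=0), m.std(axis=0)]   # column-based, Eqs. (45)-(48)
    return np.concatenate(feats)

L, J = 20, 8                          # 20 MFCC dimensions, 8 modulation subbands
f = aggregate(np.random.rand(L, J), np.random.rand(L, J))
assert f.shape == (4 * L + 4 * J,)    # 112 feature values for SMMFCC
```

The same function covers SMOSC (D = 20 rows) and SMASE (D = 19 rows) by passing their MSC/MSV matrices.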

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of

the MSC and MSV matrices of MOSC can be computed as follows

u_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(d, j)    (51)

\sigma_{MSC-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(d, j) - u_{MSC-row}^{OSC}(d) \right)^2 \right)^{1/2}    (52)

u_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(d, j)    (53)

\sigma_{MSV-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(d, j) - u_{MSV-row}^{OSC}(d) \right)^2 \right)^{1/2}    (54)


Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

f_{row}^{OSC} = [ u_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), u_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), \ldots, u_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), u_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1) ]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(d, j)    (56)

\sigma_{MSC-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(d, j) - u_{MSC-col}^{OSC}(j) \right)^2 \right)^{1/2}    (57)

u_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(d, j)    (58)

\sigma_{MSV-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(d, j) - u_{MSV-col}^{OSC}(j) \right)^2 \right)^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_{col}^{OSC} = [ u_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots, u_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), u_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1) ]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [ (f_{row}^{OSC})^T, (f_{col}^{OSC})^T ]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(d, j)    (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(d, j) - u_{MSC-row}^{NASE}(d) \right)^2 \right)^{1/2}    (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(d, j)    (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(d, j) - u_{MSV-row}^{NASE}(d) \right)^2 \right)^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

f_{row}^{NASE} = [ u_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, u_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), u_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1) ]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(d, j)    (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(d, j) - u_{MSC-col}^{NASE}(j) \right)^2 \right)^{1/2}    (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(d, j)    (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(d, j) - u_{MSV-col}^{NASE}(j) \right)^2 \right)^{1/2}    (70)


Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_{col}^{NASE} = [ u_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, u_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), u_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1) ]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [ (f_{row}^{NASE})^T, (f_{col}^{NASE})^T ]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


[Fig. 2.8 The row-based modulation spectral aggregation: the mean μ_d^row and standard deviation σ_d^row of each row (feature dimension d) of the MSC/MSV matrices are computed along the modulation-frequency axis]

[Fig. 2.9 The column-based modulation spectral aggregation: the mean μ_j^col and standard deviation σ_j^col of each column (modulation subband j) of the MSC/MSV matrices are computed along the feature-dimension axis]


2.1.6 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
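The normalization of Eqs. (73)-(75) is a standard min-max scaling whose extremes are estimated on the training set only. A small sketch (function names are ours, for illustration):

```python
import numpy as np

def fit_minmax(train):
    """Per-dimension extremes over all training vectors (Eq. 75).
    train: (N, M) array of feature vectors."""
    return train.min(axis=0), train.max(axis=0)

def normalize(f, f_min, f_max):
    """Linear normalization of one feature vector (Eq. 74)."""
    return (f - f_min) / (f_max - f_min)

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [2.0, 20.0]])
lo, hi = fit_minmax(X)
print(normalize(X[2], lo, hi))   # -> [0.5 0.5]
```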

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of each class. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance.

H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right)    (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the


orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the

corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

w = (ΦΛ^{-1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{-1/2} Ψ    (80)

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x    (81)
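The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched with plain numpy as follows. This is our own illustrative implementation, not the thesis code; it assumes S_W is nonsingular (enough training vectors per class):

```python
import numpy as np

def whitened_lda(X, y, C):
    """Whitened LDA transformation matrix A_WLDA (Eqs. 76-80), sketched.

    X: (N, H) training vectors; y: (N,) integer class labels in [0, C).
    Returns A_WLDA with shape (H, C-1).
    """
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)              # Eq. (76)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                  # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                  # S_W = Phi Lam Phi^T
    W = Phi * lam ** -0.5                          # whitening matrix Phi Lam^{-1/2}
    Sb_w = W.T @ Sb @ W                            # whitened between-class scatter
    evals, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(evals)[::-1][:C - 1]]  # top (C-1) eigenvectors
    return W @ Psi                                 # Eq. (80): Phi Lam^{-1/2} Psi

# A feature vector x is then reduced by y = A_WLDA.T @ x (Eq. 81).
```

By construction, the transformed within-class scatter A^T S_W A equals the identity, which is what makes Euclidean distances meaningful in the reduced space.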

2.3 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA

transformed feature vector In this study the nearest centroid classifier is used for

music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the

c-th music genre and Nc is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
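The nearest-centroid decision of Eqs. (82)-(83) then reduces to one distance computation per genre. A minimal sketch (function names are ours):

```python
import numpy as np

def genre_centroids(Y, labels, C):
    """Per-genre mean of the transformed training vectors (Eq. 82)."""
    return np.array([Y[labels == c].mean(axis=0) for c in range(C)])

def classify(y_vec, centroids):
    """Eq. (83): index of the centroid nearest to y_vec in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(centroids - y_vec, axis=1)))

cents = np.array([[0.0, 0.0], [5.0, 5.0]])
assert classify(np.array([0.4, 0.1]), cents) == 0
assert classify(np.array([4.0, 6.0]), cents) == 1
```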

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this

study each MP3 audio file is first converted into raw digital audio before

classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified tracks is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
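Eq. (84) is a class-prior-weighted mean of the per-class accuracies. As a consistency check, plugging the test-set class sizes and the per-class accuracies of Table 3.6(d) into this sketch reproduces the reported overall accuracy of 85.32% (function name ours):

```python
def overall_accuracy(per_class_acc, class_counts):
    """CA = sum_c P_c * CA_c (Eq. 84), with P_c the class share of the test set."""
    total = sum(class_counts)
    return sum(a * n / total for a, n in zip(per_class_acc, class_counts))

counts = [320, 114, 26, 45, 102, 122]                    # test tracks per genre
acc = [0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623]   # Table 3.6(d) diagonal
print(round(overall_accuracy(acc, counts), 4))           # -> 0.8532
```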

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64


Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each part, the first matrix gives track counts and the second the corresponding percentages; columns denote the actual genre and rows the classified genre.

(a) SMMFCC1 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          275           0      2           0         1     19
Electronic         0          91      0           1         7      6
Jazz               6           0     18           0         0      4
Metal/Punk         2           3      0          36        20      4
Pop/Rock           4          12      5           8        70     14
World             33           8      1           0         4     75
Total            320         114     26          45       102    122

(a) SMMFCC1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        85.94        0.00   7.69        0.00      0.98  15.57
Electronic      0.00       79.82   0.00        2.22      6.86   4.92
Jazz            1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk      0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock        1.25       10.53  19.23       17.78     68.63  11.48
World          10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          292           1      1           0         2     10
Electronic         1          89      1           2        11     11
Jazz               4           0     19           1         1      6
Metal/Punk         0           5      0          32        21      3
Pop/Rock           0          13      3          10        61      8
World             23           6      2           0         6     84
Total            320         114     26          45       102    122

(b) SMOSC1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        91.25        0.88   3.85        0.00      1.96   8.20
Electronic      0.31       78.07   3.85        4.44     10.78   9.02
Jazz            1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk      0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock        0.00       11.40  11.54       22.22     59.80   6.56
World           7.19        5.26   7.69        0.00      5.88  68.85


(c) SMASE1 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          286           3      1           0         3     18
Electronic         0          87      1           1         9      5
Jazz               5           4     17           0         0      9
Metal/Punk         0           4      1          36        18      4
Pop/Rock           1          10      3           7        68     13
World             28           6      3           1         4     73
Total            320         114     26          45       102    122

(c) SMASE1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        89.38        2.63   3.85        0.00      2.94  14.75
Electronic      0.00       76.32   3.85        2.22      8.82   4.10
Jazz            1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk      0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock        0.31        8.77  11.54       15.56     66.67  10.66
World           8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          300           0      1           0         0      9
Electronic         0          96      1           1         9      9
Jazz               2           1     21           0         0      1
Metal/Punk         0           1      0          34         8      1
Pop/Rock           1           9      2           9        80     16
World             17           7      1           1         5     86
Total            320         114     26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        0.00   3.85        0.00      0.00   7.38
Electronic      0.00       84.21   3.85        2.22      8.82   7.38
Jazz            0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk      0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock        0.31        7.89   7.69       20.00     78.43  13.11
World           5.31        6.14   3.85        2.22      4.90  70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each part, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC2 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          272           1      1           0         6     22
Electronic         0          84      0           2         8      4
Jazz              13           1     19           1         2     19
Metal/Punk         2           7      0          39        30      4
Pop/Rock           0          11      3           3        47     19
World             33          10      3           0         9     54
Total            320         114     26          45       102    122

(a) SMMFCC2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        85.00        0.88   3.85        0.00      5.88  18.03
Electronic      0.00       73.68   0.00        4.44      7.84   3.28
Jazz            4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk      0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock        0.00        9.65  11.54        6.67     46.08  15.57
World          10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          262           2      0           0         3     33
Electronic         0          83      0           1         9      6
Jazz              17           1     20           0         6     20
Metal/Punk         1           5      0          33        21      2
Pop/Rock           0          17      4          10        51     10
World             40           6      2           1        12     51
Total            320         114     26          45       102    122

(b) SMOSC2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        81.88        1.75   0.00        0.00      2.94  27.05
Electronic      0.00       72.81   0.00        2.22      8.82   4.92
Jazz            5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk      0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock        0.00       14.91  15.38       22.22     50.00   8.20
World          12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          277           0      0           0         2     29
Electronic         0          83      0           1         5      2
Jazz               9           3     17           1         2     15
Metal/Punk         1           5      1          35        24      7
Pop/Rock           2          13      1           8        57     15
World             31          10      7           0        12     54
Total            320         114     26          45       102    122

(c) SMASE2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        86.56        0.00   0.00        0.00      1.96  23.77
Electronic      0.00       72.81   0.00        2.22      4.90   1.64
Jazz            2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk      0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock        0.63       11.40   3.85       17.78     55.88  12.30
World           9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          289           5      0           0         3     18
Electronic         0          89      0           2         4      4
Jazz               2           3     19           0         1     10
Metal/Punk         2           2      0          38        21      2
Pop/Rock           0          12      5           4        61     11
World             27           3      2           1        12     77
Total            320         114     26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        90.31        4.39   0.00        0.00      2.94  14.75
Electronic      0.00       78.07   0.00        4.44      3.92   3.28
Jazz            0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk      0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock        0.00       10.53  19.23        8.89     59.80   9.02
World           8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each part, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC3 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          300           2      1           0         3     19
Electronic         0          86      0           1         7      5
Jazz               2           0     18           0         0      3
Metal/Punk         1           4      0          35        18      2
Pop/Rock           1          16      4           8        67     13
World             16           6      3           1         7     80
Total            320         114     26          45       102    122

(a) SMMFCC3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        1.75   3.85        0.00      2.94  15.57
Electronic      0.00       75.44   0.00        2.22      6.86   4.10
Jazz            0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk      0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock        0.31       14.04  15.38       17.78     65.69  10.66
World           5.00        5.26  11.54        2.22      6.86  65.57


(b) SMOSC3 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          300           0      0           0         1     13
Electronic         0          90      1           2         9      6
Jazz               0           0     21           0         0      4
Metal/Punk         0           2      0          31        21      2
Pop/Rock           0          11      3          10        64     10
World             20          11      1           2         7     87
Total            320         114     26          45       102    122

(b) SMOSC3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        0.00   0.00        0.00      0.98  10.66
Electronic      0.00       78.95   3.85        4.44      8.82   4.92
Jazz            0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk      0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock        0.00        9.65  11.54       22.22     62.75   8.20
World           6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          296           2      1           0         0     17
Electronic         1          91      0           1         4      3
Jazz               0           2     19           0         0      5
Metal/Punk         0           2      1          34        20      8
Pop/Rock           2          13      4           8        71      8
World             21           4      1           2         7     81
Total            320         114     26          45       102    122

(c) SMASE3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        92.50        1.75   3.85        0.00      0.00  13.93
Electronic      0.31       79.82   0.00        2.22      3.92   2.46
Jazz            0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk      0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock        0.63       11.40  15.38       17.78     69.61   6.56
World           6.56        3.51   3.85        4.44      6.86  66.39


(d) SMMFCC3+SMOSC3+SMASE3 (track counts)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic          300           2      0           0         0      8
Electronic         2          95      0           2         7      9
Jazz               1           1     20           0         0      0
Metal/Punk         0           0      0          35        10      1
Pop/Rock           1          10      3           7        79     11
World             16           6      3           1         6     93
Total            320         114     26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        1.75   0.00        0.00      0.00   6.56
Electronic      0.63       83.33   0.00        4.44      6.86   7.38
Jazz            0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk      0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock        0.31        8.77  11.54       15.56     77.45   9.02
World           5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 compares the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs yields better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as feature values

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                              77.50     72.02
SMMFCC2                              70.64     69.82
SMMFCC3                              80.38     79.15
SMOSC1                               79.15     77.50
SMOSC2                               68.59     70.51
SMOSC3                               81.34     80.11
SMASE1                               77.78     76.41
SMASE2                               71.74     71.06
SMASE3                               81.21     79.15
SMMFCC1+SMOSC1+SMASE1                84.64     85.08
SMMFCC2+SMOSC2+SMASE2                78.60     79.01
SMMFCC3+SMOSC3+SMASE3                85.32     85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value. For each spectral/cepstral feature set, a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre

Classification Contest


References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664


[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch, "A hierarchical approach to automatic musical genre classification," in Proc of the 6th Int Conf on Digital Audio Effects, pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara, "Music genre classification with taxonomy," in Proc of IEEE Int Conf on Acoustics Speech and Signal Processing, Vol 5, pp 197-200,

March 2005

[16] J J Aucouturier and F Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol 32, No 1, pp 83-93, 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley, "Beat tracking with a two state model," in


Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis, A Ermolinskyi, and P Cook, "Pitch Histogram in Audio and Symbolic Music Information Retrieval," in Proc IRCAM, 2002

[21] T Tolonen and M Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol 8, No 6, pp

708-716 November 2000

[22] R Meddis and L O'Mard, "A unitary model of pitch perception," Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury, N Morgan, and S Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun, Vol 25, No 1, pp 117-132,

1998

[25] S Sukittanon, L E Atlas, and J W Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol 52, No 10,

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content


indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu, N C Maddage, and X Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol 13,

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy, S Krishnan, and R K Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, Vol 15, Issue 4, pp 1236-1246, May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergstra, N Casagrande, D Erhan, D Eck, B Kégl, Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Y Freund and R E Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1) (1997) 119-139


features directly from the audio data The mean and standard deviations of the

distances and angles in the phase space with an embedding dimension of two and unit

time lag were used

1.2.2 Linear Discriminant Analysis (LDA)

LDA [28] aims at improving the classification accuracy at a lower dimensional

feature vector space LDA deals with discrimination between classes rather than

representations of various classes The goal of LDA is to minimize the within-class

distance while maximize the between-class distance In LDA an optimal

transformation matrix from an n-dimensional feature space to d-dimensional space is

determined where d le n The transformation should enhance the separability among

different classes The optimal transformation matrix can be exploited to map each

n-dimensional feature vector into a d-dimensional vector The detailed steps will be

described in Chapter 2

In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all classes, which does not take class-wise differences into account.

1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch

features with GMM classifier to their music genre classification system The

hierarchical genres adopted in their music classification system are Classical Country

Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the

sub-genres contain Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres contain BigBand, Cool, Fusion, Piano, Quartet, and Swing. The

experiment result shows that GMM with three components achieves the best

classification accuracy

West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote over frames decides the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and decision tree achieves the best accuracy, 82.79%.

Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters from the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.

Esmaili et al. [30] use some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. Their system achieves a classification accuracy of 93.0% over five music genres: Rock, Classical, Folk, Jazz, and Pop.

Bagci and Erzin [8] constructed a novel frame-based music genre classification system in which invalid frames are first detected and discarded. To determine whether a frame is valid, a GMM model is constructed for each music genre. These GMM models are used to sift out the frames that cannot be correctly classified, and the GMM model of a music genre is updated for each correctly classified frame. Moreover, a separate GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genres: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from the high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel measures, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to quantify the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, comprising the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experiment results show that when the LDB feature vector is combined with MFCC and LDA analysis is applied, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT obtained by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.
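The difference between the two decompositions can be illustrated with Haar analysis filters (a toy sketch with our own function names, not Grimaldi et al.'s implementation):

```python
import numpy as np

def haar_step(x):
    """One level of Haar analysis: low-pass (scaled averages) and high-pass (details)."""
    x = x.reshape(-1, 2)
    return (x[:, 0] + x[:, 1]) / np.sqrt(2), (x[:, 0] - x[:, 1]) / np.sqrt(2)

def dwt(x, levels):
    """DWT: recursively decompose only the low-pass subband."""
    bands = []
    for _ in range(levels):
        x, d = haar_step(x)
        bands.append(d)
    bands.append(x)
    return bands  # `levels` detail bands plus the final approximation

def wpt(x, levels):
    """WPT: recursively decompose BOTH subbands at each level."""
    nodes = [x]
    for _ in range(levels):
        nodes = [half for n in nodes for half in haar_step(n)]
    return nodes  # 2**levels leaf subbands
```

For a signal of length 2^n, n levels of DWT yield n+1 bands while n levels of WPT yield 2^n equal-width leaf subbands, which is the structural distinction the paragraph describes.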

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.
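To make the boosting mechanism concrete, a minimal binary AdaBoost with decision stumps can be sketched as follows (a toy illustration, not the configuration used in the cited work):

```python
import numpy as np

def adaboost_train(X, y, rounds=10):
    """Minimal binary AdaBoost with decision stumps (labels in {-1, +1}).

    Each round fits the stump minimizing weighted error, then reweights
    the samples so that later stumps focus on the remaining mistakes.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []  # entries: (feature, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] <= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = max(err, 1e-10)                     # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)     # stump weight
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    score = np.zeros(len(X))
    for j, thr, pol, alpha in ensemble:
        score += alpha * pol * np.where(X[:, j] <= thr, 1, -1)
    return np.sign(score)
```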

1.3 Outline of the Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, conclusions are given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have proven very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

\hat{s}[n] = s[n] - a \cdot s[n-1]    (1)

where s[n] is the current sample, s[n−1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

\tilde{s}_i[n] = \hat{s}_i[n] \, w[n],  0 ≤ n ≤ N−1    (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right),  0 ≤ n ≤ N−1    (3)


Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi k n / N},  0 ≤ k ≤ N−1    (4)

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k],  0 ≤ b < B,  0 ≤ k ≤ N/2−1    (5)

where B is the total number of filters (B is 25 in this study), and I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2.

I_{b_l} and I_{b_h} are given as

I_{b_l} = \frac{f_{b_l}}{f_s / N},  I_{b_h} = \frac{f_{b_h}}{f_s / N}    (6)

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC are obtained by applying the DCT to the logarithm of E(b):

MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}(1 + E(b)) \cos\left(\frac{(b + 0.5) \, l \, \pi}{B}\right),  0 ≤ l < L    (7)

where L is the length of the MFCC feature vector (L is 20 in this study).


Therefore, the MFCC feature vector can be represented as follows:

x^{MFCC} = [MFCC(0), MFCC(1), …, MFCC(L−1)]^T    (8)

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 200]
1   (100, 300]
2   (200, 400]
3   (300, 500]
4   (400, 600]
5   (500, 700]
6   (600, 800]
7   (700, 900]
8   (800, 1000]
9   (900, 1149]
10  (1000, 1320]
11  (1149, 1516]
12  (1320, 1741]
13  (1516, 2000]
14  (1741, 2297]
15  (2000, 2639]
16  (2297, 3031]
17  (2639, 3482]
18  (3031, 4000]
19  (3482, 4595]
20  (4000, 5278]
21  (4595, 6063]
22  (5278, 6964]
23  (6063, 8000]
24  (6964, 9190]
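Steps 1 and 3-6 above can be sketched for a single frame as follows (a simplified illustration in NumPy; the filter-edge list stands in for Table 2.1, and framing/overlap is omitted):

```python
import numpy as np

def mfcc_frame(frame, filter_edges, fs, L=20, a=0.95):
    """Compute L MFCCs for one frame, following Steps 1-6 in the text.

    filter_edges: list of (f_low, f_high) pairs in Hz, as in Table 2.1.
    """
    # Step 1: pre-emphasis  s^[n] = s[n] - a * s[n-1]
    s = np.append(frame[0], frame[1:] - a * frame[:-1])
    N = len(s)
    # Step 3: Hamming window
    s = s * np.hamming(N)
    # Step 4: FFT and squared amplitude  A[k] = |X[k]|^2
    A = np.abs(np.fft.fft(s)) ** 2
    # Step 5: band energies  E(b) = sum of A[k] over each filter's bin range
    E = np.empty(len(filter_edges))
    for b, (fl, fh) in enumerate(filter_edges):
        Il, Ih = int(fl / (fs / N)), int(fh / (fs / N))
        E[b] = A[Il:Ih + 1].sum()
    # Step 6: DCT of the log band energies, Eq. (7)
    B = len(E)
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    return (np.log10(1 + E) * np.cos(l * (b + 0.5) * np.pi / B)).sum(axis=1)
```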

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k],  0 ≤ b < B,  0 ≤ k ≤ N/2−1    (9)

where B is the number of subbands, and I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2.

I_{b_l} and I_{b_h} are given as

I_{b_l} = \frac{f_{b_l}}{f_s / N},  I_{b_h} = \frac{f_{b_h}}{f_s / N}    (10)

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:


Peak(b) = \log\left(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i}\right)    (11)

Valley(b) = \log\left(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)    (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) − Valley(b)    (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x^{OSC} = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T    (14)

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number  Frequency interval (Hz)
0  [0, 0]
1  (0, 100]
2  (100, 200]
3  (200, 400]
4  (400, 800]
5  (800, 1600]
6  (1600, 3200]
7  (3200, 6400]
8  (6400, 12800]
9  (12800, 22050)
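A per-frame sketch of Steps 1-3 above (our own helper name; the small epsilon guarding log(0) is an implementation detail not specified in the text):

```python
import numpy as np

def osc_frame(frame, band_edges, fs, alpha=0.2):
    """Spectral contrast per subband, following Steps 1-3 of Section 2.1.2.

    band_edges: (f_low, f_high) pairs in Hz (cf. Table 2.2).
    Returns [Valley(0..B-1), SC(0..B-1)].
    """
    N = len(frame)
    A = np.abs(np.fft.fft(frame)) ** 2
    valleys, contrasts = [], []
    for fl, fh in band_edges:
        Il, Ih = int(fl / (fs / N)), int(fh / (fs / N))
        M = np.sort(A[Il:Ih + 1])[::-1]            # magnitudes in decreasing order
        k = max(1, int(alpha * len(M)))            # neighborhood size alpha * N_b
        peak = np.log(M[:k].mean() + 1e-12)        # Eq. (11): mean of the k largest
        valley = np.log(M[-k:].mean() + 1e-12)     # Eq. (12): mean of the k smallest
        valleys.append(valley)
        contrasts.append(peak - valley)            # Eq. (13)
    return np.array(valleys + contrasts)
```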

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames. Each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the FFT size. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = \begin{cases} \dfrac{1}{E_w N} |X(k)|^2, & k = 0, \ N/2 \\ \dfrac{2}{E_w N} |X(k)|^2, & 0 < k < N/2 \end{cases}    (15)


where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w-1} |w(n)|^2    (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The subband filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k),  0 ≤ b < B,  0 ≤ k ≤ N/2−1    (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves},  −4 ≤ j ≤ 3    (18)

I_{b_l} and I_{b_h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_{b_l} = \frac{f_{b_l}}{f_s / N},  I_{b_h} = \frac{f_{b_h}}{f_s / N}    (19)

where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k),  0 ≤ b ≤ B+1    (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10 \log_{10}(ASE(b)),  0 ≤ b ≤ B+1    (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R},  0 ≤ b ≤ B+1    (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1} \left(ASE_{dB}(b)\right)^2}    (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame is represented as follows:

x^{NASE} = [R, NASE(0), NASE(1), …, NASE(B+1)]^T    (24)


Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: one coefficient below loEdge (62.5 Hz), 16 coefficients in the logarithmically spaced bands between loEdge and hiEdge (16 kHz), and one coefficient above hiEdge


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 62]
1   (62, 88]
2   (88, 125]
3   (125, 176]
4   (176, 250]
5   (250, 353]
6   (353, 500]
7   (500, 707]
8   (707, 1000]
9   (1000, 1414]
10  (1414, 2000]
11  (2000, 2828]
12  (2828, 4000]
13  (4000, 5656]
14  (5656, 8000]
15  (8000, 11313]
16  (11313, 16000]
17  (16000, 22050]
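A per-frame sketch of Steps 1-3 above (illustrative only: it keeps just the in-band coefficients, omitting the below-loEdge and above-hiEdge coefficients of the full MPEG-7 descriptor, and adds a small epsilon before the logarithm):

```python
import numpy as np

def nase_frame(frame, band_edges, fs):
    """NASE of one frame, following Steps 1-3 of Section 2.1.3.

    band_edges: (f_low, f_high) pairs in Hz (cf. Table 2.3).
    Returns [R, NASE(0), ..., NASE(B-1)] for the given bands.
    """
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                             # window energy, Eq. (16)
    X = np.fft.fft(frame * w)
    P = np.abs(X) ** 2 / (Ew * N)                   # power spectrum, Eq. (15)
    P[1:N // 2] *= 2                                # double the one-sided interior bins
    ase = []
    for fl, fh in band_edges:
        Il, Ih = int(fl / (fs / N)), int(fh / (fs / N))
        ase.append(P[Il:Ih + 1].sum())              # Eq. (20)
    ase_db = 10 * np.log10(np.array(ase) + 1e-12)   # Eq. (21); epsilon is ours
    R = np.sqrt(np.sum(ase_db ** 2))                # RMS-norm gain, Eq. (23)
    return np.concatenate([[R], ase_db / R])        # Eqs. (22), (24)
```

After normalization, the NASE coefficients form a unit-norm vector, with the overall level carried separately by R.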

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on the MFCC, OSC, and NASE trajectories to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W/2 + n}[l] \, e^{-j 2\pi m n / W},  0 ≤ m < W,  0 ≤ l < L    (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|,  0 ≤ m < W,  0 ≤ l < L    (26)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (28)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) − MSV^{MFCC}(j, l)    (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W/2 + n}[d] \, e^{-j 2\pi m n / W},  0 ≤ m < W,  0 ≤ d < D    (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W,  0 ≤ d < D    (31)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:


MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) − MSV^{OSC}(j, d)    (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W/2 + n}[d] \, e^{-j 2\pi m n / W},  0 ≤ m < W,  0 ≤ d < D    (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W,  0 ≤ d < D    (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) − MSV^{NASE}(j, d)    (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowing/averaging of the modulation spectra → contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number  Modulation frequency index range  Modulation frequency interval (Hz)
0  [0, 2)      [0, 0.33)
1  [2, 4)      [0.33, 0.66)
2  [4, 8)      [0.66, 1.32)
3  [8, 16)     [1.32, 2.64)
4  [16, 32)    [2.64, 5.28)
5  [32, 64)    [5.28, 10.56)
6  [64, 128)   [10.56, 21.12)
7  [128, 256)  [21.12, 42.24]
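The three variants (MMFCC, MOSC, MASE) share one recipe, so the analysis can be sketched once for an arbitrary (num_frames × D) feature trajectory (our own helper name; the default subband edges show only the first four entries of Table 2.4):

```python
import numpy as np

def modulation_msc_msv(features, W=512,
                       subband_edges=((0, 2), (2, 4), (4, 8), (8, 16))):
    """MSC and MSV matrices from a (num_frames, D) feature trajectory.

    Follows Steps 2-3 of Section 2.1.4: FFT along time within texture
    windows of length W (50% overlap), average the magnitude modulation
    spectrograms, then take the max (MSP) and min (MSV) in each
    modulation subband.
    """
    num_frames, D = features.shape
    hop = W // 2
    spectra = []
    for start in range(0, num_frames - W + 1, hop):
        seg = features[start:start + W]              # one texture window
        spectra.append(np.abs(np.fft.fft(seg, axis=0)))
    M = np.mean(spectra, axis=0)                     # averaged modulation spectrum, (W, D)
    J = len(subband_edges)
    MSC = np.empty((J, D))
    MSV = np.empty((J, D))
    for j, (lo, hi) in enumerate(subband_edges):     # modulation frequency index ranges
        MSP = M[lo:hi].max(axis=0)
        MSV[j] = M[lo:hi].min(axis=0)
        MSC[j] = MSP - MSV[j]                        # modulation spectral contrast
    return MSC, MSV
```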

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC\text{-}row}^{MFCC}(l) = \left(\frac{1}{J} \sum_{j=0}^{J-1} \left(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l)\right)^2\right)^{1/2}    (41)

\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV\text{-}row}^{MFCC}(l) = \left(\frac{1}{J} \sum_{j=0}^{J-1} \left(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l)\right)^2\right)^{1/2}    (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), \mu_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), …, \mu_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), \mu_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T    (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC\text{-}col}^{MFCC}(j) = \left(\frac{1}{L} \sum_{l=0}^{L-1} \left(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j)\right)^2\right)^{1/2}    (46)

\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV\text{-}col}^{MFCC}(j) = \left(\frac{1}{L} \sum_{l=0}^{L-1} \left(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j)\right)^2\right)^{1/2}    (48)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), \mu_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), …, \mu_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), \mu_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined, a larger feature vector of size 4L+4J is obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
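The aggregation of Eqs. (40)-(50) can be sketched compactly (our own helper; it concatenates the four statistics blockwise rather than interleaving them per coefficient as Eq. (44) does, which carries the same information):

```python
import numpy as np

def aggregate_msc_msv(MSC, MSV):
    """Row/column means and standard deviations of the MSC and MSV matrices.

    MSC, MSV have shape (J, D): J modulation subbands x D feature values.
    Row-based statistics run over j for each feature value d (4D values);
    column-based statistics run over d for each subband j (4J values).
    """
    row = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                          MSV.mean(axis=0), MSV.std(axis=0)])   # size 4D
    col = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                          MSV.mean(axis=1), MSV.std(axis=1)])   # size 4J
    return np.concatenate([row, col])                           # size 4D + 4J
```

With D = 20 and J = 8 this yields the 112-dimensional SMMFCC (or SMOSC) vector; with D = 19 it yields the 108-dimensional SMASE vector.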

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J} \sum_{j=0}^{J-1} \left(MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d)\right)^2\right)^{1/2}    (52)

\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J} \sum_{j=0}^{J-1} \left(MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d)\right)^2\right)^{1/2}    (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), \mu_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), …, \mu_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), \mu_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D} \sum_{d=0}^{D-1} \left(MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j)\right)^2\right)^{1/2}    (57)

\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D} \sum_{d=0}^{D-1} \left(MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j)\right)^2\right)^{1/2}    (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), \mu_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), …, \mu_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), \mu_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined, a larger feature vector of size 4D+4J is obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J} \sum_{j=0}^{J-1} \left(MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d)\right)^2\right)^{1/2}    (63)

\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J} \sum_{j=0}^{J-1} \left(MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d)\right)^2\right)^{1/2}    (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), \mu_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), …, \mu_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), \mu_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D} \sum_{d=0}^{D-1} \left(MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j)\right)^2\right)^{1/2}    (68)

\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D} \sum_{d=0}^{D-1} \left(MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j)\right)^2\right)^{1/2}    (70)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC\text{-}col}^{NASE}(0), \sigma_{MSC\text{-}col}^{NASE}(0), \mu_{MSV\text{-}col}^{NASE}(0), \sigma_{MSV\text{-}col}^{NASE}(0), …, \mu_{MSC\text{-}col}^{NASE}(J-1), \sigma_{MSC\text{-}col}^{NASE}(J-1), \mu_{MSV\text{-}col}^{NASE}(J-1), \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined, a larger feature vector of size 4D+4J is obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


Fig. 2.8 The row-based modulation spectral feature values: for each feature dimension, the mean μ_row and standard deviation σ_row are computed over the modulation subbands of the MSC/MSV matrices

Fig. 2.9 The column-based modulation spectral feature values: for each modulation subband, the mean μ_col and standard deviation σ_col are computed over the feature dimensions of the MSC/MSV matrices

216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)),  1 ≤ c ≤ C   (74)

where C is the number of classes, f̄_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)

f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
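The min-max mapping of Eqs. (74)-(75) can be sketched as below. This is a hedged illustration, not the thesis code: the guard against constant feature dimensions (span of zero) is an added safety measure not stated in the text, and the same min/max must be reused when normalizing test vectors.

```python
import numpy as np

def linear_normalize(train_features):
    """Min-max normalize each feature dimension to [0, 1] (Eqs. 74-75).

    train_features: (num_signals, num_features) array of raw feature vectors.
    Returns the normalized features plus the per-dimension min/max so the
    same mapping can be applied to test vectors later.
    """
    f_min = train_features.min(axis=0)  # minimum over all training signals
    f_max = train_features.max(axis=0)  # maximum over all training signals
    # Guard against constant dimensions (not in the thesis formula).
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (train_features - f_min) / span, f_min, f_max

X = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
Xn, fmin, fmax = linear_normalize(X)
print(Xn)  # each column now spans [0, 1]
```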

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification


accuracy at a lower dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T   (77)

where x̄ is the mean vector of all training vectors The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr((A^T S_W A)^{−1} (A^T S_B A))   (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space In this study a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23] First the eigenvalues and corresponding eigenvectors of S_W are calculated Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues Thus S_W Φ = ΦΛ Each training vector x is then whitening transformed by ΦΛ^{−1/2}:

w = (ΦΛ^{−1/2})^T x   (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}) derived from all the whitened training vectors will become an identity matrix I Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ Finally the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ   (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector y can be computed by

y = A_WLDA^T x   (81)
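The whitening-plus-LDA procedure of Eqs. (76)-(80) can be sketched in numpy as below. This is an illustrative sketch under simplifying assumptions (the function name is invented, S_W is assumed nonsingular, and no regularization of tiny eigenvalues is performed), not the thesis implementation.

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """Whitened LDA transform (Eqs. 76-80): returns A_WLDA of shape (H, C-1)."""
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                   # within-class scatter (Eq. 76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)  # between-class (Eq. 77)
    # Whitening: Sw = Phi Lambda Phi^T, so W = Phi Lambda^{-1/2} (assumes Sw nonsingular).
    lam, Phi = np.linalg.eigh(Sw)
    W = Phi @ np.diag(1.0 / np.sqrt(lam))
    Sb_w = W.T @ Sb @ W                                 # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:num_classes - 1]]  # top C-1 eigenvectors
    return W @ Psi                                      # A_WLDA (Eq. 80)
```

Projecting training vectors with `X @ whitened_lda(X, y, C)` then reduces each H-dimensional vector to C−1 dimensions, as in Eq. (81).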

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_WLDA Let y denote the whitened LDA transformed feature vector In this study the nearest centroid classifier is used for music genre classification For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre The distance between two feature vectors is measured by the Euclidean distance Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c)   (83)
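The nearest centroid rule of Eqs. (82)-(83) can be sketched as follows; the function names and the tiny toy data are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def train_centroids(Y, labels, num_classes):
    """Per-genre centroid of transformed feature vectors (Eq. 82)."""
    return np.array([Y[labels == c].mean(axis=0) for c in range(num_classes)])

def classify(y, centroids):
    """Return the genre index with minimum Euclidean distance to y (Eq. 83)."""
    return int(np.argmin(np.linalg.norm(centroids - y, axis=1)))

# Toy example: two genres, two whitened-LDA dimensions.
Y = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labels = np.array([0, 0, 1, 1])
cents = train_centroids(Y, labels, 2)
print(classify(np.array([0.9, 1.2]), cents))  # -> 1
```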

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this

study each MP3 audio file is first converted into raw digital audio before

classification These music tracks are classified into six classes (that is C = 6)

Classical Electronic JazzBlue MetalPunk RockPop and World In summary the


music tracks used for trainingtesting include 320320 tracks of Classical 115114

tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102

tracks of RockPop and 122122 tracks of World music genre

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{1≤c≤C} P_c · CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre
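Eq. (84) is a prior-weighted average, which can be computed as below. The class counts are the test-split sizes given above; the per-class accuracies used here are hypothetical placeholders, not the thesis results.

```python
import numpy as np

# Class priors P_c from the 729-track test split of the ISMIR2004 set
# (Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, World).
counts = np.array([320, 114, 26, 45, 102, 122])
P = counts / counts.sum()

# Hypothetical per-class accuracies CA_c (fractions, for illustration only).
CA_c = np.array([0.94, 0.84, 0.81, 0.76, 0.78, 0.70])

# Eq. 84: overall accuracy weighted by class priors.
CA = float(np.sum(P * CA_c))
print(round(CA, 4))
```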

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1

denote respectively the row-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE From Table 31 we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,

and the combined feature vector performs the best Table 32 shows the corresponding

confusion matrices

Table 31 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector

Feature Set                    CA (%)
SMMFCC1                        77.50
SMOSC1                         79.15
SMASE1                         77.78
SMMFCC1+SMOSC1+SMASE1          84.64


Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix lists the number of classified tracks and the second the corresponding percentages.

(a) SMMFCC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        275        0        2       0         1       19
Electronic       0       91        0       1         7        6
Jazz             6        0       18       0         0        4
MetalPunk        2        3        0      36        20        4
PopRock          4       12        5       8        70       14
World           33        8        1       0         4       75
Total          320      114       26      45       102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94      0.00      7.69     0.00      0.98   15.57
Electronic    0.00     79.82      0.00     2.22      6.86    4.92
Jazz          1.88      0.00     69.23     0.00      0.00    3.28
MetalPunk     0.63      2.63      0.00    80.00     19.61    3.28
PopRock       1.25     10.53     19.23    17.78     68.63   11.48
World        10.31      7.02      3.85     0.00      3.92   61.48

(b) SMOSC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        292        1        1       0         2       10
Electronic       1       89        1       2        11       11
Jazz             4        0       19       1         1        6
MetalPunk        0        5        0      32        21        3
PopRock          0       13        3      10        61        8
World           23        6        2       0         6       84
Total          320      114       26      45       102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25      0.88      3.85     0.00      1.96    8.20
Electronic    0.31     78.07      3.85     4.44     10.78    9.02
Jazz          1.25      0.00     73.08     2.22      0.98    4.92
MetalPunk     0.00      4.39      0.00    71.11     20.59    2.46
PopRock       0.00     11.40     11.54    22.22     59.80    6.56
World         7.19      5.26      7.69     0.00      5.88   68.85

(c) SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        286        3        1       0         3       18
Electronic       0       87        1       1         9        5
Jazz             5        4       17       0         0        9
MetalPunk        0        4        1      36        18        4
PopRock          1       10        3       7        68       13
World           28        6        3       1         4       73
Total          320      114       26      45       102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38      2.63      3.85     0.00      2.94   14.75
Electronic    0.00     76.32      3.85     2.22      8.82    4.10
Jazz          1.56      3.51     65.38     0.00      0.00    7.38
MetalPunk     0.00      3.51      3.85    80.00     17.65    3.28
PopRock       0.31      8.77     11.54    15.56     66.67   10.66
World         8.75      5.26     11.54     2.22      3.92   59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        1       0         0        9
Electronic       0       96        1       1         9        9
Jazz             2        1       21       0         0        1
MetalPunk        0        1        0      34         8        1
PopRock          1        9        2       9        80       16
World           17        7        1       1         5       86
Total          320      114       26      45       102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      0.00      3.85     0.00      0.00    7.38
Electronic    0.00     84.21      3.85     2.22      8.82    7.38
Jazz          0.63      0.88     80.77     0.00      0.00    0.82
MetalPunk     0.00      0.88      0.00    75.56      7.84    0.82
PopRock       0.31      7.89      7.69    20.00     78.43   13.11
World         5.31      6.14      3.85     2.22      4.90   70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2

denote respectively the column-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE From Table 33 we can see

that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,

which differs from the row-based case As before, the combined feature

vector again gets the best performance Table 34 shows the corresponding confusion

matrices

Table 33 Averaged classification accuracy (CA) for each column-based modulation spectral feature vector

Feature Set                    CA (%)
SMMFCC2                        70.64
SMOSC2                         68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2          78.60

Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix lists the number of classified tracks and the second the corresponding percentages.

(a) SMMFCC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1        1       0         6       22
Electronic       0       84        0       2         8        4
Jazz            13        1       19       1         2       19
MetalPunk        2        7        0      39        30        4
PopRock          0       11        3       3        47       19
World           33       10        3       0         9       54
Total          320      114       26      45       102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.00      0.88      3.85     0.00      5.88   18.03
Electronic    0.00     73.68      0.00     4.44      7.84    3.28
Jazz          4.06      0.88     73.08     2.22      1.96   15.57
MetalPunk     0.63      6.14      0.00    86.67     29.41    3.28
PopRock       0.00      9.65     11.54     6.67     46.08   15.57
World        10.31      8.77     11.54     0.00      8.82   44.26

(b) SMOSC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2        0       0         3       33
Electronic       0       83        0       1         9        6
Jazz            17        1       20       0         6       20
MetalPunk        1        5        0      33        21        2
PopRock          0       17        4      10        51       10
World           40        6        2       1        12       51
Total          320      114       26      45       102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      81.88      1.75      0.00     0.00      2.94   27.05
Electronic    0.00     72.81      0.00     2.22      8.82    4.92
Jazz          5.31      0.88     76.92     0.00      5.88   16.39
MetalPunk     0.31      4.39      0.00    73.33     20.59    1.64
PopRock       0.00     14.91     15.38    22.22     50.00    8.20
World        12.50      5.26      7.69     2.22     11.76   41.80

(c) SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0        0       0         2       29
Electronic       0       83        0       1         5        2
Jazz             9        3       17       1         2       15
MetalPunk        1        5        1      35        24        7
PopRock          2       13        1       8        57       15
World           31       10        7       0        12       54
Total          320      114       26      45       102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      86.56      0.00      0.00     0.00      1.96   23.77
Electronic    0.00     72.81      0.00     2.22      4.90    1.64
Jazz          2.81      2.63     65.38     2.22      1.96   12.30
MetalPunk     0.31      4.39      3.85    77.78     23.53    5.74
PopRock       0.63     11.40      3.85    17.78     55.88   12.30
World         9.69      8.77     26.92     0.00     11.76   44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5        0       0         3       18
Electronic       0       89        0       2         4        4
Jazz             2        3       19       0         1       10
MetalPunk        2        2        0      38        21        2
PopRock          0       12        5       4        61       11
World           27        3        2       1        12       77
Total          320      114       26      45       102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      90.31      4.39      0.00     0.00      2.94   14.75
Electronic    0.00     78.07      0.00     4.44      3.92    3.28
Jazz          0.63      2.63     73.08     0.00      0.98    8.20
MetalPunk     0.63      1.75      0.00    84.44     20.59    1.64
PopRock       0.00     10.53     19.23     8.89     59.80    9.02
World         8.44      2.63      7.69     2.22     11.76   63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE Comparing this table with Table 31 and Table 33, we can see that each combined feature vector gets better classification performance than its individual row-based or column-based counterpart In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32% Table 36 shows the corresponding confusion matrices

Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix lists the number of classified tracks and the second the corresponding percentages.

(a) SMMFCC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        1       0         3       19
Electronic       0       86        0       1         7        5
Jazz             2        0       18       0         0        3
MetalPunk        1        4        0      35        18        2
PopRock          1       16        4       8        67       13
World           16        6        3       1         7       80
Total          320      114       26      45       102      122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      1.75      3.85     0.00      2.94   15.57
Electronic    0.00     75.44      0.00     2.22      6.86    4.10
Jazz          0.63      0.00     69.23     0.00      0.00    2.46
MetalPunk     0.31      3.51      0.00    77.78     17.65    1.64
PopRock       0.31     14.04     15.38    17.78     65.69   10.66
World         5.00      5.26     11.54     2.22      6.86   65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        0       0         1       13
Electronic       0       90        1       2         9        6
Jazz             0        0       21       0         0        4
MetalPunk        0        2        0      31        21        2
PopRock          0       11        3      10        64       10
World           20       11        1       2         7       87
Total          320      114       26      45       102      122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      0.00      0.00     0.00      0.98   10.66
Electronic    0.00     78.95      3.85     4.44      8.82    4.92
Jazz          0.00      0.00     80.77     0.00      0.00    3.28
MetalPunk     0.00      1.75      0.00    68.89     20.59    1.64
PopRock       0.00      9.65     11.54    22.22     62.75    8.20
World         6.25      9.65      3.85     4.44      6.86   71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2        1       0         0       17
Electronic       1       91        0       1         4        3
Jazz             0        2       19       0         0        5
MetalPunk        0        2        1      34        20        8
PopRock          2       13        4       8        71        8
World           21        4        1       2         7       81
Total          320      114       26      45       102      122

(c) SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      92.50      1.75      3.85     0.00      0.00   13.93
Electronic    0.31     79.82      0.00     2.22      3.92    2.46
Jazz          0.00      1.75     73.08     0.00      0.00    4.10
MetalPunk     0.00      1.75      3.85    75.56     19.61    6.56
PopRock       0.63     11.40     15.38    17.78     69.61    6.56
World         6.56      3.51      3.85     4.44      6.86   66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        0       0         0        8
Electronic       2       95        0       2         7        9
Jazz             1        1       20       0         0        0
MetalPunk        0        0        0      35        10        1
PopRock          1       10        3       7        79       11
World           16        6        3       1         6       93
Total          320      114       26      45       102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      1.75      0.00     0.00      0.00    6.56
Electronic    0.63     83.33      0.00     4.44      6.86    7.38
Jazz          0.31      0.88     76.92     0.00      0.00    0.00
MetalPunk     0.00      0.00      0.00    77.78      9.80    0.82
PopRock       0.31      8.77     11.54    15.56     77.45    9.02
World         5.00      5.26     11.54     2.22      5.88   76.23

Conventional methods use the energy of each modulation subband as the

feature value However we use the modulation spectral contrasts (MSCs) and

modulation spectral valleys (MSVs) computed from each modulation subband as

the feature value Table 37 shows the classification results of these two

approaches From Table 37 we can see that using MSCs and MSVs achieves

better performance than the conventional method when the row-based and

column-based modulation spectral feature vectors are combined In this table

SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based

column-based and combined feature vectors derived from modulation spectral

analysis of MFCC


Table 37 Comparison of the averaged classification accuracy of the MSC & MSV features and the modulation spectral energy (MSE) for each feature value

Feature Set                    MSCs & MSVs (%)   MSE (%)
SMMFCC1                            77.50          72.02
SMMFCC2                            70.64          69.82
SMMFCC3                            80.38          79.15
SMOSC1                             79.15          77.50
SMOSC2                             68.59          70.51
SMOSC3                             81.34          80.11
SMASE1                             77.78          76.41
SMASE2                             71.74          71.06
SMASE3                             81.21          79.15
SMMFCC1+SMOSC1+SMASE1              84.64          85.08
SMMFCC2+SMOSC2+SMASE2              78.60          79.01
SMMFCC3+SMOSC3+SMASE3              85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre

Classification Contest


References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech, and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.

[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, September 2003, pp. 8-11.

[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, March 2005, pp. 197-200.

[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performances using low-level audio features, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histograms in audio and symbolic music information retrieval, Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), July 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, May 2004, pp. V-665-668.

[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech, and Language Processing 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.


sub-genres contain Choir, Orchestra, Piano, and String Quartet In Jazz the

sub-genres contain BigBand, Cool, Fusion, Piano, Quartet, and Swing The

experiment result shows that GMM with three components achieves the best

classification accuracy

West and Cox [4] constructed a hierarchical frame-based music genre

classification system In their classification system, a majority vote is taken to decide

the final classification The genres adopted in their music classification system are

Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle They take MFCC

and OSC as features and compare the performance, with and without a decision tree,

of a Gaussian classifier, a GMM with three components, and LDA In their

experiment, the feature vector with the GMM classifier and decision tree has the

best accuracy of 82.79%

Xu et al [29] applied SVM to discriminate between pure music and vocal one

The SVM learning algorithm is applied to obtain the classification parameters

according to the calculated features It is demonstrated that SVM achieves better

performance than traditional Euclidean distance methods and hidden Markov model

(HMM) methods

Esmaili et al [30] use some low-level features (MFCC entropy centroid

bandwidth etc) and LDA for music genre classification In their system the

classification accuracy is 93.0% for the classification of five music genres: Rock,

Classical Folk Jazz and Pop

Bagci and Erzin [8] constructed a novel frame-based music genre classification

system In their classification system some invalid frames are first detected and

discarded for classification purpose To determine whether a frame is valid or not a

GMM model is constructed for each music genre These GMM models are then used


to sift the frames which are unable to be correctly classified and each GMM model of

a music genre is updated for each correctly classified frame Moreover a GMM

model is employed to represent the invalid frames In their experiment the feature

vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral

roll-off spectral flux and zero-crossing rate) as well as the first- and second-order

derivative of these timbral features Their musical genre dataset includes ten genre

types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock

The classification accuracy can reach up to 88.60% when the frame length is 30 s and each

GMM is modeled by 48 Gaussian distributions

Umapathy et al [31] used local discriminant bases (LDB) technique to measure

the dissimilarity of the LDB nodes of any two classes and extract features from these

high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition

to construct a five-level tree for a music signal Then two novel features the energy

distribution over frequencies (D1) and nonstationarity index (D2) are used to measure

the dissimilarity of the LDB nodes of any two classes In their classification system

the feature dimension is 30 including the energies and variances of the basis vector

coefficients of the first 15 high dissimilarity nodes The experiment results show that

when the LDB feature vector is combined with MFCC and by using LDA analysis

the average classification accuracy for the first level is 91% (artificial and natural

sounds), for the second level is 99% (instrumental and automobile, human and

nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and

helicopter male and female speech animals birds and insects)

Grimaldi et al [11 32] used a set of features based on discrete wavelet packet

transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a

well-known signal analysis methodology able to approximate a real signal at different


scales both in time and frequency domain Taking into account the non-stationary

nature of the input signal the DWT provides an approximation with excellent time

and frequency resolution DWPT is a variant of DWT which is achieved by recursively

convolving the input signal with a pair of low-pass and high-pass filters Unlike DWT,

which recursively decomposes only the low-pass subband, the DWPT decomposes both

bands at each level

Bergstra et al [33] used AdaBoost for music classification AdaBoost is an

ensemble (or meta-learning) method that constructs a classifier in an iterative fashion

[34] It was originally designed for binary classification and was later extended to

multiclass classification using several different strategies

13 Outline of Thesis

In Chapter 2 the proposed method for music genre classification will be

introduced In Chapter 3 some experiments will be presented to show the

effectiveness of the proposed method Finally conclusion will be given in Chapter 4

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases the

training phase and the classification phase The training phase is composed of two

main modules feature extraction and linear discriminant analysis (LDA) The

classification phase consists of three modules feature extraction LDA transformation

and classification The block diagram of the proposed music genre classification

system is the same as that shown in Fig 12 Each module is described in detail

below


21 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral

(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is

proposed for music genre classification

211 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to

represent the speech spectrum in a compact form In fact MFCC have been proven to

be very effective in automatic speech recognition and in modeling the subjective

frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from

an input signal The detailed steps will be given below

Step 1 Pre-emphasis

ŝ[n] = s[n] − a × s[n−1]   (1)

where s[n] is the current sample and s[n−1] is the previous sample; a typical

value for a is 0.95

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N

samples) Each pair of consecutive frames is overlapped M samples

Step 3 Windowing

Each frame is multiplied by a Hamming window

s̃_i[n] = ŝ_i[n] w[n],  0 ≤ n ≤ N−1   (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 − 0.46 cos(2πn / (N−1)),  0 ≤ n ≤ N−1   (3)


Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using FFT

X_i[k] = Σ_{n=0}^{N−1} s̃_i[n] e^{−j2πnk/N},  0 ≤ k ≤ N−1   (4)

where k is the frequency index

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set

of Mel-scale band-pass filters

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1   (5)

where B is the total number of filters (B is 25 in this study), and I_b^l and I_b^h

denote respectively the low-frequency index and high-frequency index of the

b-th band-pass filter A_i[k] is the squared amplitude of X_i[k], that is,

A_i[k] = |X_i[k]|²

I_b^l and I_b^h are given as

I_b^l = f_b^l / (f_s / N),  I_b^h = f_b^h / (f_s / N)   (6)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and

high frequency of the b-th band-pass filter as shown in Table 21

Step 6 Discrete cosine transform (DCT)

MFCC can be obtained by applying DCT on the logarithm of E(b)

MFCC_i(l) = Σ_{b=0}^{B−1} log10(1 + E_i(b)) cos(π l (b + 0.5) / B),  0 ≤ l < L   (7)

where L is the length of the MFCC feature vector (L is 20 in this study)


Therefore the MFCC feature vector can be represented as follows

x_MFCC = [MFCC(0), MFCC(1), …, MFCC(L−1)]^T   (8)
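Steps 3-6 above can be sketched compactly for a single pre-emphasized frame. This is an illustrative sketch, not the thesis code: the function name is invented, and the short band_edges list in the usage stands in for the 25 Mel-scale bands of Table 21.

```python
import numpy as np

def mfcc_frame(frame, band_edges, fs=22050, L=20):
    """MFCC of one pre-emphasized frame, following Eqs. (2)-(7).

    band_edges: list of (f_low, f_high) tuples in Hz, e.g. the bands of
    Table 21 (any band list works for illustration).
    """
    N = len(frame)
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window (Eq. 3)
    A = np.abs(np.fft.fft(frame * w)) ** 2             # squared spectrum (Eqs. 2, 4)
    B = len(band_edges)
    E = np.empty(B)
    for b, (fl, fh) in enumerate(band_edges):          # band energies (Eqs. 5-6)
        il = int(fl / (fs / N))                        # low-frequency bin index
        ih = int(fh / (fs / N))                        # high-frequency bin index
        E[b] = A[il:ih + 1].sum()
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    dct = np.cos(np.pi * l * (b + 0.5) / B)            # DCT basis (Eq. 7)
    return dct @ np.log10(1.0 + E)

# Usage on a 1 kHz sine frame, with a reduced band set for brevity.
frame = np.sin(2 * np.pi * 1000 * np.arange(512) / 22050)
bands = [(0, 200), (200, 400), (400, 800), (800, 1600), (1600, 3200), (3200, 6400)]
c = mfcc_frame(frame, bands, fs=22050, L=6)
print(c.shape)  # (6,)
```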

Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)

Table 21 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]

212 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal It

considers the spectral peak and valley in each subband independently In general

spectral peaks correspond to harmonic components and spectral valleys the

non-harmonic components or noise in music signals Therefore the difference

between spectral peaks and spectral valleys will reflect the spectral contrast

distribution Fig 22 shows the block diagram for extracting the OSC feature The

detailed steps will be described below


Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames, and FFT is then applied to obtain the corresponding spectrum of each frame

Step 2 Octave Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave

scale filters shown in Table 22 The octave scale filtering operation can be

described as follows

$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N-1$ (9)

where B is the number of subbands, $I_{b_l}$ and $I_{b_h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter, and $A_i[k] = |X_i[k]|^2$ is the squared amplitude of $X_i[k]$.

$I_{b_l}$ and $I_{b_h}$ are given as

$I_{b_l} = (f_{b_l}/f_s)N, \quad I_{b_h} = (f_{b_h}/f_s)N$ (10)

where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Peak Valley Selection

Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$Peak(b) = \log\!\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)$ (11)

$Valley(b) = \log\!\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)$ (12)

where α is a neighborhood factor (α is 02 in this study) The spectral

contrast is given by the difference between the spectral peak and the spectral

valley

$SC(b) = Peak(b) - Valley(b)$ (13)

The feature vector of an audio frame consists of the spectral contrasts and the

spectral valleys of all subbands Thus the OSC feature vector of an audio frame can

be represented as follows

xOSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T (14)
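A rough sketch of Eqs. 11-14 follows; it is hedged, not the thesis's exact code: `band_edges` is an assumed list of per-band FFT-bin ranges, and the α-neighborhood size is simply rounded.

```python
import numpy as np

def osc_frame(A, band_edges, alpha=0.2):
    """Sketch of per-subband spectral peak/valley/contrast (Eqs. 11-13)
    from a frame's squared-magnitude spectrum A."""
    valleys, contrasts = [], []
    for lo, hi in band_edges:
        M = np.sort(A[lo:hi + 1])[::-1]       # magnitudes, decreasing order
        aN = max(1, int(round(alpha * len(M))))  # neighborhood size alpha*Nb
        peak = np.log(1.0 + M[:aN].mean())    # Eq. 11
        valley = np.log(1.0 + M[-aN:].mean()) # Eq. 12
        valleys.append(valley)
        contrasts.append(peak - valley)       # Eq. 13
    return np.array(valleys + contrasts)      # Eq. 14 layout
```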

Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)

Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number  Frequency interval (Hz)
0   [0, 0]
1   (0, 100]
2   (100, 200]
3   (200, 400]
4   (400, 800]
5   (800, 1600]
6   (1600, 3200]
7   (3200, 6400]
8   (6400, 12800]
9   (12800, 22050)

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification The NASE descriptor

provides a representation of the power spectrum of each audio frame Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband Fig 23 shows the block diagram for extracting the

NASE feature. For a given music piece, the main steps for computing NASE are described as follows:

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames and each audio frame is multiplied by a Hamming window function

and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the FFT size. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$P(k) = \begin{cases} \dfrac{1}{E_w N}\,|X(k)|^2, & k = 0,\ k = N/2 \\ \dfrac{2}{E_w N}\,|X(k)|^2, & 0 < k < N/2 \end{cases}$ (15)

where Ew is the energy of the Hamming window function w(n) of size Nw

$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2$ (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge"), covering an 8-octave interval (see Fig 24). The NASE scale filtering operation can be described as follows (see Table 23):

$ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B,\ 0 \le k \le N-1$ (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$r = 2^{j}\ \text{octaves}, \quad -4 \le j \le 3$ (18)

$I_{b_l}$ and $I_{b_h}$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$I_{b_l} = (f_{b_l}/f_s)N, \quad I_{b_h} = (f_{b_h}/f_s)N$ (19)

where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power spectrum coefficients within this subband:

$ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1$ (20)

Each ASE coefficient is then converted to the decibel scale

$ASE_{dB}(b) = 10\log_{10}\big(ASE(b)\big), \quad 0 \le b \le B+1$ (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$NASE(b) = \dfrac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1$ (22)

where the RMS-norm gain value R is defined as

$R = \sqrt{\sum_{b=0}^{B+1} \big(ASE_{dB}(b)\big)^2}$ (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension

of NASE is B+3 Thus the NASE feature vector of an audio frame will be

represented as follows

xNASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T (24)
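The ASE → dB → RMS-normalization chain (Eqs. 20-24) can be sketched as follows; `band_edges` is an assumed list of (low_bin, high_bin) pairs covering the B+2 MPEG-7 bands, not part of the thesis text.

```python
import numpy as np

def nase_from_power_spectrum(P, band_edges):
    """Sketch of the NASE descriptor (Eqs. 20-24): per-band power sums,
    decibel conversion, and RMS-norm gain normalization."""
    ase = np.array([P[lo:hi + 1].sum() for lo, hi in band_edges])  # Eq. 20
    ase_db = 10.0 * np.log10(ase)                                  # Eq. 21
    R = np.sqrt(np.sum(ase_db ** 2))                               # Eq. 23
    nase = ase_db / R                                              # Eq. 22
    return np.concatenate(([R], nase))                             # Eq. 24 layout
```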

Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: loEdge = 62.5 Hz, hiEdge = 16 kHz, with subband edges at 62.5, 88.4, 125, 176.8, 250, 353.6, 500, 707.1, 1000, 1414.2, 2000, 2828.4, 4000, 5656.9, 8000, 11313.7 and 16000 Hz (1 coefficient below loEdge, 16 band coefficients, 1 coefficient above hiEdge)

Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 62.5]
1   (62.5, 88.4]
2   (88.4, 125]
3   (125, 176.8]
4   (176.8, 250]
5   (250, 353.6]
6   (353.6, 500]
7   (500, 707.1]
8   (707.1, 1000]
9   (1000, 1414.2]
10  (1414.2, 2000]
11  (2000, 2828.4]
12  (2828.4, 4000]
13  (4000, 5656.9]
14  (5656.9, 8000]
15  (8000, 11313.7]
16  (11313.7, 16000]
17  (16000, 22050]

214 Modulation Spectral Analysis

MFCC, OSC and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC and NASE to observe the variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $MFCC_i[l]$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W/2 + n}[l]\, e^{-j2\pi nm/W}, \quad 0 \le m < W,\ 0 \le l < L$ (25)

where Mt(m l) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and l is the MFCC coefficient index In

the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\ 0 \le l < L$ (26)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)$ (27)

$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)$ (28)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)$ (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $OSC_i[d]$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W/2 + n}[d]\, e^{-j2\pi nm/W}, \quad 0 \le m < W,\ 0 \le d < D$ (30)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the OSC coefficient index In the

study, W is 512, which is about 6 seconds, with 50% overlap between two

successive texture windows The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D$ (31)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

$MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)$ (32)

$MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)$ (33)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)$ (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $NASE_i[d]$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W/2 + n}[d]\, e^{-j2\pi nm/W}, \quad 0 \le m < W,\ 0 \le d < D$ (35)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the NASE coefficient index In

the study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D$ (36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands(See Table2

30

In the study the number of modulation subbands is 8 (J = 8) The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

$MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)$ (37)

$MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)$ (38)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)$ (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.

Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowing/averaging of the modulation spectra → contrast/valley determination → MASE)

Table 24 Frequency interval of each modulation subband

Filter number  Modulation frequency index range  Modulation frequency interval (Hz)
0   [0, 2)      [0, 0.33)
1   [2, 4)      [0.33, 0.66)
2   [4, 8)      [0.66, 1.32)
3   [8, 16)     [1.32, 2.64)
4   [16, 32)    [2.64, 5.28)
5   [32, 64)    [5.28, 10.56)
6   [64, 128)   [10.56, 21.12)
7   [128, 256)  [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$\mu^{MFCC}_{MSC\text{-}row}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l)$ (40)

$\sigma^{MFCC}_{MSC\text{-}row}(l) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\Big(MSC^{MFCC}(j, l) - \mu^{MFCC}_{MSC\text{-}row}(l)\Big)^2}$ (41)

$\mu^{MFCC}_{MSV\text{-}row}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l)$ (42)

$\sigma^{MFCC}_{MSV\text{-}row}(l) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\Big(MSV^{MFCC}(j, l) - \mu^{MFCC}_{MSV\text{-}row}(l)\Big)^2}$ (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$\mathbf{f}^{MFCC}_{row} = [\mu^{MFCC}_{MSC\text{-}row}(0), \sigma^{MFCC}_{MSC\text{-}row}(0), \mu^{MFCC}_{MSV\text{-}row}(0), \sigma^{MFCC}_{MSV\text{-}row}(0), \ldots, \mu^{MFCC}_{MSC\text{-}row}(L-1), \sigma^{MFCC}_{MSC\text{-}row}(L-1), \mu^{MFCC}_{MSV\text{-}row}(L-1), \sigma^{MFCC}_{MSV\text{-}row}(L-1)]^T$ (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu^{MFCC}_{MSC\text{-}col}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l)$ (45)

$\sigma^{MFCC}_{MSC\text{-}col}(j) = \sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\Big(MSC^{MFCC}(j, l) - \mu^{MFCC}_{MSC\text{-}col}(j)\Big)^2}$ (46)

$\mu^{MFCC}_{MSV\text{-}col}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l)$ (47)

$\sigma^{MFCC}_{MSV\text{-}col}(j) = \sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\Big(MSV^{MFCC}(j, l) - \mu^{MFCC}_{MSV\text{-}col}(j)\Big)^2}$ (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$\mathbf{f}^{MFCC}_{col} = [\mu^{MFCC}_{MSC\text{-}col}(0), \sigma^{MFCC}_{MSC\text{-}col}(0), \mu^{MFCC}_{MSV\text{-}col}(0), \sigma^{MFCC}_{MSV\text{-}col}(0), \ldots, \mu^{MFCC}_{MSC\text{-}col}(J-1), \sigma^{MFCC}_{MSC\text{-}col}(J-1), \mu^{MFCC}_{MSV\text{-}col}(J-1), \sigma^{MFCC}_{MSV\text{-}col}(J-1)]^T$ (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

$\mathbf{f}^{MFCC} = [(\mathbf{f}^{MFCC}_{row})^T, (\mathbf{f}^{MFCC}_{col})^T]^T$ (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
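The row/column aggregation (Eqs. 40-50) can be sketched compactly as below. Note that the ordering of the concatenated statistics is simplified relative to Eqs. 44 and 49, which interleave the mean/deviation values per index.

```python
import numpy as np

def aggregate(MSC, MSV):
    """Sketch of Eqs. 40-50: mean and standard deviation along each
    row and each column of the J x L MSC and MSV matrices, producing
    a 4L + 4J feature vector."""
    feats = []
    for M in (MSC, MSV):               # row-based: statistics per feature dim
        feats += [M.mean(axis=0), M.std(axis=0)]
    for M in (MSC, MSV):               # column-based: statistics per subband
        feats += [M.mean(axis=1), M.std(axis=1)]
    return np.concatenate(feats)
```

With J = 8 and L = 20 (the MMFCC case), the result has 4×20 + 4×8 = 112 entries, matching the SMMFCC dimension above.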

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$\mu^{OSC}_{MSC\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d)$ (51)

$\sigma^{OSC}_{MSC\text{-}row}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\Big(MSC^{OSC}(j, d) - \mu^{OSC}_{MSC\text{-}row}(d)\Big)^2}$ (52)

$\mu^{OSC}_{MSV\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d)$ (53)

$\sigma^{OSC}_{MSV\text{-}row}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\Big(MSV^{OSC}(j, d) - \mu^{OSC}_{MSV\text{-}row}(d)\Big)^2}$ (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$\mathbf{f}^{OSC}_{row} = [\mu^{OSC}_{MSC\text{-}row}(0), \sigma^{OSC}_{MSC\text{-}row}(0), \mu^{OSC}_{MSV\text{-}row}(0), \sigma^{OSC}_{MSV\text{-}row}(0), \ldots, \mu^{OSC}_{MSC\text{-}row}(D-1), \sigma^{OSC}_{MSC\text{-}row}(D-1), \mu^{OSC}_{MSV\text{-}row}(D-1), \sigma^{OSC}_{MSV\text{-}row}(D-1)]^T$ (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu^{OSC}_{MSC\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d)$ (56)

$\sigma^{OSC}_{MSC\text{-}col}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\Big(MSC^{OSC}(j, d) - \mu^{OSC}_{MSC\text{-}col}(j)\Big)^2}$ (57)

$\mu^{OSC}_{MSV\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d)$ (58)

$\sigma^{OSC}_{MSV\text{-}col}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\Big(MSV^{OSC}(j, d) - \mu^{OSC}_{MSV\text{-}col}(j)\Big)^2}$ (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$\mathbf{f}^{OSC}_{col} = [\mu^{OSC}_{MSC\text{-}col}(0), \sigma^{OSC}_{MSC\text{-}col}(0), \mu^{OSC}_{MSV\text{-}col}(0), \sigma^{OSC}_{MSV\text{-}col}(0), \ldots, \mu^{OSC}_{MSC\text{-}col}(J-1), \sigma^{OSC}_{MSC\text{-}col}(J-1), \mu^{OSC}_{MSV\text{-}col}(J-1), \sigma^{OSC}_{MSV\text{-}col}(J-1)]^T$ (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$\mathbf{f}^{OSC} = [(\mathbf{f}^{OSC}_{row})^T, (\mathbf{f}^{OSC}_{col})^T]^T$ (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$\mu^{NASE}_{MSC\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d)$ (62)

$\sigma^{NASE}_{MSC\text{-}row}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\Big(MSC^{NASE}(j, d) - \mu^{NASE}_{MSC\text{-}row}(d)\Big)^2}$ (63)

$\mu^{NASE}_{MSV\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d)$ (64)

$\sigma^{NASE}_{MSV\text{-}row}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\Big(MSV^{NASE}(j, d) - \mu^{NASE}_{MSV\text{-}row}(d)\Big)^2}$ (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$\mathbf{f}^{NASE}_{row} = [\mu^{NASE}_{MSC\text{-}row}(0), \sigma^{NASE}_{MSC\text{-}row}(0), \mu^{NASE}_{MSV\text{-}row}(0), \sigma^{NASE}_{MSV\text{-}row}(0), \ldots, \mu^{NASE}_{MSC\text{-}row}(D-1), \sigma^{NASE}_{MSC\text{-}row}(D-1), \mu^{NASE}_{MSV\text{-}row}(D-1), \sigma^{NASE}_{MSV\text{-}row}(D-1)]^T$ (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu^{NASE}_{MSC\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d)$ (67)

$\sigma^{NASE}_{MSC\text{-}col}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\Big(MSC^{NASE}(j, d) - \mu^{NASE}_{MSC\text{-}col}(j)\Big)^2}$ (68)

$\mu^{NASE}_{MSV\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d)$ (69)

$\sigma^{NASE}_{MSV\text{-}col}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\Big(MSV^{NASE}(j, d) - \mu^{NASE}_{MSV\text{-}col}(j)\Big)^2}$ (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$\mathbf{f}^{NASE}_{col} = [\mu^{NASE}_{MSC\text{-}col}(0), \sigma^{NASE}_{MSC\text{-}col}(0), \mu^{NASE}_{MSV\text{-}col}(0), \sigma^{NASE}_{MSV\text{-}col}(0), \ldots, \mu^{NASE}_{MSC\text{-}col}(J-1), \sigma^{NASE}_{MSC\text{-}col}(J-1), \mu^{NASE}_{MSV\text{-}col}(J-1), \sigma^{NASE}_{MSV\text{-}col}(J-1)]^T$ (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$\mathbf{f}^{NASE} = [(\mathbf{f}^{NASE}_{row})^T, (\mathbf{f}^{NASE}_{col})^T]^T$ (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig 28 The row-based modulation spectral feature values: for each feature dimension, the mean μ_row and standard deviation σ_row of the MSC and MSV entries are taken across the modulation-frequency axis of the matrix.

Fig 29 The column-based modulation spectral feature values: for each modulation subband, the mean μ_col and standard deviation σ_col of the MSC and MSV entries are taken across the feature-dimension axis of the matrix.

216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{f}_{c,n}$ (73)

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector $\hat{\mathbf{f}}_c$:

$\hat{f}_c(m) = \dfrac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C$ (74)

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \quad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)$ (75)

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T$ (76)

where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T$ (77)

where x is the mean vector of all training vectors The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion JF defined as the ratio of between-class scatter to within-class scatter

$J_F(\mathbf{A}) = \mathrm{tr}\big((\mathbf{A}^T \mathbf{S}_W \mathbf{A})^{-1} (\mathbf{A}^T \mathbf{S}_B \mathbf{A})\big)$ (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $\mathbf{S}_W$ are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of $\mathbf{S}_W$, and Λ the diagonal matrix formed by the

corresponding eigenvalues. Thus $\mathbf{S}_W\Phi = \Phi\Lambda$. Each training vector x is then whitening-transformed by $\Phi\Lambda^{-1/2}$:

$\mathbf{x}_w = (\Phi\Lambda^{-1/2})^T \mathbf{x}$ (79)

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_W^w = (\Phi\Lambda^{-1/2})^T \mathbf{S}_W (\Phi\Lambda^{-1/2})$, derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix $\mathbf{S}_B^w = (\Phi\Lambda^{-1/2})^T \mathbf{S}_B (\Phi\Lambda^{-1/2})$ contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of $\mathbf{S}_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$\mathbf{A}_{WLDA} = \Phi\Lambda^{-1/2}\Psi$ (80)

$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$\mathbf{y} = \mathbf{A}_{WLDA}^T \mathbf{x}$ (81)
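The whitened LDA procedure of Eqs. 76-81 can be sketched as follows. This assumes $\mathbf{S}_W$ is nonsingular (the thesis does not address the singular case, which would require regularization):

```python
import numpy as np

def whitened_lda(X, y):
    """Sketch of Eqs. 76-81: whiten with the within-class scatter S_W,
    then take the top (C-1) eigenvectors of the whitened between-class
    scatter S_B; returns the H x (C-1) matrix A_WLDA."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        Sw += d.T @ d                                           # Eq. 76
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)  # Eq. 77
    lam, Phi = np.linalg.eigh(Sw)              # S_W = Phi Lam Phi^T
    Wh = Phi @ np.diag(1.0 / np.sqrt(lam))     # whitening matrix Phi Lam^{-1/2}
    Sb_w = Wh.T @ Sb @ Wh                      # whitened between-class scatter
    _, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, ::-1][:, :len(classes) - 1]   # top C-1 eigenvectors
    return Wh @ Psi                            # Eq. 80: A_WLDA
```

Applying `X @ whitened_lda(X, y)` then realizes Eq. 81 for the training vectors.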

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for

music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{y}_{c,n}$ (82)

where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to y:

$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c)$ (83)

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this

study each MP3 audio file is first converted into raw digital audio before

classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

$CA = \sum_{1 \le c \le C} P_c \cdot CA_c$ (84)

where $P_c$ is the probability of appearance of the c-th music genre, and $CA_c$ is the classification accuracy for the c-th music genre.
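Eq. 84 is simply a test-set-proportion weighted average of the per-class accuracies; for example:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Sketch of Eq. 84: per-class accuracies CA_c weighted by each
    genre's share P_c of the test set."""
    total = sum(class_counts)
    return sum(a * n / total for a, n in zip(per_class_acc, class_counts))
```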

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector. In this table, SMMFCC1, SMOSC1 and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector

Feature Set                    CA (%)
SMMFCC1                        77.50
SMOSC1                         79.15
SMASE1                         77.78
SMMFCC1+SMOSC1+SMASE1          84.64


Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Rows are the classified genre and columns the actual genre; for each feature set, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic         275           0     2           0        1     19
Electronic        0          91     0           1        7      6
Jazz              6           0    18           0        0      4
MetalPunk         2           3     0          36       20      4
PopRock           4          12     5           8       70     14
World            33           8     1           0        4     75
Total           320         114    26          45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic         292           1     1           0        2     10
Electronic        1          89     1           2       11     11
Jazz              4           0    19           1        1      6
MetalPunk         0           5     0          32       21      3
PopRock           0          13     3          10       61      8
World            23           6     2           0        6     84
Total           320         114    26          45      102    122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic         286           3     1           0        3     18
Electronic        0          87     1           1        9      5
Jazz              5           4    17           0        0      9
MetalPunk         0           4     1          36       18      4
PopRock           1          10     3           7       68     13
World            28           6     3           1        4     73
Total           320         114    26          45      102    122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic         300           0     1           0        0      9
Electronic        0          96     1           1        9      9
Jazz              2           1    21           0        0      1
MetalPunk         0           1     0          34        8      1
PopRock           1           9     2           9       80     16
World            17           7     1           1        5     86
Total           320         114    26          45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49


32 Comparison of column-based modulation spectral feature vector

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based result. As in the row-based case, the combined feature vector again gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1        1       0          6      22
Electronic       0       84        0       2          8       4
Jazz            13        1       19       1          2      19
MetalPunk        2        7        0      39         30       4
PopRock          0       11        3       3         47      19
World           33       10        3       0          9      54
Total          320      114       26      45        102     122

(a) SMMFCC2 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       85.00      0.88     3.85     0.00      5.88   18.03
Electronic     0.00     73.68     0.00     4.44      7.84    3.28
Jazz           4.06      0.88    73.08     2.22      1.96   15.57
MetalPunk      0.63      6.14     0.00    86.67     29.41    3.28
PopRock        0.00      9.65    11.54     6.67     46.08   15.57
World         10.31      8.77    11.54     0.00      8.82   44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2        0       0          3      33
Electronic       0       83        0       1          9       6
Jazz            17        1       20       0          6      20
MetalPunk        1        5        0      33         21       2
PopRock          0       17        4      10         51      10
World           40        6        2       1         12      51
Total          320      114       26      45        102     122

(b) SMOSC2 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       81.88      1.75     0.00     0.00      2.94   27.05
Electronic     0.00     72.81     0.00     2.22      8.82    4.92
Jazz           5.31      0.88    76.92     0.00      5.88   16.39
MetalPunk      0.31      4.39     0.00    73.33     20.59    1.64
PopRock        0.00     14.91    15.38    22.22     50.00    8.20
World         12.50      5.26     7.69     2.22     11.76   41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0        0       0          2      29
Electronic       0       83        0       1          5       2
Jazz             9        3       17       1          2      15
MetalPunk        1        5        1      35         24       7
PopRock          2       13        1       8         57      15
World           31       10        7       0         12      54
Total          320      114       26      45        102     122

(c) SMASE2 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       86.56      0.00     0.00     0.00      1.96   23.77
Electronic     0.00     72.81     0.00     2.22      4.90    1.64
Jazz           2.81      2.63    65.38     2.22      1.96   12.30
MetalPunk      0.31      4.39     3.85    77.78     23.53    5.74
PopRock        0.63     11.40     3.85    17.78     55.88   12.30
World          9.69      8.77    26.92     0.00     11.76   44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5        0       0          3      18
Electronic       0       89        0       2          4       4
Jazz             2        3       19       0          1      10
MetalPunk        2        2        0      38         21       2
PopRock          0       12        5       4         61      11
World           27        3        2       1         12      77
Total          320      114       26      45        102     122

(d) SMMFCC2+SMOSC2+SMASE2 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       90.31      4.39     0.00     0.00      2.94   14.75
Electronic     0.00     78.07     0.00     4.44      3.92    3.28
Jazz           0.63      2.63    73.08     0.00      0.98    8.20
MetalPunk      0.63      1.75     0.00    84.44     20.59    1.64
PopRock        0.00     10.53    19.23     8.89     59.80    9.02
World          8.44      2.63     7.69     2.22     11.76   63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        1       0          3      19
Electronic       0       86        0       1          7       5
Jazz             2        0       18       0          0       3
MetalPunk        1        4        0      35         18       2
PopRock          1       16        4       8         67      13
World           16        6        3       1          7      80
Total          320      114       26      45        102     122

(a) SMMFCC3 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75      1.75     3.85     0.00      2.94   15.57
Electronic     0.00     75.44     0.00     2.22      6.86    4.10
Jazz           0.63      0.00    69.23     0.00      0.00    2.46
MetalPunk      0.31      3.51     0.00    77.78     17.65    1.64
PopRock        0.31     14.04    15.38    17.78     65.69   10.66
World          5.00      5.26    11.54     2.22      6.86   65.57


(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        0       0          1      13
Electronic       0       90        1       2          9       6
Jazz             0        0       21       0          0       4
MetalPunk        0        2        0      31         21       2
PopRock          0       11        3      10         64      10
World           20       11        1       2          7      87
Total          320      114       26      45        102     122

(b) SMOSC3 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75      0.00     0.00     0.00      0.98   10.66
Electronic     0.00     78.95     3.85     4.44      8.82    4.92
Jazz           0.00      0.00    80.77     0.00      0.00    3.28
MetalPunk      0.00      1.75     0.00    68.89     20.59    1.64
PopRock        0.00      9.65    11.54    22.22     62.75    8.20
World          6.25      9.65     3.85     4.44      6.86   71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2        1       0          0      17
Electronic       1       91        0       1          4       3
Jazz             0        2       19       0          0       5
MetalPunk        0        2        1      34         20       8
PopRock          2       13        4       8         71       8
World           21        4        1       2          7      81
Total          320      114       26      45        102     122

(c) SMASE3 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       92.50      1.75     3.85     0.00      0.00   13.93
Electronic     0.31     79.82     0.00     2.22      3.92    2.46
Jazz           0.00      1.75    73.08     0.00      0.00    4.10
MetalPunk      0.00      1.75     3.85    75.56     19.61    6.56
PopRock        0.63     11.40    15.38    17.78     69.61    6.56
World          6.56      3.51     3.85     4.44      6.86   66.39


(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        0       0          0       8
Electronic       2       95        0       2          7       9
Jazz             1        1       20       0          0       0
MetalPunk        0        0        0      35         10       1
PopRock          1       10        3       7         79      11
World           16        6        3       1          6      93
Total          320      114       26      45        102     122

(d) SMMFCC3+SMOSC3+SMASE3 (classification rates, %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75      1.75     0.00     0.00      0.00    6.56
Electronic     0.63     83.33     0.00     4.44      6.86    7.38
Jazz           0.31      0.88    76.92     0.00      0.00    0.00
MetalPunk      0.00      0.00     0.00    77.78      9.80    0.82
PopRock        0.31      8.77    11.54    15.56     77.45    9.02
World          5.00      5.26    11.54     2.22      5.88   76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) features

Feature Set               MSCs & MSVs    MSE
SMMFCC1                      77.50      72.02
SMMFCC2                      70.64      69.82
SMMFCC3                      80.38      79.15
SMOSC1                       79.15      77.50
SMOSC2                       68.59      70.51
SMOSC3                       81.34      80.11
SMASE1                       77.78      76.41
SMASE2                       71.74      71.06
SMASE3                       81.21      79.15
SMMFCC1+SMOSC1+SMASE1        84.64      85.08
SMMFCC2+SMOSC2+SMASE2        78.60      79.01
SMMFCC3+SMOSC3+SMASE3        85.32      85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR 2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR 2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio features," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.



to sift out the frames that are unable to be correctly classified, and each GMM model of a music genre is updated for each correctly classified frame. Moreover, a GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy can reach 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.

Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and extract features from these high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel features, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile; human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).

Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. WPT is a variant of DWT, achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT, which recursively decomposes only the low-pass subband, the DWPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced. In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, conclusions will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module will be given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps will be given below.

Step 1: Pre-emphasis

$\hat{s}[n] = s[n] - a \cdot s[n-1]$  (1)

where s[n] is the current sample, s[n−1] is the previous sample, and a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames is overlapped by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

$\tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1$  (2)

where the Hamming window function w[n] is defined as

$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$  (3)

Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1$  (4)

where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2-1$  (5)

where B is the total number of filters (B is 25 in this study), and $I_b^l$ and $I_b^h$ denote respectively the low-frequency index and the high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_b^l$ and $I_b^h$ are given as

$I_b^l = \frac{f_b^l}{f_s / N}, \quad I_b^h = \frac{f_b^h}{f_s / N}$  (6)

where $f_s$ is the sampling frequency, and $f_b^l$ and $f_b^h$ are the low frequency and the high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying DCT on the logarithm of E(b):

$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\bigl(1 + E_i(b)\bigr)\cos\left(\frac{\pi l}{B}(b + 0.5)\right), \quad 0 \le l < L$  (7)

where L is the length of the MFCC feature vector (L is 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

$\mathbf{x}^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T$  (8)
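The six steps can be sketched end to end in Python. This is an illustrative simplification, not the implementation used in the experiments: it processes a single frame, and a crude linearly spaced filter bank stands in for the Mel-scale bank of Table 2.1.

```python
import numpy as np

def mfcc_frame(s, a=0.95, B=25, L=20):
    """Compute an MFCC vector for one frame, following Steps 1-6."""
    # Step 1: pre-emphasis  s^[n] = s[n] - a * s[n-1]
    s_hat = np.append(s[0], s[1:] - a * s[:-1])
    N = len(s_hat)
    # Step 3: Hamming window (Eq. 3)
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    # Step 4: FFT and squared amplitude A[k] = |X[k]|^2
    A = np.abs(np.fft.fft(s_hat * w)) ** 2
    # Step 5: band energies E(b); a crude linearly spaced bank is used
    # here instead of the Mel-scale filters of Table 2.1 (assumption)
    edges = np.linspace(0, N // 2, B + 1, dtype=int)
    E = np.array([A[edges[b]:edges[b + 1]].sum() for b in range(B)])
    # Step 6: DCT of the log band energies (Eq. 7)
    b = np.arange(B)
    return np.array([np.sum(np.log10(1 + E) * np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 22050)  # 440 Hz test tone
x_mfcc = mfcc_frame(frame)
print(x_mfcc.shape)  # (20,)
```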

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps will be described below.


Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2-1$  (9)

where B is the number of subbands, and $I_b^l$ and $I_b^h$ denote respectively the low-frequency index and the high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_b^l$ and $I_b^h$ are given as

$I_b^l = \frac{f_b^l}{f_s / N}, \quad I_b^h = \frac{f_b^h}{f_s / N}$  (10)

where $f_s$ is the sampling frequency, and $f_b^l$ and $f_b^h$ are the low frequency and the high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$Peak(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)$  (11)

$Valley(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)$  (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$SC(b) = Peak(b) - Valley(b)$  (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$\mathbf{x}^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T$  (14)
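Eqs. (11)-(13) average the α·N_b largest and α·N_b smallest magnitudes of each subband before taking logarithms. A minimal sketch for a single subband, assuming the subband magnitudes are already available:

```python
import numpy as np

def osc_subband(mag, alpha=0.2):
    """Spectral peak, valley and contrast for one subband (Eqs. 11-13)."""
    m = np.sort(mag)[::-1]        # decreasing order: M_b1 >= M_b2 >= ...
    nb = len(m)
    k = max(1, int(alpha * nb))   # number of bins averaged, alpha * Nb
    peak = np.log((1.0 / k) * m[:k].sum())     # Eq. (11): largest bins
    valley = np.log((1.0 / k) * m[-k:].sum())  # Eq. (12): smallest bins
    return peak, valley, peak - valley         # Eq. (13): contrast

# Hypothetical subband magnitudes with a clear peak/valley structure
mag = np.array([10.0, 1.0, 8.0, 0.5, 2.0, 0.8, 6.0, 1.5, 0.7, 3.0])
peak, valley, sc = osc_subband(mag)
print(sc > 0)  # True: peaks exceed valleys
```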

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$P(k) = \begin{cases} \dfrac{1}{N \cdot E_w}\,|X(k)|^2, & k = 0,\ k = \dfrac{N}{2} \\[2mm] \dfrac{2}{N \cdot E_w}\,|X(k)|^2, & 0 < k < \dfrac{N}{2} \end{cases}$  (15)

where $E_w$ is the energy of the Hamming window function w(n) of size $N_w$:

$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2$  (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The subband filtering operation can be described as follows (see Table 2.3):

$ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k), \quad 0 \le b < B, \; 0 \le k \le N/2-1$  (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$r = 2^j \text{ octaves}, \quad -4 \le j \le 3$  (18)

$I_b^l$ and $I_b^h$ are the low-frequency index and the high-frequency index of the b-th band-pass filter, given as

$I_b^l = \frac{f_b^l}{f_s / N}, \quad I_b^h = \frac{f_b^h}{f_s / N}$  (19)

where $f_s$ is the sampling frequency, and $f_b^l$ and $f_b^h$ are the low frequency and the high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k), \quad 0 \le b \le B+1$  (20)

Each ASE coefficient is then converted to the decibel scale:

$ASE_{dB}(b) = 10\log_{10}\bigl(ASE(b)\bigr), \quad 0 \le b \le B+1$  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1$  (22)

where the RMS-norm gain value R is defined as

$R = \sqrt{\sum_{b=0}^{B+1}\bigl(ASE_{dB}(b)\bigr)^2}$  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame can be represented as follows:

$\mathbf{x}^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T$  (24)
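The normalization chain of Eqs. (21)-(24) can be sketched as follows. The ASE values here are hypothetical, and the subband summation of Eq. (20) is assumed to have been done already:

```python
import numpy as np

def nase_from_ase(ase):
    """Normalize ASE coefficients per Eqs. (21)-(24)."""
    ase_db = 10.0 * np.log10(ase)       # Eq. (21): decibel scale
    r = np.sqrt(np.sum(ase_db ** 2))    # Eq. (23): RMS-norm gain value R
    nase = ase_db / r                   # Eq. (22): normalization by R
    return np.concatenate(([r], nase))  # Eq. (24): [R, NASE(0), ...]^T

ase = np.array([0.5, 2.0, 8.0, 4.0, 1.0, 0.25])  # hypothetical ASE values
x_nase = nase_from_ase(ase)
print(len(x_nase))  # 7: the gain value R plus one NASE coefficient per band
```

By construction, the NASE part of the vector has unit RMS norm; the gain value R carries the absolute level.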


Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: one coefficient covers power below loEdge (62.5 Hz), 16 coefficients cover the logarithmically spaced bands between loEdge and hiEdge (16 kHz), and one coefficient covers power above hiEdge


Table 2.3 The range of each NASE band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps will be described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $MFCC_i[l]$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times (W/2) + n}[l]\, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le l < L$  (25)

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \; 0 \le l < L$  (26)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h}\bigl(\bar{M}^{MFCC}(m, l)\bigr)$  (27)

$MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h}\bigl(\bar{M}^{MFCC}(m, l)\bigr)$  (28)

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)$  (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
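Steps 2 and 3 can be sketched for a single feature trajectory as follows. The texture-window length and the modulation-subband edges used here are illustrative stand-ins (the study uses W = 512 and the subbands of Table 2.4):

```python
import numpy as np

def modulation_msc(traj, W=64, hop=32, edges=(0, 1, 2, 4, 8, 16, 32)):
    """Modulation spectral peaks/valleys/contrasts of one feature trajectory.

    traj: one feature value over successive frames (e.g. MFCC_i[l] over i).
    edges: modulation-subband boundaries (hypothetical, not Table 2.4).
    """
    # Step 2: FFT over each texture window of length W, 50% overlap (Eq. 25),
    # then time-average the magnitude spectra over all windows (Eq. 26)
    starts = range(0, len(traj) - W + 1, hop)
    mags = [np.abs(np.fft.fft(traj[s:s + W])) for s in starts]
    m_avg = np.mean(mags, axis=0)
    # Step 3: peak, valley and contrast per modulation subband (Eqs. 27-29)
    msp = np.array([m_avg[lo:hi].max() for lo, hi in zip(edges[:-1], edges[1:])])
    msv = np.array([m_avg[lo:hi].min() for lo, hi in zip(edges[:-1], edges[1:])])
    return msp, msv, msp - msv

# Synthetic trajectory with a strong modulation at 4 cycles per window
t = np.arange(256)
traj = 1.0 + np.sin(2 * np.pi * 4 * t / 64)
msp, msv, msc = modulation_msc(traj)
print(msc.shape)  # (6,): one contrast value per modulation subband
```

The rhythmic component at modulation bin 4 produces a large contrast in the subband that contains it, while subbands without rhythmic energy show near-zero contrast.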

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps will be described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $OSC_i[d]$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times (W/2) + n}[d]\, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D$  (30)

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D$  (31)

where T is the total number of texture windows in the music track.

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

28

( ))(max)(

dmMdjMSP OSC

ΦmΦ

OSC

hjlj ltle= (32)

( ))(min)(

dmMdjMSV OSC

ΦmΦ

OSC

hjlj ltle= (33)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \tag{34}$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.
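Step 3 can be sketched as follows, using the octave-spaced modulation frequency index ranges of Table 24. The function name and array layout are our assumptions; only indices below W/2 are distinct for a real-valued input, which matches the [0, 256) index range for W = 512:

```python
import numpy as np

# Modulation frequency index ranges of the 8 octave-spaced subbands (Table 24)
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def msc_msv(M_avg):
    """Per-subband modulation spectral peak/valley/contrast, Eqs. (32)-(34).

    M_avg : (W, D) averaged modulation spectrogram.
    Returns (MSC, MSV), each a (J, D) matrix with J = 8.
    """
    msc, msv = [], []
    for lo, hi in SUBBANDS:
        band = M_avg[lo:hi]            # rows Phi_{j,l} <= m < Phi_{j,h}
        peak = band.max(axis=0)        # Eq. (32): MSP per feature dimension
        valley = band.min(axis=0)      # Eq. (33): MSV per feature dimension
        msc.append(peak - valley)      # Eq. (34): contrast
        msv.append(valley)
    return np.array(msc), np.array(msv)

MSC, MSV = msc_msv(np.abs(np.random.randn(512, 20)))
print(MSC.shape, MSV.shape)  # (8, 20) (8, 20)
```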

Fig 26 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let $NASE_i[d]$, $0 \le d < D$, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \left|\sum_{n=0}^{W-1} NASE_{t\cdot W+n}[d]\, e^{-j2\pi mn/W}\right|, \quad 0 \le m < W,\ 0 \le d < D \tag{35}$$

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W,\ 0 \le d < D \tag{36}$$

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \tag{37}$$

$$MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \tag{38}$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \tag{39}$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig 27 The flowchart for extracting MASE (framing, NASE extraction, DFT of each feature trajectory, windowing/averaging of the modulation spectrum, and contrast/valley determination)

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
      0                  [0, 2)                          [0, 0.33)
      1                  [2, 4)                          [0.33, 0.66)
      2                  [4, 8)                          [0.66, 1.32)
      3                  [8, 16)                         [1.32, 2.64)
      4                  [16, 32)                        [2.64, 5.28)
      5                  [32, 64)                        [5.28, 10.56)
      6                  [64, 128)                       [10.56, 21.12)
      7                  [128, 256)                      [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at various modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th ($0 \le l < L$) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \tag{40}$$

$$\sigma_{MSC\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l)\bigr)^2\right)^{1/2} \tag{41}$$

$$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \tag{42}$$

$$\sigma_{MSV\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l)\bigr)^2\right)^{1/2} \tag{43}$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$f_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), \mu_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), \ldots, \mu_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), \mu_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T \tag{44}$$

Similarly, the modulation spectral feature values derived from the j-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \tag{45}$$

$$\sigma_{MSC\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j)\bigr)^2\right)^{1/2} \tag{46}$$

$$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \tag{47}$$

$$\sigma_{MSV\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j)\bigr)^2\right)^{1/2} \tag{48}$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), \mu_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), \ldots, \mu_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), \mu_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T \tag{49}$$

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) can be obtained:

$$f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T \tag{50}$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
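A minimal sketch of this aggregation (Eqs. (40)-(50)); note that the element ordering here is simplified (all means, then all standard deviations) rather than interleaved per dimension as in Eq. (44):

```python
import numpy as np

def aggregate(MSC, MSV):
    """Row/column statistics of the (J, L) MSC and MSV matrices.

    Row-based: mean and std over the J subbands for each of the L feature
    dimensions (4L values).  Column-based: mean and std over the L feature
    dimensions for each subband (4J values).  Returns the combined vector
    of length 4L + 4J, cf. Eq. (50).
    """
    def stats(M, axis):
        return M.mean(axis=axis), M.std(axis=axis)
    row = np.concatenate(stats(MSC, 0) + stats(MSV, 0))   # 4L values
    col = np.concatenate(stats(MSC, 1) + stats(MSV, 1))   # 4J values
    return np.concatenate([row, col])

# J = 8 subbands, L = 20 MFCC dimensions -> 4*20 + 4*8 = 112 values
f = aggregate(np.random.rand(8, 20), np.random.rand(8, 20))
print(f.shape)  # (112,)
```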

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th ($0 \le d < D$) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \tag{51}$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d)\bigr)^2\right)^{1/2} \tag{52}$$

$$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \tag{53}$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d)\bigr)^2\right)^{1/2} \tag{54}$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$f_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), \mu_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), \ldots, \mu_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), \mu_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T \tag{55}$$

Similarly, the modulation spectral feature values derived from the j-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \tag{56}$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j)\bigr)^2\right)^{1/2} \tag{57}$$

$$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \tag{58}$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j)\bigr)^2\right)^{1/2} \tag{59}$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), \mu_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), \ldots, \mu_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), \mu_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T \tag{60}$$

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T \tag{61}$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th ($0 \le d < D$) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \tag{62}$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d)\bigr)^2\right)^{1/2} \tag{63}$$

$$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \tag{64}$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d)\bigr)^2\right)^{1/2} \tag{65}$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$f_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), \mu_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), \ldots, \mu_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), \mu_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T \tag{66}$$

Similarly, the modulation spectral feature values derived from the j-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \tag{67}$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j)\bigr)^2\right)^{1/2} \tag{68}$$

$$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \tag{69}$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j)\bigr)^2\right)^{1/2} \tag{70}$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{NASE} = [\mu_{MSC\text{-}col}^{NASE}(0), \sigma_{MSC\text{-}col}^{NASE}(0), \mu_{MSV\text{-}col}^{NASE}(0), \sigma_{MSV\text{-}col}^{NASE}(0), \ldots, \mu_{MSC\text{-}col}^{NASE}(J-1), \sigma_{MSC\text{-}col}^{NASE}(J-1), \mu_{MSV\text{-}col}^{NASE}(J-1), \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T \tag{71}$$

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T \tag{72}$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig 28 The row-based modulation spectral feature values (mean and standard deviation along each row of the MSC/MSV matrices, over modulation frequency)

Fig 29 The column-based modulation spectral feature values (mean and standard deviation along each column of the MSC/MSV matrices, over feature dimension)

2.1.6 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

$$\bar{f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n} \tag{73}$$

where $f_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{f}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{f}_c$:

$$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \tag{74}$$

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{max}(m) = \max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m) \tag{75}$$

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
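The normalization of Eqs. (74)-(75) can be sketched as follows (function names and the toy data are ours):

```python
import numpy as np

def fit_minmax(train_vectors):
    """Per-dimension min and max over all training vectors, Eq. (75)."""
    X = np.asarray(train_vectors)
    return X.min(axis=0), X.max(axis=0)

def normalize(f, f_min, f_max):
    """Linear normalization of a feature vector into [0, 1], Eq. (74)."""
    return (f - f_min) / (f_max - f_min)

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
f_min, f_max = fit_minmax(X_train)
result = normalize(np.array([2.0, 25.0]), f_min, f_max)
print(result)  # each dimension rescaled into [0, 1]: 0.5 and 0.75
```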

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

$$S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T \tag{76}$$

where $x_{c,n}$ is the n-th feature vector labeled as class c, $\bar{x}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T \tag{77}$$

where $\bar{x}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(A) = \mathrm{tr}\bigl((A^T S_W A)^{-1}(A^T S_B A)\bigr) \tag{78}$$

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $S_W$ are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of $S_W$ and Λ the diagonal matrix formed by the

corresponding eigenvalues. Thus, $S_W\Phi = \Phi\Lambda$. Each training vector x is then whitening transformed by $\Phi\Lambda^{-1/2}$:

$$x_w = (\Phi\Lambda^{-1/2})^T x \tag{79}$$

It can be shown that the whitened within-class scatter matrix $S_W^w = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix $S_B^w = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2})$ contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of $S_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix $A_{WLDA}$ is defined as

$$A_{WLDA} = \Phi\Lambda^{-1/2}\Psi \tag{80}$$

$A_{WLDA}$ is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$y = A_{WLDA}^T x \tag{81}$$
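A compact NumPy sketch of the whitened LDA procedure of Eqs. (76)-(80); variable names and the toy data are ours, and $S_W$ is assumed nonsingular:

```python
import numpy as np

def whitened_lda(X, y, C):
    """Whitened LDA transformation matrix, Eqs. (76)-(80).

    X : (N, H) training vectors, y : (N,) integer class labels in [0, C).
    Returns A with C-1 columns; projections are A.T @ x, cf. Eq. (81).
    """
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                           # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)  # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)              # Sw = Phi diag(lam) Phi^T
    Wh = Phi @ np.diag(1.0 / np.sqrt(lam))     # whitening matrix Phi Lam^{-1/2}
    Sb_w = Wh.T @ Sb @ Wh                      # whitened between-class scatter
    evals, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(evals)[::-1][:C - 1]]  # top C-1 eigenvectors
    return Wh @ Psi                            # Eq. (80): A = Phi Lam^{-1/2} Psi

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (50, 5)) for c in range(3)])
y = np.repeat(np.arange(3), 50)
A = whitened_lda(X, y, 3)
print(A.shape)  # (5, 2): H = 5 dimensions reduced to C - 1 = 2
```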

2.3 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix $A_{WLDA}$. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

$$\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n} \tag{82}$$

where $y_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{y}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y:

$$s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) \tag{83}$$
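The nearest centroid classifier of Eqs. (82)-(83) can be sketched as follows (function names and the toy data are ours):

```python
import numpy as np

def centroids(Y, labels, C):
    """Per-genre centroid of transformed training vectors, Eq. (82)."""
    return np.array([Y[labels == c].mean(axis=0) for c in range(C)])

def classify(y_vec, cents):
    """Eq. (83): index of the centroid nearest in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(cents - y_vec, axis=1)))

# Two toy genres in a 2-D transformed space
Y = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])
cents = centroids(Y, labels, 2)
print(classify(np.array([4.8, 5.1]), cents))  # 1
```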

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this

study each MP3 audio file is first converted into raw digital audio before

classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \tag{84}$$

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
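Eq. (84) amounts to a class-prior-weighted average, with $P_c = N_c/N$ estimated from the test-set class counts. As a check, plugging in the per-class accuracies of the best combined feature set (the diagonal of Table 36(d)) reproduces the reported 85.32%:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): accuracy weighted by class priors P_c = N_c / N."""
    total = sum(class_counts)
    return sum(a * n / total for a, n in zip(per_class_acc, class_counts))

# Test-set counts of the six genres in the ISMIR2004 split
counts = [320, 114, 26, 45, 102, 122]
# Per-class accuracies of SMMFCC3+SMOSC3+SMASE3, Table 36(d) diagonal
acc = [0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623]
print(round(overall_accuracy(acc, counts), 4))  # 0.8532
```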

3.1 Comparison of row-based modulation spectral feature vectors

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA%) for each row-based modulation spectral feature vector

Feature Set                CA (%)
SMMFCC1                    77.50
SMOSC1                     79.15
SMASE1                     77.78
SMMFCC1+SMOSC1+SMASE1      84.64


Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Columns are the actual genres, rows the classified genres; for each feature set the upper matrix gives counts and the lower matrix gives percentages of the column totals.

(a) SMMFCC1 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          275           0     2           0         1     19
Electronic         0          91     0           1         7      6
Jazz               6           0    18           0         0      4
Metal/Punk         2           3     0          36        20      4
Pop/Rock           4          12     5           8        70     14
World             33           8     1           0         4     75
Total            320         114    26          45       102    122

(a) SMMFCC1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        85.94        0.00   7.69        0.00      0.98  15.57
Electronic      0.00       79.82   0.00        2.22      6.86   4.92
Jazz            1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk      0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock        1.25       10.53  19.23       17.78     68.63  11.48
World          10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          292           1     1           0         2     10
Electronic         1          89     1           2        11     11
Jazz               4           0    19           1         1      6
Metal/Punk         0           5     0          32        21      3
Pop/Rock           0          13     3          10        61      8
World             23           6     2           0         6     84
Total            320         114    26          45       102    122

(b) SMOSC1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        91.25        0.88   3.85        0.00      1.96   8.20
Electronic      0.31       78.07   3.85        4.44     10.78   9.02
Jazz            1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk      0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock        0.00       11.40  11.54       22.22     59.80   6.56
World           7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          286           3     1           0         3     18
Electronic         0          87     1           1         9      5
Jazz               5           4    17           0         0      9
Metal/Punk         0           4     1          36        18      4
Pop/Rock           1          10     3           7        68     13
World             28           6     3           1         4     73
Total            320         114    26          45       102    122

(c) SMASE1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        89.38        2.63   3.85        0.00      2.94  14.75
Electronic      0.00       76.32   3.85        2.22      8.82   4.10
Jazz            1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk      0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock        0.31        8.77  11.54       15.56     66.67  10.66
World           8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          300           0     1           0         0      9
Electronic         0          96     1           1         9      9
Jazz               2           1    21           0         0      1
Metal/Punk         0           1     0          34         8      1
Pop/Rock           1           9     2           9        80     16
World             17           7     1           1         5     86
Total            320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        0.00   3.85        0.00      0.00   7.38
Electronic      0.00       84.21   3.85        2.22      8.82   7.38
Jazz            0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk      0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock        0.31        7.89   7.69       20.00     78.43  13.11
World           5.31        6.14   3.85        2.22      4.90  70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again achieves the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA%) for each column-based modulation spectral feature vector

Feature Set                CA (%)
SMMFCC2                    70.64
SMOSC2                     68.59
SMASE2                     71.74
SMMFCC2+SMOSC2+SMASE2      78.60

Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns are the actual genres, rows the classified genres; for each feature set the upper matrix gives counts and the lower matrix gives percentages of the column totals.

(a) SMMFCC2 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          272           1     1           0         6     22
Electronic         0          84     0           2         8      4
Jazz              13           1    19           1         2     19
Metal/Punk         2           7     0          39        30      4
Pop/Rock           0          11     3           3        47     19
World             33          10     3           0         9     54
Total            320         114    26          45       102    122

(a) SMMFCC2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        85.00        0.88   3.85        0.00      5.88  18.03
Electronic      0.00       73.68   0.00        4.44      7.84   3.28
Jazz            4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk      0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock        0.00        9.65  11.54        6.67     46.08  15.57
World          10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          262           2     0           0         3     33
Electronic         0          83     0           1         9      6
Jazz              17           1    20           0         6     20
Metal/Punk         1           5     0          33        21      2
Pop/Rock           0          17     4          10        51     10
World             40           6     2           1        12     51
Total            320         114    26          45       102    122

(b) SMOSC2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        81.88        1.75   0.00        0.00      2.94  27.05
Electronic      0.00       72.81   0.00        2.22      8.82   4.92
Jazz            5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk      0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock        0.00       14.91  15.38       22.22     50.00   8.20
World          12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          277           0     0           0         2     29
Electronic         0          83     0           1         5      2
Jazz               9           3    17           1         2     15
Metal/Punk         1           5     1          35        24      7
Pop/Rock           2          13     1           8        57     15
World             31          10     7           0        12     54
Total            320         114    26          45       102    122

(c) SMASE2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        86.56        0.00   0.00        0.00      1.96  23.77
Electronic      0.00       72.81   0.00        2.22      4.90   1.64
Jazz            2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk      0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock        0.63       11.40   3.85       17.78     55.88  12.30
World           9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          289           5     0           0         3     18
Electronic         0          89     0           2         4      4
Jazz               2           3    19           0         1     10
Metal/Punk         2           2     0          38        21      2
Pop/Rock           0          12     5           4        61     11
World             27           3     2           1        12     77
Total            320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        90.31        4.39   0.00        0.00      2.94  14.75
Electronic      0.00       78.07   0.00        4.44      3.92   3.28
Jazz            0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk      0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock        0.00       10.53  19.23        8.89     59.80   9.02
World           8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 31 and 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA%) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                CA (%)
SMMFCC3                    80.38
SMOSC3                     81.34
SMASE3                     81.21
SMMFCC3+SMOSC3+SMASE3      85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns are the actual genres, rows the classified genres; for each feature set the upper matrix gives counts and the lower matrix gives percentages of the column totals.

(a) SMMFCC3 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          300           2     1           0         3     19
Electronic         0          86     0           1         7      5
Jazz               2           0    18           0         0      3
Metal/Punk         1           4     0          35        18      2
Pop/Rock           1          16     4           8        67     13
World             16           6     3           1         7     80
Total            320         114    26          45       102    122

(a) SMMFCC3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        1.75   3.85        0.00      2.94  15.57
Electronic      0.00       75.44   0.00        2.22      6.86   4.10
Jazz            0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk      0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock        0.31       14.04  15.38       17.78     65.69  10.66
World           5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          300           0     0           0         1     13
Electronic         0          90     1           2         9      6
Jazz               0           0    21           0         0      4
Metal/Punk         0           2     0          31        21      2
Pop/Rock           0          11     3          10        64     10
World             20          11     1           2         7     87
Total            320         114    26          45       102    122

(b) SMOSC3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        0.00   0.00        0.00      0.98  10.66
Electronic      0.00       78.95   3.85        4.44      8.82   4.92
Jazz            0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk      0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock        0.00        9.65  11.54       22.22     62.75   8.20
World           6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          296           2     1           0         0     17
Electronic         1          91     0           1         4      3
Jazz               0           2    19           0         0      5
Metal/Punk         0           2     1          34        20      8
Pop/Rock           2          13     4           8        71      8
World             21           4     1           2         7     81
Total            320         114    26          45       102    122

(c) SMASE3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        92.50        1.75   3.85        0.00      0.00  13.93
Electronic      0.31       79.82   0.00        2.22      3.92   2.46
Jazz            0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk      0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock        0.63       11.40  15.38       17.78     69.61   6.56
World           6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic          300           2     0           0         0      8
Electronic         2          95     0           2         7      9
Jazz               1           1    20           0         0      0
Metal/Punk         0           0     0          35        10      1
Pop/Rock           1          10     3           7        79     11
World             16           6     3           1         6     93
Total            320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
             Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic        93.75        1.75   0.00        0.00      0.00   6.56
Electronic      0.63       83.33   0.00        4.44      6.86   7.38
Jazz            0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk      0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock        0.31        8.77  11.54       15.56     77.45   9.02
World           5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation spectral energy (MSE) of each subband

Feature Set                MSCs & MSVs    MSE
SMMFCC1                      77.50        72.02
SMMFCC2                      70.64        69.82
SMMFCC3                      80.38        79.15
SMOSC1                       79.15        77.50
SMOSC2                       68.59        70.51
SMOSC3                       81.34        80.11
SMASE1                       77.78        76.41
SMASE2                       71.74        71.06
SMASE3                       81.21        79.15
SMMFCC1+SMOSC1+SMASE1        84.64        85.08
SMMFCC2+SMOSC2+SMASE2        78.60        79.01
SMMFCC3+SMOSC3+SMASE3        85.32        85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proc. of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proc. of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proc. of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proc. of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proc. of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proc. of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.
[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, September 2003, pp. 8-11.
[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007 (2007) 1-12.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, 2005, pp. 197-200.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Trans. on Speech and Audio Processing 13 (2) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histograms in audio and symbolic music information retrieval," Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine 23 (2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication 25 (1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," Proc. of the 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139


scales both in the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. WPT is a variant of DWT which is achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT, which recursively decomposes only the low-pass subband, WPT decomposes both subbands at each level.

Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.
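The iterative procedure of binary AdaBoost [34] can be sketched with one-dimensional threshold stumps as weak learners. This is an illustrative reimplementation under our own naming (train_adaboost, predict_adaboost), not the configuration used in [33]:

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """Binary AdaBoost with threshold-stump weak learners.

    X: (n_samples, n_features) array; y: labels in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha) stumps.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # example weights, updated every round
    ensemble = []
    for _ in range(n_rounds):
        best = None                     # (weighted error, feat, thr, pol, pred)
        for feat in range(d):
            for thr in np.unique(X[:, feat]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, feat] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, feat, thr, pol, pred)
        err, feat, thr, pol, pred = best
        err = max(err, 1e-10)                    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak learner
        w *= np.exp(-alpha * y * pred)           # boost misclassified examples
        w /= w.sum()
        ensemble.append((feat, thr, pol, alpha))
    return ensemble

def predict_adaboost(ensemble, X):
    """Sign of the alpha-weighted vote of all stumps."""
    score = np.zeros(X.shape[0])
    for feat, thr, pol, alpha in ensemble:
        score += alpha * np.where(pol * (X[:, feat] - thr) > 0, 1, -1)
    return np.sign(score)
```

The exhaustive threshold search is quadratic and meant only to make the weight-update loop explicit; practical implementations sort each feature once per round.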

1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification will be introduced. In Chapter 3, some experiments will be presented to show the effectiveness of the proposed method. Finally, conclusions will be given in Chapter 4.

Chapter 2

The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.


2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.

Step 1: Pre-emphasis

$\hat{s}[n] = s[n] - a \times s[n-1]$   (1)

where s[n] is the current sample and s[n−1] is the previous sample; a typical value for a is 0.95.

Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3: Windowing

Each frame is multiplied by a Hamming window:

$\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n \le N-1$   (2)

where the Hamming window function w[n] is defined as

$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$   (3)

Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j2\pi nk/N}, \quad 0 \le k \le N-1$   (4)

where k is the frequency index.

Step 5: Mel-Scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$E(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N/2-1$   (5)

where B is the total number of filters (B = 25 in this study), and $I_{b_l}$ and $I_{b_h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$.

$I_{b_l}$ and $I_{b_h}$ are given as

$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s/N}$   (6)

where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)

MFCC can be obtained by applying DCT to the logarithm of E(b):

$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\left(1 + E(b)\right)\cos\left(\frac{\pi l (b+0.5)}{B}\right), \quad 0 \le l < L$   (7)

where L is the length of the MFCC feature vector (L = 20 in this study). Therefore, the MFCC feature vector can be represented as follows:

$x^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T$   (8)

Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)

Table 2.1 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
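The six MFCC steps above can be sketched compactly in numpy. This is an illustrative reading of Eqs. (1)-(7): band energies are summed over each filter's bin range as in Eq. (5) (rather than triangularly weighted), the band edges are passed in as an argument (the Table 2.1 values can be supplied), and the function name and signature are ours:

```python
import numpy as np

def mfcc_frame(frame, prev_sample, band_edges, fs, L=20):
    """MFCC of a single frame following Eqs. (1)-(7).

    frame: N audio samples; prev_sample: last sample of the previous frame
    (for pre-emphasis continuity); band_edges: list of (f_low, f_high) pairs.
    """
    N = len(frame)
    # Step 1: pre-emphasis, s^[n] = s[n] - a*s[n-1] with a = 0.95 (Eq. (1))
    shifted = np.concatenate(([prev_sample], frame[:-1]))
    emph = frame - 0.95 * shifted
    # Step 3: Hamming window (Eq. (3))
    n = np.arange(N)
    windowed = emph * (0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1)))
    # Step 4: FFT; squared amplitude A[k] = |X[k]|^2
    A = np.abs(np.fft.fft(windowed)) ** 2
    # Step 5: band energies E(b), summed over each band's bin range (Eq. (5))
    E = np.array([A[int(fl / (fs / N)):int(fh / (fs / N)) + 1].sum()
                  for fl, fh in band_edges])
    # Step 6: DCT of log10(1 + E(b)) (Eq. (7))
    B = len(band_edges)
    b = np.arange(B)
    return np.array([np.sum(np.log10(1 + E) * np.cos(np.pi * l * (b + 0.5) / B))
                     for l in range(L)])
```

Applying this per frame over a whole track yields the per-frame MFCC trajectories that the modulation spectral analysis of Section 2.1.4 operates on.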

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N/2-1$   (9)

where B is the number of subbands, and $I_{b_l}$ and $I_{b_h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$.

$I_{b_l}$ and $I_{b_h}$ are given as

$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s/N}$   (10)

where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$Peak(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)$   (11)

$Valley(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)$   (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$SC(b) = Peak(b) - Valley(b)$   (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$x^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T$   (14)

Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)

Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
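Steps 1-3 of the OSC extraction can be sketched as follows. This is an illustrative reading of Eqs. (9)-(14), assuming the Table 2.2 band edges (44.1 kHz material); a small epsilon guards the logarithm against empty or zero bands, and the function name is ours:

```python
import numpy as np

def osc_frame(frame, fs, alpha=0.2):
    """OSC of one frame: [Valley(0..B-1), SC(0..B-1)], Eqs. (9)-(14)."""
    edges = [(0, 0), (0, 100), (100, 200), (200, 400), (400, 800),
             (800, 1600), (1600, 3200), (3200, 6400), (6400, 12800),
             (12800, 22050)]                     # Table 2.2, fs = 44.1 kHz
    N = len(frame)
    mag = np.abs(np.fft.fft(frame))              # magnitude spectrum
    valleys, contrasts = [], []
    for fl, fh in edges:
        lo = int(fl / (fs / N))
        hi = max(int(fh / (fs / N)), lo)
        band = np.sort(mag[lo:hi + 1])[::-1]     # M_b,1 >= ... >= M_b,Nb
        nb = len(band)
        k = max(1, int(alpha * nb))              # neighborhood size alpha*Nb
        peak = np.log(band[:k].mean() + 1e-12)       # Eq. (11)
        valley = np.log(band[-k:].mean() + 1e-12)    # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)          # Eq. (13), SC(b) >= 0
    return np.array(valleys + contrasts)         # Eq. (14)
```

Because the band is sorted in decreasing order, the top-k mean can never fall below the bottom-k mean, so every spectral contrast value is non-negative.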

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0 \\[2mm] \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < \dfrac{N}{2} \end{cases}$   (15)

where $E_w$ is the energy of the Hamming window function w(n) of size $N_w$:

$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2$   (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The subband filtering operation can be described as follows (see Table 2.3):

$ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B,\ 0 \le k \le N/2-1$   (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16, r = 1/2 in this study):

$r = 2^j \text{ octaves}, \quad -4 \le j \le 3$   (18)

$I_{b_l}$ and $I_{b_h}$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s/N}$   (19)

where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1$   (20)

Each ASE coefficient is then converted to the decibel scale:

$ASE_{dB}(b) = 10\log_{10}(ASE(b)), \quad 0 \le b \le B+1$   (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1$   (22)

where the RMS-norm gain value R is defined as

$R = \sqrt{\sum_{b=0}^{B+1}\left(ASE_{dB}(b)\right)^2}$   (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame can be represented as follows:

$x^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T$   (24)

Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: 16 coefficients in logarithmically spaced bands between loEdge = 62.5 Hz and hiEdge = 16 kHz, plus one coefficient below loEdge and one coefficient above hiEdge

Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
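Eqs. (15)-(24) can be sketched as below. This is an illustrative reading under two stated assumptions: the FFT size is taken equal to the analysis window size, and a small epsilon guards the decibel conversion against empty bands; the function name and argument layout are ours:

```python
import numpy as np

def nase_frame(frame, fs, band_edges):
    """NASE of one frame following Eqs. (15)-(24).

    band_edges: list of (f_low, f_high) pairs, e.g. Table 2.3.
    Returns [R, NASE(0), ..., NASE(B+1)].
    """
    Nw = len(frame)
    w = np.hamming(Nw)
    Ew = np.sum(w ** 2)                       # window energy, Eq. (16)
    N = Nw                                    # FFT size = window size (assumed)
    X = np.fft.fft(frame * w)
    # power spectrum, Eq. (15): DC scaled by 1/(N*Ew), other bins by 2/(N*Ew)
    P = 2.0 * np.abs(X[:N // 2]) ** 2 / (N * Ew)
    P[0] /= 2.0
    # ASE: sum of power spectrum coefficients per subband, Eq. (20)
    ase = np.array([P[int(fl / (fs / N)):int(fh / (fs / N)) + 1].sum()
                    for fl, fh in band_edges])
    ase_db = 10.0 * np.log10(ase + 1e-12)     # Eq. (21), eps avoids log(0)
    R = np.sqrt(np.sum(ase_db ** 2))          # RMS-norm gain, Eq. (23)
    return np.concatenate(([R], ase_db / R))  # Eqs. (22) and (24)
```

With the 18 bands of Table 2.3, the returned vector has 19 components (R plus 18 normalized subband values), matching the D = 19 NASE dimension used later for MASE.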

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $MFCC_i[l]$, 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t\times W+n}[l]\, e^{-j2\pi nm/W}, \quad 0 \le m < W,\ 0 \le l < L$   (25)

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\ 0 \le l < L$   (26)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)$   (27)

$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)$   (28)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)$   (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.

Fig. 2.5 The flowchart for extracting MMFCC
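The texture-window FFT and subband peak/valley search of Eqs. (25)-(29) can be sketched for a generic trajectory matrix; the same code serves MFCC, OSC, or NASE trajectories. This is an illustrative sketch (function name and the non-overlap handling of trailing frames are ours):

```python
import numpy as np

def modulation_contrast(traj, W=512, J=8):
    """Modulation spectral contrast/valley from a feature-trajectory matrix.

    traj: (n_frames, L) per-frame feature values (e.g. MFCC trajectories).
    FFT along time within texture windows of length W (50% overlap),
    magnitude-average over windows (Eq. (26)), then max/min within
    octave-spaced modulation subbands (Eqs. (27)-(29)).
    Returns (MSC, MSV), each of shape (J, L).
    """
    n_frames, L = traj.shape
    hop = W // 2                               # 50% overlap between windows
    specs = [np.abs(np.fft.fft(traj[s:s + W], axis=0))
             for s in range(0, n_frames - W + 1, hop)]
    M = np.mean(specs, axis=0)                 # time-averaged magnitude, Eq. (26)
    # octave-spaced subband edges: [0,2), [2,4), [4,8), ..., [128,256) (Table 2.4)
    edges = [0] + [2 ** (j + 1) for j in range(J)]
    MSC = np.zeros((J, L))
    MSV = np.zeros((J, L))
    for j in range(J):
        band = M[edges[j]:edges[j + 1]]
        MSV[j] = band.min(axis=0)              # Eq. (28)
        MSC[j] = band.max(axis=0) - MSV[j]     # Eq. (29): MSP - MSV
    return MSC, MSV
```

Note that MSC is non-negative by construction, since within each subband the maximum can never fall below the minimum.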

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.


Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $OSC_i[d]$, 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t\times W+n}[d]\, e^{-j2\pi nm/W}, \quad 0 \le m < W,\ 0 \le d < D$   (30)

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D$   (31)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)$   (32)

$MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)$   (33)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)$   (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $NASE_i[d]$, 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t\times W+n}[d]\, e^{-j2\pi nm/W}, \quad 0 \le m < W,\ 0 \le d < D$   (35)

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D$   (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)$   (37)

$MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)$   (38)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)$   (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowing/averaging of the modulation spectrograms → contrast/valley determination → MASE)

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat intervals of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l)$   (40)

$\sigma_{MSC\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l)\right)^2\right)^{1/2}$   (41)

$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l)$   (42)

$\sigma_{MSV\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l)\right)^2\right)^{1/2}$   (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$f_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), \mu_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), \ldots, \mu_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), \mu_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T$   (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l)$   (45)

$\sigma_{MSC\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j)\right)^2\right)^{1/2}$   (46)

$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l)$   (47)

$\sigma_{MSV\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j)\right)^2\right)^{1/2}$   (48)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), \mu_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), \ldots, \mu_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), \mu_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T$   (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

$f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T$   (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
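The row/column statistics of Eqs. (40)-(50) reduce to per-axis means and standard deviations of the MSC and MSV matrices. A minimal sketch (the entry ordering differs from Eq. (44), which interleaves the statistics per feature value, but the content is identical; the function name is ours):

```python
import numpy as np

def aggregate(MSC, MSV):
    """Row/column statistical aggregation of MSC/MSV matrices, Eqs. (40)-(50).

    MSC, MSV: (J, L) matrices (J modulation subbands, L feature values).
    Returns a vector of length 4L + 4J.
    """
    row = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),   # Eqs. (40), (41)
                          MSV.mean(axis=0), MSV.std(axis=0)])  # Eqs. (42), (43)
    col = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),   # Eqs. (45), (46)
                          MSV.mean(axis=1), MSV.std(axis=1)])  # Eqs. (47), (48)
    return np.concatenate([row, col])                          # Eq. (50)
```

With J = 8 and L = 20, the returned vector has 4×20 + 4×8 = 112 components, matching the SMMFCC dimension stated above; the same function applied to the MOSC and MASE matrices yields the SMOSC (112) and SMASE (108) dimensions.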

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d)$   (51)

$\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d)\right)^2\right)^{1/2}$   (52)

$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d)$   (53)

$\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d)\right)^2\right)^{1/2}$   (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$f_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), \mu_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), \ldots, \mu_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), \mu_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T$   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d)$   (56)

$\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j)\right)^2\right)^{1/2}$   (57)

$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d)$   (58)

$\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j)\right)^2\right)^{1/2}$   (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), \mu_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), \ldots, \mu_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), \mu_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T$   (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T$   (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d)$   (62)

$\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d)\right)^2\right)^{1/2}$   (63)

$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d)$   (64)

$\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d)\right)^2\right)^{1/2}$   (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$f_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), \mu_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), \ldots, \mu_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), \mu_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T$   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d)$   (67)

$\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j)\right)^2\right)^{1/2}$   (68)

$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d)$   (69)

$\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j)\right)^2\right)^{1/2}$   (70)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f_{col}^{NASE} = [\mu_{MSC\text{-}col}^{NASE}(0), \sigma_{MSC\text{-}col}^{NASE}(0), \mu_{MSV\text{-}col}^{NASE}(0), \sigma_{MSV\text{-}col}^{NASE}(0), \ldots, \mu_{MSC\text{-}col}^{NASE}(J-1), \sigma_{MSC\text{-}col}^{NASE}(J-1), \mu_{MSV\text{-}col}^{NASE}(J-1), \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T$   (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T$   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 2.8 The row-based modulation spectral feature values: the mean $\mu_{row}$ and standard deviation $\sigma_{row}$ are computed along each row of the MSC/MSV matrix, i.e., across the modulation frequency axis of a texture window

Fig. 2.9 The column-based modulation spectral feature values: the mean $\mu_{col}$ and standard deviation $\sigma_{col}$ are computed along each column of the MSC/MSV matrix, i.e., across the feature dimension axis of a texture window

216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n}$ (73)

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{\mathbf{f}}_c$:

$\hat{f}_c(m) = \dfrac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C$ (74)

where C is the number of classes, $\bar{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the maximum and minimum of the m-th feature values over all training music signals:

$f_{\max}(m) = \max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m)$ (75)

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
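As a minimal illustration of the normalization in Eqs. (74)-(75), the NumPy sketch below (function and variable names are my own, not from the thesis) min-max normalizes each feature dimension of a training set and keeps the per-dimension extrema so that test vectors can be mapped with the same transformation:

```python
import numpy as np

def linear_normalize(train_features):
    """Min-max normalize each feature dimension to [0, 1], as in Eqs. (74)-(75).
    train_features: (num_signals, num_dims) matrix of training feature vectors.
    Returns the normalized matrix plus the per-dimension min/max needed to
    normalize unseen test vectors identically."""
    f_min = train_features.min(axis=0)   # f_min(m) over all training signals
    f_max = train_features.max(axis=0)   # f_max(m) over all training signals
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # guard: constant dims
    normalized = (train_features - f_min) / span
    return normalized, f_min, f_max

rng = np.random.default_rng(1)
# feature dimensions with very different dynamic ranges
X = rng.normal(size=(40, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])
Xn, f_min, f_max = linear_normalize(X)
```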

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let $\mathbf{S}_W$ and $\mathbf{S}_B$ denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c}(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_c)(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_c)^{T}$ (76)

where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$\mathbf{S}_B = \sum_{c=1}^{C} N_c(\bar{\mathbf{x}}_c-\bar{\mathbf{x}})(\bar{\mathbf{x}}_c-\bar{\mathbf{x}})^{T}$ (77)

where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$J_F(\mathbf{A}) = \mathrm{tr}\left((\mathbf{A}^{T}\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^{T}\mathbf{S}_B\mathbf{A})\right)$ (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $\mathbf{S}_W$ are calculated. Let $\mathbf{\Phi}$ denote the matrix whose columns are the orthonormal eigenvectors of $\mathbf{S}_W$, and $\mathbf{\Lambda}$ the diagonal matrix formed by the corresponding eigenvalues; thus $\mathbf{S}_W\mathbf{\Phi} = \mathbf{\Phi}\mathbf{\Lambda}$. Each training vector $\mathbf{x}$ is then whitening-transformed by $\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}$:

$\mathbf{w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{x}$ (79)

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_W^{w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_W(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors will become an identity matrix $\mathbf{I}$. Thus, the whitened between-class scatter matrix $\mathbf{S}_B^{w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_B(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix $\mathbf{\Psi}$ can be determined by finding the eigenvectors of $\mathbf{S}_B^{w}$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix $\mathbf{\Psi}$. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$\mathbf{A}_{WLDA} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi}$ (80)

$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let $\mathbf{x}$ denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$\mathbf{y} = \mathbf{A}_{WLDA}^{T}\mathbf{x}$ (81)
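The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows. This is a simplified NumPy illustration, not the thesis's implementation; the function name, toy data, and the small regularization term added before inverting the eigenvalues are my own assumptions:

```python
import numpy as np

def whitened_lda(X, y, h):
    """Whitened LDA following Eqs. (76)-(81): whiten with the eigen-
    decomposition of S_W, then keep the leading eigenvectors of the
    whitened between-class scatter.  X: (n_samples, n_dims); y: integer
    class labels; h: target dimension (at most C-1).  Returns A_WLDA."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                # Eq. (76)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)              # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                    # Sw = Phi Lambda Phi^T
    W = Phi @ np.diag(1.0 / np.sqrt(lam + 1e-10))    # Phi Lambda^{-1/2} (regularized)
    Sb_w = W.T @ Sb @ W                              # whitened between-class scatter
    _, Psi = np.linalg.eigh(Sb_w)                    # eigh returns ascending order
    Psi = Psi[:, ::-1][:, :h]                        # leading h eigenvectors
    return W @ Psi                                   # Eq. (80): A = Phi Lambda^{-1/2} Psi

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(30, 4)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 30)
A = whitened_lda(X, y, h=2)
Y = X @ A                                            # Eq. (81), applied row-wise
```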

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let $\mathbf{y}$ denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n}$ (82)

where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to $\mathbf{y}$:

$s = \arg\min_{1\le c\le C} d(\mathbf{y}, \bar{\mathbf{y}}_c)$ (83)
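A minimal sketch of this nearest-centroid rule (Eqs. (82)-(83)); the toy data, genre labels, and function names below are illustrative only:

```python
import numpy as np

def train_centroids(Y, labels):
    """Per-genre centroid of (already LDA-transformed) feature vectors, Eq. (82)."""
    classes = np.unique(labels)
    return classes, np.array([Y[labels == c].mean(axis=0) for c in classes])

def classify(y, classes, centroids):
    """Nearest-centroid rule, Eq. (83): pick the genre whose representative
    vector has minimum Euclidean distance to y."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return classes[np.argmin(dists)]

# toy example with three well-separated "genres"
rng = np.random.default_rng(3)
Y_train = np.vstack([rng.normal(m, 0.1, size=(20, 2)) for m in (0.0, 2.0, 4.0)])
labels = np.repeat(["classical", "jazz", "world"], 20)
classes, centroids = train_centroids(Y_train, labels)
pred = classify(np.array([3.9, 4.1]), classes, centroids)  # near the third centroid
```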

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$CA = \sum_{1\le c\le C} P_c \cdot CA_c$ (84)

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
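With $P_c$ taken as each genre's share of the test set, Eq. (84) reduces to total correct over total tracks. A small sketch using the ISMIR2004 test-set class sizes; the per-class correct counts are a hypothetical example, not reported results:

```python
import numpy as np

def overall_accuracy(per_class_correct, per_class_total):
    """Overall accuracy of Eq. (84): CA = sum_c P_c * CA_c, with P_c the
    class prior (share of test tracks) and CA_c the per-class accuracy."""
    per_class_correct = np.asarray(per_class_correct, dtype=float)
    per_class_total = np.asarray(per_class_total, dtype=float)
    P = per_class_total / per_class_total.sum()   # P_c
    CA_c = per_class_correct / per_class_total    # per-class accuracy
    return float(np.sum(P * CA_c))

# test-set sizes of the six ISMIR2004 classes, and example correct counts
totals = [320, 114, 26, 45, 102, 122]
correct = [300, 95, 20, 35, 79, 93]
ca = overall_accuracy(correct, totals)  # equals sum(correct) / sum(totals)
```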

31 Comparison of row-based modulation spectral feature vectors

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                           77.50
SMOSC1                            79.15
SMASE1                            77.78
SMMFCC1+SMOSC1+SMASE1             84.64


Table 32 Confusion matrices of row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Columns give the actual genre and rows the classified genre; for each subtable, the first matrix lists track counts and the second the corresponding column-wise percentages.

(a) SMMFCC1 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0      2           0         1     19
Electronic        0          91      0           1         7      6
Jazz              6           0     18           0         0      4
Metal/Punk        2           3      0          36        20      4
Pop/Rock          4          12      5           8        70     14
World            33           8      1           0         4     75
Total           320         114     26          45       102    122

(a) SMMFCC1 — percentages (%)
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1      1           0         2     10
Electronic        1          89      1           2        11     11
Jazz              4           0     19           1         1      6
Metal/Punk        0           5      0          32        21      3
Pop/Rock          0          13      3          10        61      8
World            23           6      2           0         6     84
Total           320         114     26          45       102    122

(b) SMOSC1 — percentages (%)
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3      1           0         3     18
Electronic        0          87      1           1         9      5
Jazz              5           4     17           0         0      9
Metal/Punk        0           4      1          36        18      4
Pop/Rock          1          10      3           7        68     13
World            28           6      3           1         4     73
Total           320         114     26          45       102    122

(c) SMASE1 — percentages (%)
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0      1           0         0      9
Electronic        0          96      1           1         9      9
Jazz              2           1     21           0         0      1
Metal/Punk        0           1      0          34         8      1
Pop/Rock          1           9      2           9        80     16
World            17           7      1           1         5     86
Total           320         114     26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 — percentages (%)
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49


32 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again gets the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA, %) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                           70.64
SMOSC2                            68.59
SMASE2                            71.74
SMMFCC2+SMOSC2+SMASE2             78.60

Table 34 Confusion matrices of column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns give the actual genre and rows the classified genre; for each subtable, the first matrix lists track counts and the second the corresponding column-wise percentages.

(a) SMMFCC2 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1      1           0         6     22
Electronic        0          84      0           2         8      4
Jazz             13           1     19           1         2     19
Metal/Punk        2           7      0          39        30      4
Pop/Rock          0          11      3           3        47     19
World            33          10      3           0         9     54
Total           320         114     26          45       102    122

(a) SMMFCC2 — percentages (%)
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2      0           0         3     33
Electronic        0          83      0           1         9      6
Jazz             17           1     20           0         6     20
Metal/Punk        1           5      0          33        21      2
Pop/Rock          0          17      4          10        51     10
World            40           6      2           1        12     51
Total           320         114     26          45       102    122

(b) SMOSC2 — percentages (%)
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0      0           0         2     29
Electronic        0          83      0           1         5      2
Jazz              9           3     17           1         2     15
Metal/Punk        1           5      1          35        24      7
Pop/Rock          2          13      1           8        57     15
World            31          10      7           0        12     54
Total           320         114     26          45       102    122

(c) SMASE2 — percentages (%)
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5      0           0         3     18
Electronic        0          89      0           2         4      4
Jazz              2           3     19           0         1     10
Metal/Punk        2           2      0          38        21      2
Pop/Rock          0          12      5           4        61     11
World            27           3      2           1        12     77
Total           320         114     26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 — percentages (%)
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

33 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                           80.38
SMOSC3                            81.34
SMASE3                            81.21
SMMFCC3+SMOSC3+SMASE3             85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns give the actual genre and rows the classified genre; for each subtable, the first matrix lists track counts and the second the corresponding column-wise percentages.

(a) SMMFCC3 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2      1           0         3     19
Electronic        0          86      0           1         7      5
Jazz              2           0     18           0         0      3
Metal/Punk        1           4      0          35        18      2
Pop/Rock          1          16      4           8        67     13
World            16           6      3           1         7     80
Total           320         114     26          45       102    122

(a) SMMFCC3 — percentages (%)
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0      0           0         1     13
Electronic        0          90      1           2         9      6
Jazz              0           0     21           0         0      4
Metal/Punk        0           2      0          31        21      2
Pop/Rock          0          11      3          10        64     10
World            20          11      1           2         7     87
Total           320         114     26          45       102    122

(b) SMOSC3 — percentages (%)
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2      1           0         0     17
Electronic        1          91      0           1         4      3
Jazz              0           2     19           0         0      5
Metal/Punk        0           2      1          34        20      8
Pop/Rock          2          13      4           8        71      8
World            21           4      1           2         7     81
Total           320         114     26          45       102    122

(c) SMASE3 — percentages (%)
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 — counts
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2      0           0         0      8
Electronic        2          95      0           2         7      9
Jazz              1           1     20           0         0      0
Metal/Punk        0           0      0          35        10      1
Pop/Rock          1          10      3           7        79     11
World            16           6      3           1         6     93
Total           320         114     26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 — percentages (%)
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as feature values

Feature Set                   MSCs & MSVs    MSE
SMMFCC1                             77.50  72.02
SMMFCC2                             70.64  69.82
SMMFCC3                             80.38  79.15
SMOSC1                              79.15  77.50
SMOSC2                              68.59  70.51
SMOSC3                              81.34  80.11
SMASE1                              77.78  76.41
SMASE2                              71.74  71.06
SMASE3                              81.21  79.15
SMMFCC1+SMOSC1+SMASE1               84.64  85.08
SMMFCC2+SMOSC2+SMASE2               78.60  79.01
SMMFCC3+SMOSC3+SMASE3               85.32  85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


21 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

211 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig 21 is a flowchart for extracting MFCC from an input signal. The detailed steps will be given below.

Step 1 Pre-emphasis

$\hat{s}[n] = s[n] - a \cdot s[n-1]$ (1)

where s[n] is the current sample and s[n−1] is the previous sample; a typical value for a is 0.95.

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3 Windowing

Each frame is multiplied by a Hamming window:

$\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n \le N-1$ (2)

where the Hamming window function w[n] is defined as

$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$ (3)


Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j\frac{2\pi nk}{N}}, \quad 0 \le k \le N-1$ (4)

where k is the frequency index.

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N-1$ (5)

where B is the total number of filters (B is 25 in this study), and $I_{b,l}$ and $I_{b,h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b,l}$ and $I_{b,h}$ are given as

$I_{b,l} = \left(\frac{f_{b,l}}{f_s}\right)N, \qquad I_{b,h} = \left(\frac{f_{b,h}}{f_s}\right)N$ (6)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 21.

Step 6 Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT on the logarithm of E(b):

$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\left(1 + E_i(b)\right)\cos\left(\frac{l\,(b+0.5)\,\pi}{B}\right), \quad 0 \le l < L$ (7)

where L is the length of the MFCC feature vector (L is 20 in this study).


Therefore, the MFCC feature vector can be represented as follows:

$\mathbf{x}_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T$ (8)

Fig 21 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 21 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
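The MFCC pipeline of Eqs. (1)-(8) for a single frame can be sketched as below. This is a minimal NumPy illustration; the band edges here are a simplified stand-in for Table 21, and all function and variable names are my own:

```python
import numpy as np

def mfcc_frame(frame, band_edges_hz, fs, L=20, a=0.95):
    """MFCC of one frame following Eqs. (1)-(7): pre-emphasis, Hamming
    window, FFT power spectrum, band energies over the given bands, then
    a DCT of the log band energies.  band_edges_hz: (low, high) per band."""
    s = np.append(frame[0], frame[1:] - a * frame[:-1])   # Eq. (1) pre-emphasis
    N = len(s)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Eq. (3)
    A = np.abs(np.fft.fft(s * w)) ** 2                    # Eqs. (2), (4): A[k] = |X[k]|^2
    B = len(band_edges_hz)
    E = np.empty(B)
    for b, (f_lo, f_hi) in enumerate(band_edges_hz):
        lo, hi = int(f_lo / fs * N), int(f_hi / fs * N)   # Eq. (6) band indices
        E[b] = A[lo:hi + 1].sum()                         # Eq. (5) band energy
    bb = np.arange(B)
    mfcc = np.array([np.sum(np.log10(1.0 + E) * np.cos(l * (bb + 0.5) * np.pi / B))
                     for l in range(L)])                  # Eq. (7) DCT
    return mfcc

fs = 22050
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 440 * t)                       # a 440 Hz test tone
bands = [(100 * b, 100 * b + 200) for b in range(10)]     # simplified band edges
x_mfcc = mfcc_frame(frame, bands, fs, L=10)               # Eq. (8) feature vector
```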

212 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig 22 shows the block diagram for extracting the OSC feature. The detailed steps will be described below.


Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2 Octave Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 22. The octave-scale filtering operation can be described as follows:

$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N-1$ (9)

where B is the number of subbands, and $I_{b,l}$ and $I_{b,h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b,l}$ and $I_{b,h}$ are given as

$I_{b,l} = \left(\frac{f_{b,l}}{f_s}\right)N, \qquad I_{b,h} = \left(\frac{f_{b,h}}{f_s}\right)N$ (10)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$Peak(b) = \log\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)$ (11)

$Valley(b) = \log\left(1 + \frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)$ (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$SC(b) = Peak(b) - Valley(b)$ (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$\mathbf{x}_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T$ (14)

Fig 22 The flowchart for computing OSC (Input Signal → Framing → FFT → Octave scale filtering → Peak/Valley Selection → Spectral Contrast → OSC)


Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
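A compact sketch of the OSC computation (Eqs. (11)-(14)) for one frame, using octave edges in the spirit of Table 22. The upper-band handling is simplified and all names are my own, so treat this as an illustration rather than the thesis's implementation:

```python
import numpy as np

def osc_frame(frame, fs, alpha=0.2):
    """Octave-based spectral contrast of one frame: within each octave
    subband, the peak/valley are log-averages of the alpha*N_b largest/
    smallest squared magnitudes (Eqs. (11)-(12)), and the contrast is
    their difference (Eq. (13))."""
    N = len(frame)
    A = np.abs(np.fft.fft(frame)) ** 2
    # octave band edges loosely following Table 22 (top band up to fs/2)
    edges = [0, 100, 200, 400, 800, 1600, 3200, 6400, fs / 2]
    valleys, contrasts = [], []
    for f_lo, f_hi in zip(edges[:-1], edges[1:]):
        lo, hi = int(f_lo / fs * N), int(f_hi / fs * N)
        M = np.sort(A[lo:hi + 1])[::-1]                  # decreasing magnitudes
        nb = max(1, int(round(alpha * len(M))))          # alpha * N_b bins
        peak = np.log10(1.0 + M[:nb].mean())             # Eq. (11)
        valley = np.log10(1.0 + M[-nb:].mean())          # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                  # Eq. (13)
    return np.array(valleys + contrasts)                 # Eq. (14) feature vector

fs = 22050
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 1000 * t) + 0.01 * np.random.default_rng(4).normal(size=1024)
x_osc = osc_frame(frame, fs)  # 8 valleys followed by 8 contrasts
```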

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig 23 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$P(k) = \begin{cases} \dfrac{1}{N \cdot E_w}\,|X(k)|^2, & k = 0,\ \dfrac{N}{2} \\[2mm] \dfrac{2}{N \cdot E_w}\,|X(k)|^2, & 0 < k < \dfrac{N}{2} \end{cases}$ (15)

where $E_w$ is the energy of the Hamming window function w(n) of size $N_w$:

$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2$ (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig 24). The NASE scale filtering operation can be described as follows (see Table 23):

$ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B,\ 0 \le k \le N-1$ (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$r = 2^j \text{ octaves}, \quad -4 \le j \le 3$ (18)

$I_{b,l}$ and $I_{b,h}$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$I_{b,l} = \left(\frac{f_{b,l}}{f_s}\right)N, \qquad I_{b,h} = \left(\frac{f_{b,h}}{f_s}\right)N$ (19)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power spectrum coefficients within this subband:

$ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1$ (20)

Each ASE coefficient is then converted to the decibel scale:

$ASE_{dB}(b) = 10\log_{10}(ASE(b)), \quad 0 \le b \le B+1$ (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1$ (22)

where the RMS-norm gain value R is defined as

$R = \sqrt{\sum_{b=0}^{B+1}\left(ASE_{dB}(b)\right)^2}$ (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame is represented as follows:

$\mathbf{x}_{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T$ (24)


Fig 23 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: 1 coefficient below loEdge = 62.5 Hz, 16 coefficients between loEdge and hiEdge = 16 kHz, and 1 coefficient above hiEdge

Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62.5]
1               (62.5, 88.4]
2               (88.4, 125]
3               (125, 176.8]
4               (176.8, 250]
5               (250, 353.6]
6               (353.6, 500]
7               (500, 707.1]
8               (707.1, 1000]
9               (1000, 1414.2]
10              (1414.2, 2000]
11              (2000, 2828.4]
12              (2828.4, 4000]
13              (4000, 5656.9]
14              (5656.9, 8000]
15              (8000, 11313.7]
16              (11313.7, 16000]
17              (16000, 22050]

214 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of

audio signals. In order to capture the time-varying behavior of music signals, we

employ modulation spectral analysis on MFCC, OSC, and NASE to observe the

variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is

applied to the MFCC trajectories. Fig 25 shows the flowchart for extracting MMFCC,

and the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame.

The modulation spectrogram is obtained by applying FFT independently on

each feature value along the time trajectory within a texture window of

length W:

M_t(m, l) = \left| \sum_{n=0}^{W-1} MFCC_{(t \times W/2) + n}[l] \, e^{-j 2\pi m n / W} \right|, \quad 0 \le m < W, \ 0 \le l < L \qquad (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and l is the MFCC coefficient index. In

this study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows. The representative modulation spectrogram

of a music track is derived by time-averaging the magnitude modulation

spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, l), \quad 0 \le m < W, \ 0 \le l < L \qquad (26)

where T is the total number of texture windows in the music track
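The texture-window analysis of Eqs. (25)-(26) can be sketched in a few lines of NumPy. This is an illustrative implementation under my own naming (the thesis does not give code); it assumes the per-frame features are already stacked into a (num_frames, L) array and that at least one full texture window fits.

```python
import numpy as np

def modulation_spectrogram(features, W=512):
    """Sketch of Eqs. (25)-(26): magnitude FFT along each feature
    trajectory over texture windows of W frames with 50% overlap,
    then time-averaging over all texture windows.

    features : (num_frames, L) array of per-frame feature values
    returns  : (W, L) averaged modulation spectrogram M(m, l)
    """
    num_frames, L = features.shape
    hop = W // 2                                 # 50% overlap between windows
    acc = np.zeros((W, L))
    T = 0
    for s in range(0, num_frames - W + 1, hop):
        seg = features[s:s + W, :]               # one texture window
        acc += np.abs(np.fft.fft(seg, axis=0))   # |FFT| per trajectory, Eq. (25)
        T += 1
    return acc / T                               # average over T windows, Eq. (26)
```

A feature trajectory oscillating with a period of four frames shows up as a peak at modulation bin m = W/4, which is the kind of rhythmic periodicity the method is after.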

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) \qquad (27)

MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) \qquad (28)

where \Phi_j^l and \Phi_j^h are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband, 0 \le j < J.

The MSPs correspond to the dominant rhythmic components, and the MSVs to the

non-rhythmic components, in the modulation subbands. Therefore, the

difference between MSP and MSV reflects the modulation spectral

contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \qquad (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the

modulation spectral contrast information. Therefore, the feature dimension of

MMFCC is 2×20×8 = 320.

Fig 25 The flowchart for extracting MMFCC
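The peak/valley/contrast step of Eqs. (27)-(29) reduces to per-subband max/min operations over the averaged modulation spectrogram. Below is a minimal sketch under my own naming; the subband bin boundaries follow the modulation frequency index ranges of Table 24 (J = 8, modulation bins [0, 256) for W = 512).

```python
import numpy as np

# Modulation subband boundaries in FFT-bin indices, as in Table 24
SUBBAND_BOUNDS = [(0, 2), (2, 4), (4, 8), (8, 16),
                  (16, 32), (32, 64), (64, 128), (128, 256)]

def msc_msv(M, bounds=SUBBAND_BOUNDS):
    """Sketch of Eqs. (27)-(29): per-subband modulation spectral peak
    (MSP), valley (MSV), and contrast (MSC = MSP - MSV).

    M : (W, L) averaged modulation spectrogram
    returns MSC, MSV as (J, L) matrices
    """
    J, L = len(bounds), M.shape[1]
    msc = np.zeros((J, L))
    msv = np.zeros((J, L))
    for j, (lo, hi) in enumerate(bounds):
        band = M[lo:hi, :]            # modulation bins of subband j
        msp = band.max(axis=0)        # Eq. (27): dominant rhythmic component
        msv[j] = band.min(axis=0)     # Eq. (28): non-rhythmic floor
        msc[j] = msp - msv[j]         # Eq. (29): contrast
    return msc, msv
```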

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum

analysis is applied to the OSC feature values. Fig 26 shows the flowchart for

extracting MOSC, and the detailed steps are described below.


Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 \le d < D, be the d-th OSC of the i-th frame. The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left| \sum_{n=0}^{W-1} OSC_{(t \times W/2) + n}[d] \, e^{-j 2\pi m n / W} \right|, \quad 0 \le m < W, \ 0 \le d < D \qquad (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and d is the OSC coefficient index. In this

study, W is 512, which is about 6 seconds, with 50% overlap between two

successive texture windows. The representative modulation spectrogram of a

music track is derived by time-averaging the magnitude modulation

spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \ 0 \le d < D \qquad (31)

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) \qquad (32)

MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) \qquad (33)

where \Phi_j^l and \Phi_j^h are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband, 0 \le j < J.

The MSPs correspond to the dominant rhythmic components, and the MSVs to the

non-rhythmic components, in the modulation subbands. Therefore, the

difference between MSP and MSV reflects the modulation spectral

contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \qquad (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the

modulation spectral contrast information. Therefore, the feature dimension of

MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum

analysis is applied to the NASE feature values. Fig 27 shows the flowchart for

extracting MASE, and the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 \le d < D, be the d-th NASE of the i-th frame. The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left| \sum_{n=0}^{W-1} NASE_{(t \times W/2) + n}[d] \, e^{-j 2\pi m n / W} \right|, \quad 0 \le m < W, \ 0 \le d < D \qquad (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and d is the NASE coefficient index. In

this study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows. The representative modulation spectrogram

of a music track is derived by time-averaging the magnitude modulation

spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \ 0 \le d < D \qquad (36)

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands (see Table 24).

In this study, the number of modulation subbands is 8 (J = 8). The frequency

interval of each modulation subband is shown in Table 24. For each feature

value, the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d) \qquad (37)

MSV^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d) \qquad (38)

where \Phi_j^l and \Phi_j^h are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband, 0 \le j < J.

The MSPs correspond to the dominant rhythmic components, and the MSVs to the

non-rhythmic components, in the modulation subbands. Therefore, the

difference between MSP and MSV reflects the modulation spectral

contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \qquad (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the

modulation spectral contrast information. Therefore, the feature dimension of

MASE is 2×19×8 = 304.

(Figure: music signal s_1[n], ..., s_I[n] → framing → NASE extraction NASE_1[d], ..., NASE_I[d] → DFT over texture windows → modulation spectrograms M_t^d[m] → windowing/average → contrast/valley determination)

Fig 27 The flowchart for extracting MASE

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral

feature value at different modulation frequencies, which reflects the beat interval of a

music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to

the same modulation subband across different spectral/cepstral feature values (see Fig 29).

To reduce the dimension of the feature space, the mean and standard deviation along

each row (and each column) of the MSC and MSV matrices will be computed as the

feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 \le l < L) row of

the MSC and MSV matrices of MMFCC can be computed as follows:

u^{MFCC}_{MSC\text{-}row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \qquad (40)

\sigma^{MFCC}_{MSC\text{-}row}(l) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - u^{MFCC}_{MSC\text{-}row}(l) \right)^2} \qquad (41)

u^{MFCC}_{MSV\text{-}row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \qquad (42)

\sigma^{MFCC}_{MSV\text{-}row}(l) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - u^{MFCC}_{MSV\text{-}row}(l) \right)^2} \qquad (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L

and can be represented as:

f^{MFCC}_{row} = [u^{MFCC}_{MSC\text{-}row}(0), \sigma^{MFCC}_{MSC\text{-}row}(0), u^{MFCC}_{MSV\text{-}row}(0), \sigma^{MFCC}_{MSV\text{-}row}(0), \ldots, u^{MFCC}_{MSC\text{-}row}(L-1), \sigma^{MFCC}_{MSC\text{-}row}(L-1), u^{MFCC}_{MSV\text{-}row}(L-1), \sigma^{MFCC}_{MSV\text{-}row}(L-1)]^T \qquad (44)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J)

column of the MSC and MSV matrices can be computed as follows:

u^{MFCC}_{MSC\text{-}col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \qquad (45)

\sigma^{MFCC}_{MSC\text{-}col}(j) = \sqrt{\frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - u^{MFCC}_{MSC\text{-}col}(j) \right)^2} \qquad (46)

u^{MFCC}_{MSV\text{-}col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \qquad (47)

\sigma^{MFCC}_{MSV\text{-}col}(j) = \sqrt{\frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - u^{MFCC}_{MSV\text{-}col}(j) \right)^2} \qquad (48)

Thus, the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as:

f^{MFCC}_{col} = [u^{MFCC}_{MSC\text{-}col}(0), \sigma^{MFCC}_{MSC\text{-}col}(0), u^{MFCC}_{MSV\text{-}col}(0), \sigma^{MFCC}_{MSV\text{-}col}(0), \ldots, u^{MFCC}_{MSC\text{-}col}(J-1), \sigma^{MFCC}_{MSC\text{-}col}(J-1), u^{MFCC}_{MSV\text{-}col}(J-1), \sigma^{MFCC}_{MSV\text{-}col}(J-1)]^T \qquad (49)

If the row-based modulation spectral feature vector and the column-based

modulation spectral feature vector are combined together, a larger feature vector of

size (4L+4J) can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T]^T \qquad (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the

column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the

row-based and column-based modulation spectral feature vectors results in a

feature vector of length 4L+4J. That is, the overall

feature dimension of SMMFCC is 80+32 = 112.
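The row- and column-based aggregation of Eqs. (40)-(50) can be sketched as below. This is an illustrative NumPy version under my own naming; for simplicity it concatenates the four statistic vectors block by block rather than interleaving them per index as Eqs. (44) and (49) do, which changes only the ordering, not the content, of the 4L+4J values.

```python
import numpy as np

def aggregate(msc, msv):
    """Sketch of Eqs. (40)-(50): mean and standard deviation along each
    row and each column of the MSC and MSV matrices, concatenated into
    one feature vector of length 4L + 4J.

    msc, msv : (J, L) matrices from the contrast/valley step
    """
    feats = []
    for mat in (msc, msv):
        # row-based: statistics over the J modulation subbands, Eqs. (40)-(43)
        feats.append(mat.mean(axis=0))   # u_row, length L
        feats.append(mat.std(axis=0))    # sigma_row, length L
    for mat in (msc, msv):
        # column-based: statistics over the L feature values, Eqs. (45)-(48)
        feats.append(mat.mean(axis=1))   # u_col, length J
        feats.append(mat.std(axis=1))    # sigma_col, length J
    return np.concatenate(feats)

# e.g. for MFCC: L = 20, J = 8 -> 4*20 + 4*8 = 112 values per track
```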

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of

the MSC and MSV matrices of MOSC can be computed as follows:

u^{OSC}_{MSC\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d) \qquad (51)

\sigma^{OSC}_{MSC\text{-}row}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - u^{OSC}_{MSC\text{-}row}(d) \right)^2} \qquad (52)

u^{OSC}_{MSV\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d) \qquad (53)

\sigma^{OSC}_{MSV\text{-}row}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - u^{OSC}_{MSV\text{-}row}(d) \right)^2} \qquad (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as:

f^{OSC}_{row} = [u^{OSC}_{MSC\text{-}row}(0), \sigma^{OSC}_{MSC\text{-}row}(0), u^{OSC}_{MSV\text{-}row}(0), \sigma^{OSC}_{MSV\text{-}row}(0), \ldots, u^{OSC}_{MSC\text{-}row}(D-1), \sigma^{OSC}_{MSC\text{-}row}(D-1), u^{OSC}_{MSV\text{-}row}(D-1), \sigma^{OSC}_{MSV\text{-}row}(D-1)]^T \qquad (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J)

column of the MSC and MSV matrices can be computed as follows:

u^{OSC}_{MSC\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d) \qquad (56)

\sigma^{OSC}_{MSC\text{-}col}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - u^{OSC}_{MSC\text{-}col}(j) \right)^2} \qquad (57)

u^{OSC}_{MSV\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d) \qquad (58)

\sigma^{OSC}_{MSV\text{-}col}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - u^{OSC}_{MSV\text{-}col}(j) \right)^2} \qquad (59)

Thus, the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as:

f^{OSC}_{col} = [u^{OSC}_{MSC\text{-}col}(0), \sigma^{OSC}_{MSC\text{-}col}(0), u^{OSC}_{MSV\text{-}col}(0), \sigma^{OSC}_{MSV\text{-}col}(0), \ldots, u^{OSC}_{MSC\text{-}col}(J-1), \sigma^{OSC}_{MSC\text{-}col}(J-1), u^{OSC}_{MSV\text{-}col}(J-1), \sigma^{OSC}_{MSV\text{-}col}(J-1)]^T \qquad (60)

If the row-based modulation spectral feature vector and the column-based

modulation spectral feature vector are combined together, a larger feature vector of

size (4D+4J) can be obtained:

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T \qquad (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the

column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the

row-based and column-based modulation spectral feature vectors results in a

feature vector of length 4D+4J. That is, the overall

feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of

the MSC and MSV matrices of MASE can be computed as follows:

u^{NASE}_{MSC\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d) \qquad (62)

\sigma^{NASE}_{MSC\text{-}row}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - u^{NASE}_{MSC\text{-}row}(d) \right)^2} \qquad (63)

u^{NASE}_{MSV\text{-}row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d) \qquad (64)

\sigma^{NASE}_{MSV\text{-}row}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - u^{NASE}_{MSV\text{-}row}(d) \right)^2} \qquad (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

)1 Tminusminusminus minusminusminus DDuD NASErowMSV

NASErowMSVrow σ

(66)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

)]1( )1( )1( (

)0( )0( )0( )0([

minus

=

minus

minusminusminusminus

Du

uuNASEMSC

NASErowMSC

NASErowMSV

NASErowMSV

NASErowMSC

NASErowMSC

NASErow

σ

σσ Lf

u^{NASE}_{MSC\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d) \qquad (67)

\sigma^{NASE}_{MSC\text{-}col}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - u^{NASE}_{MSC\text{-}col}(j) \right)^2} \qquad (68)

u^{NASE}_{MSV\text{-}col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d) \qquad (69)

\sigma^{NASE}_{MSV\text{-}col}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - u^{NASE}_{MSV\text{-}col}(j) \right)^2} \qquad (70)


Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f^{NASE}_{col} = [u^{NASE}_{MSC\text{-}col}(0), \sigma^{NASE}_{MSC\text{-}col}(0), u^{NASE}_{MSV\text{-}col}(0), \sigma^{NASE}_{MSV\text{-}col}(0), \ldots, u^{NASE}_{MSC\text{-}col}(J-1), \sigma^{NASE}_{MSC\text{-}col}(J-1), u^{NASE}_{MSV\text{-}col}(J-1), \sigma^{NASE}_{MSV\text{-}col}(J-1)]^T \qquad (71)

If the row-based modulation spectral feature vector and the column-based

modulation spectral feature vector are combined together, a larger feature vector of

size (4D+4J) can be obtained:

f^{NASE} = [(f^{NASE}_{row})^T, (f^{NASE}_{col})^T]^T \qquad (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the

column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the

row-based and column-based modulation spectral feature vectors results in a

feature vector of length 4D+4J. That is, the overall

feature dimension of SMASE is 76+32 = 108.

(Figure: an MSC/MSV matrix with feature dimension on one axis and modulation frequency on the other; the mean u_row and standard deviation σ_row are taken along each row)

Fig 28 The row-based modulation spectral feature values

(Figure: the same MSC/MSV matrix; the mean u_col and standard deviation σ_col are taken along each column)

Fig 29 The column-based modulation spectral feature values


216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n} \qquad (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th

music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c

is the number of training music signals belonging to the c-th music genre. Since the

dynamic ranges of different feature values may differ, a linear normalization is

applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \qquad (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th

representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the

maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m) \qquad (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece

belonging to the c-th music genre.
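Eqs. (74)-(75) amount to a per-dimension min-max scaling fitted on the training set. A minimal sketch, with my own function name and an added guard for constant features that the thesis does not discuss:

```python
import numpy as np

def minmax_normalize(train_vectors):
    """Sketch of Eqs. (74)-(75): linear normalization of each feature
    dimension using its min/max over all training vectors.

    train_vectors : (num_tracks, M) feature matrix
    returns normalized matrix plus (f_min, f_max) to reuse at test time
    """
    f_min = train_vectors.min(axis=0)            # Eq. (75)
    f_max = train_vectors.max(axis=0)
    # guard against zero division for constant features (my addition)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    normalized = (train_vectors - f_min) / span  # Eq. (74)
    return normalized, f_min, f_max
```

At classification time the same stored f_min and f_max would be applied to each test feature vector, so training and test features live on the same scale.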

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification

accuracy in a lower-dimensional feature space. LDA deals with the

discrimination between various classes rather than the representation of all classes.

The objective of LDA is to minimize the within-class distance while maximizing the

between-class distance. In LDA, an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h \le H) has to be found in

order to provide higher discriminability among various music classes.

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T \qquad (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class

c, C is the total number of music classes, and N_c is the number of training vectors

labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T \qquad (77)

where \bar{x} is the mean vector of all training vectors. The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right) \qquad (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space. In this study, a whitening procedure is integrated with the LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23]. First, the eigenvalues and corresponding

eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the

orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the

corresponding eigenvalues. Thus, S_W \Phi = \Phi \Lambda. Each training vector x is then

whitening transformed by \Phi \Lambda^{-1/2}:

w = (\Phi \Lambda^{-1/2})^T x \qquad (79)

It can be shown that the whitened within-class scatter matrix

S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}),

derived from all the whitened training vectors, becomes an identity matrix I.

Thus, the whitened between-class scatter matrix

S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2})

contains all the discriminative information. A transformation matrix \Psi can be

determined by finding the eigenvectors of S_B^w.

Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors

corresponding to the (C-1) largest eigenvalues form the column vectors of the

transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix

A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi \qquad (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower

h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced

h-dimensional feature vector can be computed by

y = A_{WLDA}^T x \qquad (81)
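The whitened LDA procedure of Eqs. (76)-(81) can be sketched compactly with NumPy's symmetric eigendecomposition. This is an illustrative implementation under my own naming (the small epsilon regularizing near-zero eigenvalues is my addition, not from the thesis):

```python
import numpy as np

def whitened_lda(X, labels):
    """Sketch of Eqs. (76)-(80): compute A_WLDA so that y = A.T @ x
    is the reduced (C-1)-dimensional feature vector of Eq. (81).

    X      : (n_samples, H) training feature matrix
    labels : (n_samples,) integer class labels
    """
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                            # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)   # Eq. (77)
    # whitening: Sw = Phi Lambda Phi^T, whitening matrix Phi Lambda^{-1/2}
    lam, phi = np.linalg.eigh(Sw)
    white = phi @ np.diag(1.0 / np.sqrt(lam + 1e-12))
    Sb_w = white.T @ Sb @ white        # whitened between-class scatter
    lam_b, psi = np.linalg.eigh(Sb_w)
    # keep eigenvectors of the (C-1) largest eigenvalues
    psi = psi[:, np.argsort(lam_b)[::-1][:len(classes) - 1]]
    return white @ psi                 # Eq. (80): A = Phi Lambda^{-1/2} Psi
```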

23 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track. The same

linear normalization process is applied to each feature value. The normalized feature

vector is then transformed into a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA

transformed feature vector. In this study, the nearest centroid classifier is used for

music genre classification. For the c-th (1 \le c \le C) music genre, the centroid of the

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n} \qquad (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music

track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the

c-th music genre, and N_c is the number of training music tracks labeled as the c-th

music genre. The distance between two feature vectors is measured by the Euclidean

distance. Thus, the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has the minimum Euclidean

distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) \qquad (83)
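The nearest centroid rule of Eqs. (82)-(83) is short enough to show in full. A minimal sketch with my own function names:

```python
import numpy as np

def class_centroids(Y, labels):
    """Eq. (82): per-genre mean of the transformed training vectors."""
    classes = np.unique(labels)
    return np.vstack([Y[labels == c].mean(axis=0) for c in classes])

def nearest_centroid_classify(y, centroids):
    """Eq. (83): index of the representative vector with minimum
    Euclidean distance to the transformed test vector y."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(dists))
```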

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison. The database consists of 1458 music tracks, in

which 729 music tracks are used for training and the other 729 tracks for testing. The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this

study, each MP3 audio file is first converted into raw digital audio before

classification. These music tracks are classified into six classes (that is, C = 6):

Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the

music tracks used for training/testing include 320/320 tracks of Classical, 115/114

tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102

tracks of Rock/Pop, and 122/122 tracks of World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy

of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c \qquad (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the

classification accuracy for the c-th music genre.
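Eq. (84) is a prior-weighted average of per-class accuracies; with P_c taken as the relative class frequency in the test set, it reduces to the plain fraction of correctly classified tracks. A minimal sketch (function and parameter names are mine):

```python
def overall_accuracy(per_class_correct, per_class_total):
    """Sketch of Eq. (84): CA = sum_c P_c * CA_c, with
    P_c = N_c / N (class frequency) and CA_c = correct_c / N_c."""
    total = sum(per_class_total)
    return sum((n_c / total) * (correct / n_c)
               for correct, n_c in zip(per_class_correct, per_class_total))
```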

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1

denote respectively the row-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,

and the combined feature vector performs the best. Table 32 shows the corresponding

confusion matrices.

Table 31 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64

Table 32 Confusion matrices of row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. In each pair of matrices, rows are the classified genres and columns the actual genres; the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     275      0           2     0          1        19
Electronic  0        91          0     1          7        6
Jazz        6        0           18    0          0        4
MetalPunk   2        3           0     36         20       4
PopRock     4        12          5     8          70       14
World       33       8           1     0          4        75
Total       320      114         26    45         102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     85.94    0.00        7.69   0.00       0.98     15.57
Electronic  0.00     79.82       0.00   2.22       6.86     4.92
Jazz        1.88     0.00        69.23  0.00       0.00     3.28
MetalPunk   0.63     2.63        0.00   80.00      19.61    3.28
PopRock     1.25     10.53       19.23  17.78      68.63    11.48
World       10.31    7.02        3.85   0.00       3.92     61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     292      1           1     0          2        10
Electronic  1        89          1     2          11       11
Jazz        4        0           19    1          1        6
MetalPunk   0        5           0     32         21       3
PopRock     0        13          3     10         61       8
World       23       6           2     0          6        84
Total       320      114         26    45         102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     91.25    0.88        3.85   0.00       1.96     8.20
Electronic  0.31     78.07       3.85   4.44       10.78    9.02
Jazz        1.25     0.00        73.08  2.22       0.98     4.92
MetalPunk   0.00     4.39        0.00   71.11      20.59    2.46
PopRock     0.00     11.40       11.54  22.22      59.80    6.56
World       7.19     5.26        7.69   0.00       5.88     68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     286      3           1     0          3        18
Electronic  0        87          1     1          9        5
Jazz        5        4           17    0          0        9
MetalPunk   0        4           1     36         18       4
PopRock     1        10          3     7          68       13
World       28       6           3     1          4        73
Total       320      114         26    45         102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     89.38    2.63        3.85   0.00       2.94     14.75
Electronic  0.00     76.32       3.85   2.22       8.82     4.10
Jazz        1.56     3.51        65.38  0.00       0.00     7.38
MetalPunk   0.00     3.51        3.85   80.00      17.65    3.28
PopRock     0.31     8.77        11.54  15.56      66.67    10.66
World       8.75     5.26        11.54  2.22       3.92     59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     300      0           1     0          0        9
Electronic  0        96          1     1          9        9
Jazz        2        1           21    0          0        1
MetalPunk   0        1           0     34         8        1
PopRock     1        9           2     9          80       16
World       17       7           1     1          5        86
Total       320      114         26    45         102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     93.75    0.00        3.85   0.00       0.00     7.38
Electronic  0.00     84.21       3.85   2.22       8.82     7.38
Jazz        0.63     0.88        80.77  0.00       0.00     0.82
MetalPunk   0.00     0.88        0.00   75.56      7.84     0.82
PopRock     0.31     7.89        7.69   20.00      78.43    13.11
World       5.31     6.14        3.85   2.22       4.90     70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2

denote respectively the column-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see

that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,

which differs from the row-based results. As before, the combined feature

vector gets the best performance. Table 34 shows the corresponding confusion

matrices.

Table 33 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 34 Confusion matrices of column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. In each pair of matrices, rows are the classified genres and columns the actual genres; the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     272      1           1     0          6        22
Electronic  0        84          0     2          8        4
Jazz        13       1           19    1          2        19
MetalPunk   2        7           0     39         30       4
PopRock     0        11          3     3          47       19
World       33       10          3     0          9        54
Total       320      114         26    45         102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     85.00    0.88        3.85   0.00       5.88     18.03
Electronic  0.00     73.68       0.00   4.44       7.84     3.28
Jazz        4.06     0.88        73.08  2.22       1.96     15.57
MetalPunk   0.63     6.14        0.00   86.67      29.41    3.28
PopRock     0.00     9.65        11.54  6.67       46.08    15.57
World       10.31    8.77        11.54  0.00       8.82     44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     262      2           0     0          3        33
Electronic  0        83          0     1          9        6
Jazz        17       1           20    0          6        20
MetalPunk   1        5           0     33         21       2
PopRock     0        17          4     10         51       10
World       40       6           2     1          12       51
Total       320      114         26    45         102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     81.88    1.75        0.00   0.00       2.94     27.05
Electronic  0.00     72.81       0.00   2.22       8.82     4.92
Jazz        5.31     0.88        76.92  0.00       5.88     16.39
MetalPunk   0.31     4.39        0.00   73.33      20.59    1.64
PopRock     0.00     14.91       15.38  22.22      50.00    8.20
World       12.50    5.26        7.69   2.22       11.76    41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     277      0           0     0          2        29
Electronic  0        83          0     1          5        2
Jazz        9        3           17    1          2        15
MetalPunk   1        5           1     35         24       7
PopRock     2        13          1     8          57       15
World       31       10          7     0          12       54
Total       320      114         26    45         102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     86.56    0.00        0.00   0.00       1.96     23.77
Electronic  0.00     72.81       0.00   2.22       4.90     1.64
Jazz        2.81     2.63        65.38  2.22       1.96     12.30
MetalPunk   0.31     4.39        3.85   77.78      23.53    5.74
PopRock     0.63     11.40       3.85   17.78      55.88    12.30
World       9.69     8.77        26.92  0.00       11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     289      5           0     0          3        18
Electronic  0        89          0     2          4        4
Jazz        2        3           19    0          1        10
MetalPunk   2        2           0     38         21       2
PopRock     0        12          5     4          61       11
World       27       3           2     1          12       77
Total       320      114         26    45         102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic     90.31    4.39        0.00   0.00       2.94     14.75
Electronic  0.00     78.07       0.00   4.44       3.92     3.28
Jazz        0.63     2.63        73.08  0.00       0.98     8.20
MetalPunk   0.63     1.75        0.00   84.44      20.59    1.64
PopRock     0.00     10.53       19.23  8.89       59.80    9.02
World       8.44     2.63        7.69   2.22       11.76    63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of

row-based and column-based modulation spectral feature vectors. SMMFCC3,

SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC,

OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that

the combined feature vector gets better classification performance than each

individual row-based or column-based feature vector. In particular, the proposed

method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of

85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 3.6 Confusion matrices of the combination of the row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first table lists track counts (columns are the true genres, rows the predicted genres) and the second lists the corresponding percentages of each column total.

(a) SMMFCC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(a) SMMFCC3 (% of each column total)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(b) SMOSC3 (% of each column total)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(c) SMASE3 (% of each column total)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (% of each column total)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23
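The averaged classification accuracy can be reproduced from the counts in Table 3.6(d): it is the ratio of correctly classified tracks (the diagonal) to all 729 tracks, while the diagonal of each column-normalized matrix gives the per-genre recall. A minimal numpy sketch:

```python
import numpy as np

# Confusion matrix from Table 3.6(d) (SMMFCC3+SMOSC3+SMASE3):
# rows = predicted genre, columns = true genre.
classes = ["Classic", "Electronic", "Jazz", "MetalPunk", "PopRock", "World"]
cm = np.array([
    [300,  2,  0,  0,  0,  8],
    [  2, 95,  0,  2,  7,  9],
    [  1,  1, 20,  0,  0,  0],
    [  0,  0,  0, 35, 10,  1],
    [  1, 10,  3,  7, 79, 11],
    [ 16,  6,  3,  1,  6, 93],
])

per_class_recall = np.diag(cm) / cm.sum(axis=0)  # correct / column total
overall_ca = np.trace(cm) / cm.sum()             # total correct / total tracks

print({c: round(100 * r, 2) for c, r in zip(classes, per_class_recall)})
print(round(100 * overall_ca, 2))  # 85.32
```

The per-class recalls printed here match the diagonal of the percentage table in (d), e.g. 93.75% for Classic (300/320) and 76.23% for World (93/122).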

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs achieves better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

51

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSCs & MSVs and the modulation subband energy (MSE) for each feature value

Feature Set                  MSCs & MSVs    MSE
SMMFCC1                      77.50          72.02
SMMFCC2                      70.64          69.82
SMMFCC3                      80.38          79.15
SMOSC1                       79.15          77.50
SMOSC2                       68.59          70.51
SMOSC3                       81.34          80.11
SMASE1                       77.78          76.41
SMASE2                       71.74          71.06
SMASE3                       81.21          79.15
SMMFCC1+SMOSC1+SMASE1        84.64          85.08
SMMFCC2+SMOSC2+SMASE2        78.60          79.01
SMMFCC3+SMOSC3+SMASE3        85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

52

References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.

[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, 2003, pp. 8-11.

[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007 (2006) 1-12.

[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, 2005, pp. 197-200.

[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histogram in audio and symbolic music information retrieval, Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-668.

[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech and Language Processing 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.


Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1    (4)

where k is the frequency index.

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1    (5)

where B is the total number of filters (B = 25 in this study), and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

I_{b,l} = (f_{b,l} / f_s) N, \quad I_{b,h} = (f_{b,h} / f_s) N    (6)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.
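The index mapping of Eq. (6) can be sketched as follows; the sampling rate and FFT size here are illustrative assumptions (the thesis does not fix them at this point), and flooring to an integer bin is one common rounding choice:

```python
# Sketch of Eq. (6): map band-edge frequencies (Hz) to FFT bin indices.
fs = 22050   # sampling frequency (Hz), assumed for illustration
N = 1024     # FFT size, assumed for illustration

def band_indices(f_low, f_high):
    """Return (I_bl, I_bh), the FFT bin indices of one band-pass filter."""
    return int(f_low * N / fs), int(f_high * N / fs)

# e.g. filter 10 of Table 2.1 spans (1000, 1320] Hz
print(band_indices(1000, 1320))  # (46, 61)
```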

Step 6 Discrete cosine transform (DCT)

MFCC can be obtained by applying the DCT on the logarithm of E(b):

MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}(1 + E_i(b)) \cos\left(\frac{\pi l (b + 0.5)}{B}\right), \quad 0 \le l < L    (7)

where L is the length of the MFCC feature vector (L = 20 in this study).


Therefore, the MFCC feature vector can be represented as follows:

x_MFCC = [MFCC(0), MFCC(1), …, MFCC(L-1)]^T    (8)

Fig. 2.1 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)


Table 2.1 The range of each triangular band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 200]
1                (100, 300]
2                (200, 400]
3                (300, 500]
4                (400, 600]
5                (500, 700]
6                (600, 800]
7                (700, 900]
8                (800, 1000]
9                (900, 1149]
10               (1000, 1320]
11               (1149, 1516]
12               (1320, 1741]
13               (1516, 2000]
14               (1741, 2297]
15               (2000, 2639]
16               (2297, 3031]
17               (2639, 3482]
18               (3031, 4000]
19               (3482, 4595]
20               (4000, 5278]
21               (4595, 6063]
22               (5278, 6964]
23               (6063, 8000]
24               (6964, 9190]
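The per-frame pipeline of Eqs. (4)-(7) — FFT, Mel-scale band energies, then a DCT of the log-energies — can be sketched as follows. The frame content, sampling rate, FFT size, and the three-filter subset of Table 2.1 are illustrative assumptions (the thesis uses B = 25 filters and L = 20 coefficients):

```python
import numpy as np

fs, N = 22050, 1024            # assumed sampling rate and FFT size
frame = np.hanning(N)          # stand-in for one windowed frame s~_i[n]

X = np.fft.fft(frame, N)       # Eq. (4): DFT of the frame
A = np.abs(X) ** 2             # A_i[k] = |X_i[k]|^2

# Eq. (5): band energies over a small subset of the Table 2.1 filters
edges_hz = [(0, 200), (100, 300), (200, 400)]
E = np.array([A[int(lo * N / fs): int(hi * N / fs) + 1].sum()
              for lo, hi in edges_hz])

# Eq. (7): DCT of the log-energies
B, L = len(E), 3
mfcc = np.array([sum(np.log10(1 + E[b]) * np.cos(np.pi * l * (b + 0.5) / B)
                     for b in range(B)) for l in range(L)])
print(mfcc.shape)  # (3,)
```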

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, and spectral valleys to the non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys will reflect the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.


Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2 Octave Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1    (9)

where B is the number of subbands, and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

I_{b,l} = (f_{b,l} / f_s) N, \quad I_{b,h} = (f_{b,h} / f_s) N    (10)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Peak Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = \log\left(1 + \frac{1}{\alpha N_b} \sum_{i=1}^{\lfloor \alpha N_b \rfloor} M_{b,i}\right)    (11)

Valley(b) = \log\left(1 + \frac{1}{\alpha N_b} \sum_{i=1}^{\lfloor \alpha N_b \rfloor} M_{b,N_b-i+1}\right)    (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)    (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x_OSC = [Valley(0), …, Valley(B-1), SC(0), …, SC(B-1)]^T    (14)
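The peak/valley selection of Eqs. (11)-(13) can be sketched as follows; this is a minimal sketch that assumes the log(1 + ·), base-10 form and uses illustrative subband magnitudes:

```python
import numpy as np

def osc_peak_valley(mags, alpha=0.2):
    """Spectral peak, valley and contrast of one subband, per Eqs. (11)-(13)."""
    m = np.sort(mags)[::-1]                 # M_b,1 >= M_b,2 >= ... >= M_b,Nb
    k = max(1, int(alpha * len(m)))         # number of bins averaged
    peak = np.log10(1 + m[:k].mean())       # Eq. (11): mean of the k largest bins
    valley = np.log10(1 + m[-k:].mean())    # Eq. (12): mean of the k smallest bins
    return peak, valley, peak - valley      # Eq. (13): SC(b) = Peak(b) - Valley(b)

peak, valley, sc = osc_peak_valley(np.array([8.0, 1.0, 6.0, 0.5, 2.0]))
print(sc > 0)  # True: the peak always dominates the valley
```

Averaging a small neighborhood of bins (controlled by α) rather than taking the single extreme value makes the contrast estimate robust to isolated noisy bins.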

Fig. 2.2 The flowchart for computing OSC (Input Signal → Framing → FFT → Octave-scale filtering → Peak/Valley selection → Spectral contrast → OSC)


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number    Frequency interval (Hz)
0                [0, 0]
1                (0, 100]
2                (100, 200]
3                (200, 400]
4                (400, 800]
5                (800, 1600]
6                (1600, 3200]
7                (3200, 6400]
8                (6400, 12800]
9                (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = \begin{cases} \dfrac{1}{N E_w} |X(k)|^2, & k = 0, \; k = N/2 \\[4pt] \dfrac{2}{N E_w} |X(k)|^2, & 0 < k < N/2 \end{cases}    (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w-1} |w(n)|^2    (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B, \; 0 \le k \le N/2 - 1    (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves}, \quad -4 \le j \le 3    (18)

I_{b,l} and I_{b,h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_{b,l} = (f_{b,l} / f_s) N, \quad I_{b,h} = (f_{b,h} / f_s) N    (19)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B + 1    (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10 \log_{10}(ASE(b)), \quad 0 \le b \le B + 1    (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B + 1    (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1} (ASE_{dB}(b))^2}    (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B + 3. Thus, the NASE feature vector of an audio frame can be represented as follows:

x_NASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T    (24)
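The normalization chain of Eqs. (21)-(24) can be sketched as follows; the ASE values are illustrative placeholders:

```python
import numpy as np

# Sketch of Eqs. (21)-(24): ASE coefficients -> dB scale -> RMS-norm NASE.
ase = np.array([1e-3, 2e-2, 5e-2, 1e-1, 3e-2])   # illustrative ASE(b) values

ase_db = 10.0 * np.log10(ase)            # Eq. (21)
R = np.sqrt(np.sum(ase_db ** 2))         # Eq. (23): RMS-norm gain value
nase = ase_db / R                        # Eq. (22)
x_nase = np.concatenate(([R], nase))     # Eq. (24): [R, NASE(0), NASE(1), ...]

print(x_nase.shape)                            # (6,)
print(np.isclose(np.linalg.norm(nase), 1.0))   # True: the NASE part has unit norm
```

Because R is exactly the Euclidean norm of the dB-scale coefficients, the normalized part of the vector always has unit norm, so R alone carries the overall loudness information.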


Fig. 2.3 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband decomposition → Normalized audio spectral envelope → NASE)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (16 coefficients in logarithmically spaced bands between loEdge = 62.5 Hz and hiEdge = 16 kHz, plus one coefficient below loEdge and one above hiEdge)


Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 62]
1                (62, 88]
2                (88, 125]
3                (125, 176]
4                (176, 250]
5                (250, 353]
6                (353, 500]
7                (500, 707]
8                (707, 1000]
9                (1000, 1414]
10               (1414, 2000]
11               (2000, 2828]
12               (2828, 4000]
13               (4000, 5656]
14               (5656, 8000]
15               (8000, 11313]
16               (11313, 16000]
17               (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only the short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied on the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2 Modulation Spectrum Analysis


Let MFCC_i(l), 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W/2 + n}(l) \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le l < L    (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \; 0 \le l < L    (26)

where T is the total number of texture windows in the music track.
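The texture-window analysis of Eqs. (25)-(26) can be sketched as follows; the per-frame feature matrix is random for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W, L = 512, 20
feats = rng.standard_normal((2048, L))   # per-frame MFCCs (illustrative values)

hop = W // 2                             # 50% overlap between texture windows
starts = range(0, feats.shape[0] - W + 1, hop)
mods = [np.abs(np.fft.fft(feats[s:s + W], axis=0))   # |M_t(m, l)|, Eq. (25)
        for s in starts]
M_avg = np.mean(mods, axis=0)            # time-averaged spectrogram, Eq. (26)
print(M_avg.shape)  # (512, 20): modulation frequency x feature dimension
```

Taking the FFT along axis 0 treats each of the L feature trajectories independently, exactly as the per-feature-value FFT in Eq. (25).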

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (28)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)    (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
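The per-subband peak/valley extraction of Eqs. (27)-(29) can be sketched as follows, using the modulation frequency index ranges of Table 2.4; the averaged modulation spectrogram here is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
W, L, J = 512, 20, 8
M_avg = rng.random((W, L))   # averaged modulation spectrogram (illustrative)

# Modulation frequency index ranges [Phi_jl, Phi_jh) from Table 2.4
edges = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32),
         (32, 64), (64, 128), (128, 256)]

msp = np.array([M_avg[lo:hi].max(axis=0) for lo, hi in edges]).T  # Eq. (27)
msv = np.array([M_avg[lo:hi].min(axis=0) for lo, hi in edges]).T  # Eq. (28)
msc = msp - msv                          # Eq. (29): an L x J contrast matrix
print(msc.shape, msc.size + msv.size)    # (20, 8) 320
```

The combined MSC and MSV entries account for the 2×20×8 = 320 feature dimension stated above.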

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let OSC_i(d), 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W/2 + n}(d) \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D    (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D    (31)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)    (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let NASE_i(d), 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W/2 + n}(d) \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D    (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D    (36)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)    (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE (Music signal → Framing → NASE extraction → DFT of each feature trajectory M_t,d[m] within texture windows → Windowing average of the modulation spectra → Contrast/Valley determination → MASE)

Table 2.4 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0, 2)                              [0, 0.33)
1                [2, 4)                              [0.33, 0.66)
2                [4, 8)                              [0.66, 1.32)
3                [8, 16)                             [1.32, 2.64)
4                [16, 32)                            [2.64, 5.28)
5                [32, 64)                            [5.28, 10.56)
6                [64, 128)                           [10.56, 21.12)
7                [128, 256)                          [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2}    (41)

\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2}    (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), \mu_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), …, \mu_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), \mu_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T    (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), \mu_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), …, \mu_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), \mu_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L + 4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L + 4J. That is, the overall feature dimension of SMMFCC is 80 + 32 = 112.
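The row- and column-wise aggregation of Eqs. (40)-(50) can be sketched as follows. The MSC/MSV matrices are random stand-ins, and the ordering of the concatenated entries differs slightly from Eq. (44), but the content and dimensionality are the same:

```python
import numpy as np

rng = np.random.default_rng(2)
L, J = 20, 8
msc, msv = rng.random((L, J)), rng.random((L, J))   # L x J MSC/MSV matrices

def agg(mat, axis):
    """Mean and standard deviation along one axis, per Eqs. (40)-(43)."""
    return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])

f_row = np.concatenate([agg(msc, axis=1), agg(msv, axis=1)])  # 4L row-based values
f_col = np.concatenate([agg(msc, axis=0), agg(msv, axis=0)])  # 4J column-based values
f_smmfcc = np.concatenate([f_row, f_col])                     # Eq. (50)
print(f_smmfcc.size)  # 112 = 4*20 + 4*8
```

Reducing each J-long row and each L-long column to a mean/std pair is what compresses the 320 MSC and MSV entries down to the 112-dimensional SMMFCC vector.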

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2}    (52)

\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2}    (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), \mu_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), …, \mu_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), \mu_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2}    (57)

\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2}    (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), \mu_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), …, \mu_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), \mu_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D + 4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values de

the MSC and MSV matrices of MASE can be computed as foll

)(1)(1

0summinus

=minusrowMSC =

J

j

NASENASE djMSCJ

du (62)

( 2⎟⎟minus NAS

wMSCu (63) )))((1)(21

1

0 ⎠

⎞⎜⎜⎝

⎛= sum

minus

=minusminus

J

j

Ero

NASENASErowMSC ddjMSC

Jdσ

)(1)(1

0summinus

=minus =

J

j

NASENASErowMSV djMSV

Jdu (64)

))() 2⎟⎟minus

NASErowMSV du (65) ((1)(

211

0 ⎠

⎞⎜⎜⎝

⎛minus= sum

minus

=minus

J

j

NASENASErowMSV djMSV

Jdσ

Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

)1 Tminusminusminus minusminusminus DDuD NASErowMSV

NASErowMSVrow σ

(66)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

)]1( )1( )1( (

)0( )0( )0( )0([

minus

=

minus

minusminusminusminus

Du

uuNASEMSC

NASErowMSC

NASErowMSV

NASErowMSV

NASErowMSC

NASErowMSC

NASErow

σ

σσ Lf

)(1)(1

0summinus

=minuscolMSC =

D

d

NASENASE djMSCD

ju (67)

))( 2 ⎟⎠

minus minusNASE

colMSC ju (68) )((1)(211

0

⎞⎜⎝

⎛= sum

minus

=minus

D

d

NASENASEcolMSC ljMSC

Djσ

)(1)(1

0summinus

=minus =

D

d

NASENASEcolMSV djMSV

Dju (69)

))() 2 ⎟⎠

minus minusNASE

colMSV ju (70) ((1)(211

0

⎞⎜⎝

⎛= sum

minus

=minus

D

d

NASENASEcolMSV djMSV

Djσ

36

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

Tminusminusminus minusminusminus JJuJ NASEcolMSV

NASEcolMSV

NASEcolMSC σσ

(71)

If the row-based modulation spectral feature vector and column-based

modulation spectral feature vector are combined together a larger feature vector of

d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the

SC r M is

)]1( )1( )1( )1(

)0( )0( )0( )0([

minus

=

minus

minusminusminusminus

Ju

uuNASE

colMSC

NASEcolMSV

NASEcolMSV

NASEcolMSC

NASEcolMSC

NASEcol σσ Lf

size (4D+4J) can be obtained

f NASE= [( NASErowf )T ( NASE

colf )T]T (72)

In summary the row-base

column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the

row-based modulation spectral feature vector and column-based modulation spectral

feature vector will result in a feature vector of length 4L+4J That is the overall

feature dimension of SMASE is 76+32 = 108
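As a concrete illustration, the row- and column-based statistical aggregation can be sketched in NumPy. This is an illustrative reconstruction, not the author's code; `msc` and `msv` stand for any J×D MSC/MSV matrix pair (modulation subbands × feature dimensions):

```python
import numpy as np

def aggregate_modulation_matrix(m):
    """Row/column statistical aggregation of a J x D modulation spectral
    matrix: mean and standard deviation along each axis (Eqs. 62-71)."""
    m = np.asarray(m, dtype=float)
    # Row-based: statistics over the J modulation subbands (axis 0)
    # -> one mean and one std per feature dimension d (length D each).
    row_mean, row_std = m.mean(axis=0), m.std(axis=0)
    # Column-based: statistics over the D feature dimensions (axis 1)
    # -> one mean and one std per modulation subband j (length J each).
    col_mean, col_std = m.mean(axis=1), m.std(axis=1)
    return row_mean, row_std, col_mean, col_std

def build_feature_vector(msc, msv):
    """Interleave MSC and MSV statistics into the 4D-dimensional row-based
    and 4J-dimensional column-based vectors, then concatenate them."""
    rm_c, rs_c, cm_c, cs_c = aggregate_modulation_matrix(msc)
    rm_v, rs_v, cm_v, cs_v = aggregate_modulation_matrix(msv)
    f_row = np.ravel(np.column_stack([rm_c, rs_c, rm_v, rs_v]))  # size 4D
    f_col = np.ravel(np.column_stack([cm_c, cs_c, cm_v, cs_v]))  # size 4J
    return np.concatenate([f_row, f_col])                        # size 4D+4J
```

With J = 8 and D = 20 (the SMOSC case) this yields the 112-dimensional vector described above.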

Fig 28 The row-based modulation spectral feature values: for each feature dimension d, the mean μ_d^row and standard deviation σ_d^row of MSC(j, d) and MSV(j, d) are computed along the modulation-frequency (texture-window) axis j = 1, …, J.

Fig 29 The column-based modulation spectral feature values: for each modulation subband j, the mean μ_j^col and standard deviation σ_j^col of MSC(j, d) and MSV(j, d) are computed along the feature-dimension axis d = 1, …, D.

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n} (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)), 1 ≤ c ≤ C (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m) (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
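This min–max normalization can be sketched in NumPy as follows (an illustrative sketch; the function and variable names are ours, not the author's):

```python
import numpy as np

def fit_linear_normalizer(train_vectors):
    """Per-dimension min-max normalization (Eqs. 74-75).
    train_vectors: (num_signals, num_features) array holding the feature
    vectors of ALL training music signals, all genres pooled."""
    f_min = train_vectors.min(axis=0)   # f_min(m) over all c, j
    f_max = train_vectors.max(axis=0)   # f_max(m) over all c, j
    # Guard against a constant feature dimension (divide-by-zero),
    # a practical detail the closed-form equation leaves implicit.
    span = np.where(f_max > f_min, f_max - f_min, 1.0)

    def normalize(f):
        return (f - f_min) / span       # Eq. (74)

    return normalize
```

The normalizer fitted on the training set is reused unchanged for each test vector, as described later in the classification phase.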

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as:

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by:

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T (77)

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr((A^T S_W A)^{−1}(A^T S_B A)) (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues. Thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^(−1/2):

x_w = (ΦΛ^(−1/2))^T x (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^(−1/2))^T S_W (ΦΛ^(−1/2)) derived from all the whitened training vectors will become an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^(−1/2))^T S_B (ΦΛ^(−1/2)) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as:

A_WLDA = ΦΛ^(−1/2) Ψ (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by:

y = A_WLDA^T x (81)
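The whitening-plus-LDA procedure above can be sketched in NumPy as follows. This is an illustrative implementation under the assumption that S_W is nonsingular (with few training vectors, a regularization term would be needed); the function name is ours:

```python
import numpy as np

def whitened_lda(X, labels, h):
    """Whitened LDA transformation matrix (Eqs. 76-81).
    X: (N, H) training matrix, labels: length-N class labels,
    h: target dimension (at most C-1)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        Sw += d.T @ d                       # Eq. (76)
        m = (mc - mean_all)[:, None]
        Sb += len(Xc) * (m @ m.T)           # Eq. (77)
    # Whitening: Sw Phi = Phi Lambda; transform by Phi Lambda^{-1/2}.
    lam, Phi = np.linalg.eigh(Sw)
    W = Phi * lam ** -0.5                   # columns of Phi scaled by lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                     # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(lam_b)[::-1][:h]     # eigenvectors of the h largest eigenvalues
    return W @ Psi[:, order]                # A_WLDA = Phi Lambda^{-1/2} Psi, Eq. (80)
```

A feature vector x is then projected with `y = A.T @ x`, as in Eq. (81).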

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n} (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c) (83)

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{1≤c≤C} P_c · CA_c (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
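Equation (84) is simply a class-prior-weighted average of the per-class accuracies; a minimal sketch:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): CA = sum_c P_c * CA_c, where the prior P_c is the
    fraction of test tracks belonging to class c."""
    total = sum(class_counts)
    return sum((n / total) * ca
               for ca, n in zip(per_class_acc, class_counts))
```

With the ISMIR2004 test split this weights the six per-class accuracies by 320/729, 114/729, 26/729, 45/729, 102/729, and 122/729, which equals the plain fraction of correctly classified test tracks.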

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                      CA
SMMFCC1                          77.50%
SMOSC1                           79.15%
SMASE1                           77.78%
SMMFCC1+SMOSC1+SMASE1            84.64%

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each part, the upper matrix lists track counts (columns: actual genre; rows: classified genre) and the lower matrix lists the corresponding percentages of each column total.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       275        0         2       0         1       19
Electronic      0       91         0       1         7        6
Jazz            6        0        18       0         0        4
MetalPunk       2        3         0      36        20        4
PopRock         4       12         5       8        70       14
World          33        8         1       0         4       75
Total         320      114        26      45       102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94     0.00      7.69     0.00      0.98    15.57
Electronic    0.00    79.82      0.00     2.22      6.86     4.92
Jazz          1.88     0.00     69.23     0.00      0.00     3.28
MetalPunk     0.63     2.63      0.00    80.00     19.61     3.28
PopRock       1.25    10.53     19.23    17.78     68.63    11.48
World        10.31     7.02      3.85     0.00      3.92    61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       292        1         1       0         2       10
Electronic      1       89         1       2        11       11
Jazz            4        0        19       1         1        6
MetalPunk       0        5         0      32        21        3
PopRock         0       13         3      10        61        8
World          23        6         2       0         6       84
Total         320      114        26      45       102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25     0.88      3.85     0.00      1.96     8.20
Electronic    0.31    78.07      3.85     4.44     10.78     9.02
Jazz          1.25     0.00     73.08     2.22      0.98     4.92
MetalPunk     0.00     4.39      0.00    71.11     20.59     2.46
PopRock       0.00    11.40     11.54    22.22     59.80     6.56
World         7.19     5.26      7.69     0.00      5.88    68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       286        3         1       0         3       18
Electronic      0       87         1       1         9        5
Jazz            5        4        17       0         0        9
MetalPunk       0        4         1      36        18        4
PopRock         1       10         3       7        68       13
World          28        6         3       1         4       73
Total         320      114        26      45       102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38     2.63      3.85     0.00      2.94    14.75
Electronic    0.00    76.32      3.85     2.22      8.82     4.10
Jazz          1.56     3.51     65.38     0.00      0.00     7.38
MetalPunk     0.00     3.51      3.85    80.00     17.65     3.28
PopRock       0.31     8.77     11.54    15.56     66.67    10.66
World         8.75     5.26     11.54     2.22      3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        0         1       0         0        9
Electronic      0       96         1       1         9        9
Jazz            2        1        21       0         0        1
MetalPunk       0        1         0      34         8        1
PopRock         1        9         2       9        80       16
World          17        7         1       1         5       86
Total         320      114        26      45       102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      3.85     0.00      0.00     7.38
Electronic    0.00    84.21      3.85     2.22      8.82     7.38
Jazz          0.63     0.88     80.77     0.00      0.00     0.82
MetalPunk     0.00     0.88      0.00    75.56      7.84     0.82
PopRock       0.31     7.89      7.69    20.00     78.43    13.11
World         5.31     6.14      3.85     2.22      4.90    70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                      CA
SMMFCC2                          70.64%
SMOSC2                           68.59%
SMASE2                           71.74%
SMMFCC2+SMOSC2+SMASE2            78.60%

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each part, the upper matrix lists track counts (columns: actual genre; rows: classified genre) and the lower matrix lists the corresponding percentages of each column total.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       272        1         1       0         6       22
Electronic      0       84         0       2         8        4
Jazz           13        1        19       1         2       19
MetalPunk       2        7         0      39        30        4
PopRock         0       11         3       3        47       19
World          33       10         3       0         9       54
Total         320      114        26      45       102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.00     0.88      3.85     0.00      5.88    18.03
Electronic    0.00    73.68      0.00     4.44      7.84     3.28
Jazz          4.06     0.88     73.08     2.22      1.96    15.57
MetalPunk     0.63     6.14      0.00    86.67     29.41     3.28
PopRock       0.00     9.65     11.54     6.67     46.08    15.57
World        10.31     8.77     11.54     0.00      8.82    44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       262        2         0       0         3       33
Electronic      0       83         0       1         9        6
Jazz           17        1        20       0         6       20
MetalPunk       1        5         0      33        21        2
PopRock         0       17         4      10        51       10
World          40        6         2       1        12       51
Total         320      114        26      45       102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      81.88     1.75      0.00     0.00      2.94    27.05
Electronic    0.00    72.81      0.00     2.22      8.82     4.92
Jazz          5.31     0.88     76.92     0.00      5.88    16.39
MetalPunk     0.31     4.39      0.00    73.33     20.59     1.64
PopRock       0.00    14.91     15.38    22.22     50.00     8.20
World        12.50     5.26      7.69     2.22     11.76    41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       277        0         0       0         2       29
Electronic      0       83         0       1         5        2
Jazz            9        3        17       1         2       15
MetalPunk       1        5         1      35        24        7
PopRock         2       13         1       8        57       15
World          31       10         7       0        12       54
Total         320      114        26      45       102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      86.56     0.00      0.00     0.00      1.96    23.77
Electronic    0.00    72.81      0.00     2.22      4.90     1.64
Jazz          2.81     2.63     65.38     2.22      1.96    12.30
MetalPunk     0.31     4.39      3.85    77.78     23.53     5.74
PopRock       0.63    11.40      3.85    17.78     55.88    12.30
World         9.69     8.77     26.92     0.00     11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       289        5         0       0         3       18
Electronic      0       89         0       2         4        4
Jazz            2        3        19       0         1       10
MetalPunk       2        2         0      38        21        2
PopRock         0       12         5       4        61       11
World          27        3         2       1        12       77
Total         320      114        26      45       102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      90.31     4.39      0.00     0.00      2.94    14.75
Electronic    0.00    78.07      0.00     4.44      3.92     3.28
Jazz          0.63     2.63     73.08     0.00      0.98     8.20
MetalPunk     0.63     1.75      0.00    84.44     20.59     1.64
PopRock       0.00    10.53     19.23     8.89     59.80     9.02
World         8.44     2.63      7.69     2.22     11.76    63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA
SMMFCC3                          80.38%
SMOSC3                           81.34%
SMASE3                           81.21%
SMMFCC3+SMOSC3+SMASE3            85.32%

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each part, the upper matrix lists track counts (columns: actual genre; rows: classified genre) and the lower matrix lists the corresponding percentages of each column total.

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        2         1       0         3       19
Electronic      0       86         0       1         7        5
Jazz            2        0        18       0         0        3
MetalPunk       1        4         0      35        18        2
PopRock         1       16         4       8        67       13
World          16        6         3       1         7       80
Total         320      114        26      45       102      122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     1.75      3.85     0.00      2.94    15.57
Electronic    0.00    75.44      0.00     2.22      6.86     4.10
Jazz          0.63     0.00     69.23     0.00      0.00     2.46
MetalPunk     0.31     3.51      0.00    77.78     17.65     1.64
PopRock       0.31    14.04     15.38    17.78     65.69    10.66
World         5.00     5.26     11.54     2.22      6.86    65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        0         0       0         1       13
Electronic      0       90         1       2         9        6
Jazz            0        0        21       0         0        4
MetalPunk       0        2         0      31        21        2
PopRock         0       11         3      10        64       10
World          20       11         1       2         7       87
Total         320      114        26      45       102      122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      0.00     0.00      0.98    10.66
Electronic    0.00    78.95      3.85     4.44      8.82     4.92
Jazz          0.00     0.00     80.77     0.00      0.00     3.28
MetalPunk     0.00     1.75      0.00    68.89     20.59     1.64
PopRock       0.00     9.65     11.54    22.22     62.75     8.20
World         6.25     9.65      3.85     4.44      6.86    71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       296        2         1       0         0       17
Electronic      1       91         0       1         4        3
Jazz            0        2        19       0         0        5
MetalPunk       0        2         1      34        20        8
PopRock         2       13         4       8        71        8
World          21        4         1       2         7       81
Total         320      114        26      45       102      122

(c) SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      92.50     1.75      3.85     0.00      0.00    13.93
Electronic    0.31    79.82      0.00     2.22      3.92     2.46
Jazz          0.00     1.75     73.08     0.00      0.00     4.10
MetalPunk     0.00     1.75      3.85    75.56     19.61     6.56
PopRock       0.63    11.40     15.38    17.78     69.61     6.56
World         6.56     3.51      3.85     4.44      6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        2         0       0         0        8
Electronic      2       95         0       2         7        9
Jazz            1        1        20       0         0        0
MetalPunk       0        0         0      35        10        1
PopRock         1       10         3       7        79       11
World          16        6         3       1         6       93
Total         320      114        26      45       102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     1.75      0.00     0.00      0.00     6.56
Electronic    0.63    83.33      0.00     4.44      6.86     7.38
Jazz          0.31     0.88     76.92     0.00      0.00     0.00
MetalPunk     0.00     0.00      0.00    77.78      9.80     0.82
PopRock       0.31     8.77     11.54    15.56     77.45     9.02
World         5.00     5.26     11.54     2.22      5.88    76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy using MSCs & MSVs versus modulation subband energy (MSE) for each feature value

Feature Set                      MSCs & MSVs   MSE
SMMFCC1                          77.50%        72.02%
SMMFCC2                          70.64%        69.82%
SMMFCC3                          80.38%        79.15%
SMOSC1                           79.15%        77.50%
SMOSC2                           68.59%        70.51%
SMOSC3                           81.34%        80.11%
SMASE1                           77.78%        76.41%
SMASE2                           71.74%        71.06%
SMASE3                           81.21%        79.15%
SMMFCC1+SMOSC1+SMASE1            84.64%        85.08%
SMMFCC2+SMOSC2+SMASE2            78.60%        79.01%
SMMFCC3+SMOSC3+SMASE3            85.32%        85.19%

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West S Cox Features and classifiers for the automatic classification of

musical audio signals Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664

53

[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred A Lerch A hierarchical approach to automatic musical

genre classification in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li M Ogihara Music genre classification with taxonomy in Proc of

March 2005

[16] J J Aucouturier F Pachet Representing musical genre: a state of the art

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies M D Plumbley Beat tracking with a two state model in

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi P Cook Pitch histogram in audio and

symbolic music information retrieval in Proc IRCAM 2002

[21] T Tolonen M Karjalainen A computationally efficient multipitch analysis

model IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis L O'Mard A unitary model of pitch perception Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan S Greenberg Robust speech recognition using

the modulation spectrogram Speech Commun Vol 25 No 1 pp 117-132

1998

[25] S Sukittanon L E Atlas J W Pitton Modulation-scale analysis for

content identification IEEE Transactions on Signal Processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan R K Rao Audio signal feature extraction and

classification using local discriminant bases IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Y Freund R E Schapire A decision-theoretic generalization of

on-line learning and an application to boosting Journal of Computer and System

Sciences 55 (1) (1997) 119-139


Therefore the MFCC feature vector can be represented as follows

x_MFCC = [MFCC(0) MFCC(1) … MFCC(L−1)]^T (8)

Fig 21 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
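The last two stages of the pipeline, mel-scale filtering and the DCT, can be sketched as follows. This is an illustrative sketch, not the author's code; the triangular filterbank matrix is assumed to be precomputed from the band edges in Table 21:

```python
import numpy as np

def mfcc_from_power_spectrum(power, filterbank, L=20):
    """Mel filtering and DCT stages of the MFCC pipeline.
    power: power-spectrum bins of one frame (after FFT).
    filterbank: (num_filters, num_bins) matrix of triangular weights,
    assumed precomputed from the filter ranges in Table 21."""
    energies = filterbank @ power           # energy captured by each filter
    log_e = np.log(energies + 1e-12)        # small constant avoids log(0)
    B = len(log_e)
    # Type-II DCT: MFCC(l) = sum_b log_e[b] * cos(pi * l * (b + 0.5) / B)
    basis = np.cos(np.pi * np.arange(L)[:, None] * (np.arange(B) + 0.5) / B)
    return basis @ log_e                    # L cepstral coefficients
```

Because the DCT of a constant vector is zero except at l = 0, a spectrally flat frame yields energy only in MFCC(0), which is a quick sanity check on an implementation.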

Table 21 The range of each triangular band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 200]
1   (100, 300]
2   (200, 400]
3   (300, 500]
4   (400, 600]
5   (500, 700]
6   (600, 800]
7   (700, 900]
8   (800, 1000]
9   (900, 1149]
10  (1000, 1320]
11  (1149, 1516]
12  (1320, 1741]
13  (1516, 2000]
14  (1741, 2297]
15  (2000, 2639]
16  (2297, 3031]
17  (2639, 3482]
18  (3031, 4000]
19  (3482, 4595]
20  (4000, 5278]
21  (4595, 6063]
22  (5278, 6964]
23  (6063, 8000]
24  (6964, 9190]

212 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, while spectral valleys correspond to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig 22 shows the block diagram for extracting the OSC feature. The detailed steps are described below.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2 Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 22. The octave-scale filtering operation can be described as follows:

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k], 0 ≤ b < B, 0 ≤ k ≤ N/2 − 1 (9)

where B is the number of subbands, I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter, and A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as:

I_b^l = ⌊(f_b^l / f_s) N⌋, I_b^h = ⌊(f_b^h / f_s) N⌋ (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let (M_b,1, M_b,2, …, M_b,N_b) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_b,1 ≥ M_b,2 ≥ … ≥ M_b,N_b. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = log((1/(αN_b)) Σ_{i=1}^{αN_b} M_b,i) (11)

Valley(b) = log((1/(αN_b)) Σ_{i=1}^{αN_b} M_b,N_b−i+1) (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) − Valley(b) (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus the OSC feature vector of an audio frame can be represented as follows:

x_OSC = [Valley(0) … Valley(B−1) SC(0) … SC(B−1)]^T (14)
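The peak/valley computation for one subband can be sketched as follows (illustrative code, names ours; α = 0.2 as in the text, and a small constant guards against log(0), a practical detail the closed-form equations leave implicit):

```python
import numpy as np

def osc_subband(mag, alpha=0.2, eps=1e-12):
    """Spectral peak, valley, and contrast of one subband (Eqs. 11-13).
    mag: magnitude-spectrum bins (M_b,1 ... M_b,Nb) of the b-th subband."""
    m = np.sort(np.asarray(mag, dtype=float))[::-1]   # decreasing order
    n = max(1, int(round(alpha * len(m))))            # alpha * N_b neighborhood
    peak = np.log(m[:n].mean() + eps)                 # Eq. (11): log-mean of largest bins
    valley = np.log(m[-n:].mean() + eps)              # Eq. (12): log-mean of smallest bins
    return peak, valley, peak - valley                # Eq. (13): spectral contrast
```

Averaging over a neighborhood of the top and bottom αN_b bins, rather than taking single extrema, makes the estimate robust to isolated noisy bins.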

Fig 22 The flowchart for computing OSC (Input Signal → Framing → FFT → Octave-scale filtering → Peak/Valley Selection → Spectral Contrast → OSC)

Table 22 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number  Frequency interval (Hz)
0  [0, 0]
1  (0, 100]
2  (100, 200]
3  (200, 400]
4  (400, 800]
5  (800, 1600]
6  (1600, 3200]
7  (3200, 6400]
8  (6400, 12800]
9  (12800, 22050)

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig 23 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames. Each audio frame is multiplied by a Hamming window function and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = (1/(E_w·N))·|X(k)|^2, for k = 0 and k = N/2
P(k) = (2/(E_w·N))·|X(k)|^2, for 0 < k < N/2 (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|^2 (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig 24). The subband filtering operation can be described as follows (see Table 23):

ASE_i(b) = Σ_{k=I_b^l}^{I_b^h} P_i(k), 0 ≤ b < B, 0 ≤ k ≤ N/2 − 1 (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j octaves, −4 ≤ j ≤ 3 (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th band-pass filter, given as:

I_b^l = ⌊(f_b^l / f_s) N⌋, I_b^h = ⌊(f_b^h / f_s) N⌋ (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k), 0 ≤ b ≤ B + 1 (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 log_10(ASE(b)), 0 ≤ b ≤ B + 1 (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R, 0 ≤ b ≤ B + 1 (22)

where the RMS-norm gain value R is defined as:

R = [Σ_{b=0}^{B+1} (ASE_dB(b))^2]^(1/2) (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension of NASE is B + 3. Thus the NASE feature vector of an audio frame can be represented as follows:

x_NASE = [R NASE(0) NASE(1) … NASE(B+1)]^T (24)
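The final normalization steps (Eqs. 21-24) can be sketched as follows (an illustrative sketch; `ase` is assumed to hold the B+2 positive subband power sums from Eq. (20)):

```python
import numpy as np

def nase_vector(ase):
    """Decibel conversion and RMS normalization (Eqs. 21-24).
    ase: array of B+2 ASE coefficients (the below-loEdge band, the B
    inner bands, and the above-hiEdge band)."""
    ase_db = 10.0 * np.log10(np.asarray(ase, dtype=float))  # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                        # Eq. (23): RMS-norm gain
    nase = ase_db / R                                       # Eq. (22)
    return np.concatenate([[R], nase])                      # Eq. (24): [R, NASE(0..B+1)]
```

By construction, the NASE part of the returned vector has unit Euclidean norm; R carries the absolute level, so loudness and spectral shape are cleanly separated.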

Fig 23 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: 16 logarithmically spaced coefficients between loEdge (62.5 Hz) and hiEdge (16 kHz), plus one coefficient below loEdge and one coefficient above hiEdge

24

Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 62]
1   (62, 88]
2   (88, 125]
3   (125, 176]
4   (176, 250]
5   (250, 353]
6   (353, 500]
7   (500, 707]
8   (707, 1000]
9   (1000, 1414]
10  (1414, 2000]
11  (2000, 2828]
12  (2828, 4000]
13  (4000, 5656]
14  (5656, 8000]
15  (8000, 11313]
16  (11313, 16000]
17  (16000, 22050]

214 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only the short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig 25 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis


Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \sum_{n=0}^{W-1} \mathrm{MFCC}_{tW+n}[l]\, e^{-j2\pi mn/W}, \qquad 0 \le m < W,\ 0 \le l < L \qquad (25)$$

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} \left|M_t(m, l)\right|, \qquad 0 \le m < W,\ 0 \le l < L \qquad (26)$$

where T is the total number of texture windows in the music track.
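The texture-window analysis of Eqs. (25)–(26) can be sketched with NumPy as follows (a minimal sketch, not the thesis code; the function name and the exact hop handling are assumptions, with W = 512 and 50% overlap as stated above):

```python
import numpy as np

def modulation_spectrogram(features, w=512, hop=256):
    """Averaged magnitude modulation spectrogram of Eqs. (25)-(26).

    features : (n_frames, L) per-frame feature trajectories
               (MFCC here; the same routine serves OSC and NASE)
    w, hop   : texture-window length and hop (50% overlap)
    Returns a (w, L) array: the time average over all texture windows.
    """
    n_frames, _ = features.shape
    mags = []
    for start in range(0, n_frames - w + 1, hop):
        win = features[start:start + w]
        # Eq. (25): FFT along time, independently for each feature dimension.
        mags.append(np.abs(np.fft.fft(win, axis=0)))
    # Eq. (26): average the magnitude spectra of all texture windows.
    return np.mean(mags, axis=0)
```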

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

$$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \qquad (27)$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \qquad (28)$$

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution:

$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \qquad (29)$$

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
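The subband peak/valley search of Eqs. (27)–(29) amounts to a max/min over each modulation-frequency band; a minimal NumPy sketch (the function name and argument layout are assumptions):

```python
import numpy as np

def modulation_contrast(mod_spec, subband_edges):
    """MSP, MSV and MSC = MSP - MSV per modulation subband, Eqs. (27)-(29).

    mod_spec      : (W, L) averaged modulation spectrogram
    subband_edges : (lo, hi) modulation frequency index pairs per subband,
                    e.g. [(0, 2), (2, 4), ..., (128, 256)] as in Table 24
    Returns three (J, L) arrays.
    """
    msp = np.array([mod_spec[lo:hi].max(axis=0) for lo, hi in subband_edges])
    msv = np.array([mod_spec[lo:hi].min(axis=0) for lo, hi in subband_edges])
    return msp, msv, msp - msv
```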

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} \mathrm{OSC}_{tW+n}[d]\, e^{-j2\pi mn/W}, \qquad 0 \le m < W,\ 0 \le d < D \qquad (30)$$

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} \left|M_t(m, d)\right|, \qquad 0 \le m < W,\ 0 \le d < D \qquad (31)$$

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated


$$MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \qquad (32)$$

$$MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d) \qquad (33)$$

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution:

$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \qquad (34)$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} \mathrm{NASE}_{tW+n}[d]\, e^{-j2\pi mn/W}, \qquad 0 \le m < W,\ 0 \le d < D \qquad (35)$$

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} \left|M_t(m, d)\right|, \qquad 0 \le m < W,\ 0 \le d < D \qquad (36)$$

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands (see Table 24). In the study the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 24. For each feature value the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated

$$MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \qquad (37)$$

$$MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d) \qquad (38)$$

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution:

$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \qquad (39)$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.

Fig 27 The flowchart for extracting MASE: the music signal is framed, the NASE coefficients NASE_i[d] are extracted from each frame, a DFT is applied along each feature trajectory within each texture window to obtain M_t(m, d), the magnitude modulation spectra of all texture windows are averaged, and the contrast/valley determination yields the MSCs and MSVs

Table 24 Frequency interval of each modulation subband

Filter number  Modulation frequency index range  Modulation frequency interval (Hz)
0  [0, 2)      [0, 0.33)
1  [2, 4)      [0.33, 0.66)
2  [4, 8)      [0.66, 1.32)
3  [8, 16)     [1.32, 2.64)
4  [16, 32)    [2.64, 5.28)
5  [32, 64)    [5.28, 10.56)
6  [64, 128)   [10.56, 21.12)
7  [128, 256)  [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29).

To reduce the dimension of the feature space the mean and standard deviation along


each row (and each column) of the MSC and MSV matrices will be computed as the

feature values

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \qquad (40)$$

$$\sigma_{MSC\text{-}row}^{MFCC}(l) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l)\bigr)^2} \qquad (41)$$

$$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \qquad (42)$$

$$\sigma_{MSV\text{-}row}^{MFCC}(l) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l)\bigr)^2} \qquad (43)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$\mathbf{f}_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0),\ \sigma_{MSC\text{-}row}^{MFCC}(0),\ \mu_{MSV\text{-}row}^{MFCC}(0),\ \sigma_{MSV\text{-}row}^{MFCC}(0),\ \ldots,\ \mu_{MSC\text{-}row}^{MFCC}(L-1),\ \sigma_{MSC\text{-}row}^{MFCC}(L-1),\ \mu_{MSV\text{-}row}^{MFCC}(L-1),\ \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T \qquad (44)$$

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \qquad (45)$$

$$\sigma_{MSC\text{-}col}^{MFCC}(j) = \sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j)\bigr)^2} \qquad (46)$$

$$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \qquad (47)$$

$$\sigma_{MSV\text{-}col}^{MFCC}(j) = \sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j)\bigr)^2} \qquad (48)$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0),\ \sigma_{MSC\text{-}col}^{MFCC}(0),\ \mu_{MSV\text{-}col}^{MFCC}(0),\ \sigma_{MSV\text{-}col}^{MFCC}(0),\ \ldots,\ \mu_{MSC\text{-}col}^{MFCC}(J-1),\ \sigma_{MSC\text{-}col}^{MFCC}(J-1),\ \mu_{MSV\text{-}col}^{MFCC}(J-1),\ \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T \qquad (49)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

$$\mathbf{f}^{MFCC} = [(\mathbf{f}_{row}^{MFCC})^T\ (\mathbf{f}_{col}^{MFCC})^T]^T \qquad (50)$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
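The row/column aggregation of Eqs. (40)–(50) can be sketched as follows (a minimal NumPy sketch, not the thesis code; the function name is an assumption, and entries are grouped block-wise rather than interleaved per index as in Eqs. (44) and (49), so the ordering differs while the content is the same):

```python
import numpy as np

def aggregate(msc, msv):
    """Mean/std aggregation along rows and columns, per Eqs. (40)-(50).

    msc, msv : (J, L) modulation spectral contrast / valley matrices
    Returns a feature vector of length 4L + 4J.
    """
    # Row-based: statistics over the J modulation subbands (axis 0).
    row = np.concatenate([msc.mean(axis=0), msc.std(axis=0),
                          msv.mean(axis=0), msv.std(axis=0)])   # length 4L
    # Column-based: statistics over the L feature dimensions (axis 1).
    col = np.concatenate([msc.mean(axis=1), msc.std(axis=1),
                          msv.mean(axis=1), msv.std(axis=1)])   # length 4J
    return np.concatenate([row, col])
```

With L = 20 MFCCs and J = 8 modulation subbands, the result has 4×20 + 4×8 = 112 entries, matching the SMMFCC dimension above.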

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \qquad (51)$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d)\bigr)^2} \qquad (52)$$

$$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \qquad (53)$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d)\bigr)^2} \qquad (54)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0),\ \sigma_{MSC\text{-}row}^{OSC}(0),\ \mu_{MSV\text{-}row}^{OSC}(0),\ \sigma_{MSV\text{-}row}^{OSC}(0),\ \ldots,\ \mu_{MSC\text{-}row}^{OSC}(D-1),\ \sigma_{MSC\text{-}row}^{OSC}(D-1),\ \mu_{MSV\text{-}row}^{OSC}(D-1),\ \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T \qquad (55)$$

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \qquad (56)$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j)\bigr)^2} \qquad (57)$$

$$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \qquad (58)$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j)\bigr)^2} \qquad (59)$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0),\ \sigma_{MSC\text{-}col}^{OSC}(0),\ \mu_{MSV\text{-}col}^{OSC}(0),\ \sigma_{MSV\text{-}col}^{OSC}(0),\ \ldots,\ \mu_{MSC\text{-}col}^{OSC}(J-1),\ \sigma_{MSC\text{-}col}^{OSC}(J-1),\ \mu_{MSV\text{-}col}^{OSC}(J-1),\ \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T \qquad (60)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{OSC} = [(\mathbf{f}_{row}^{OSC})^T\ (\mathbf{f}_{col}^{OSC})^T]^T \qquad (61)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \qquad (62)$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d)\bigr)^2} \qquad (63)$$

$$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \qquad (64)$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d)\bigr)^2} \qquad (65)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0),\ \sigma_{MSC\text{-}row}^{NASE}(0),\ \mu_{MSV\text{-}row}^{NASE}(0),\ \sigma_{MSV\text{-}row}^{NASE}(0),\ \ldots,\ \mu_{MSC\text{-}row}^{NASE}(D-1),\ \sigma_{MSC\text{-}row}^{NASE}(D-1),\ \mu_{MSV\text{-}row}^{NASE}(D-1),\ \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T \qquad (66)$$

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \qquad (67)$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j)\bigr)^2} \qquad (68)$$

$$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \qquad (69)$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j)\bigr)^2} \qquad (70)$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = [\mu_{MSC\text{-}col}^{NASE}(0),\ \sigma_{MSC\text{-}col}^{NASE}(0),\ \mu_{MSV\text{-}col}^{NASE}(0),\ \sigma_{MSV\text{-}col}^{NASE}(0),\ \ldots,\ \mu_{MSC\text{-}col}^{NASE}(J-1),\ \sigma_{MSC\text{-}col}^{NASE}(J-1),\ \mu_{MSV\text{-}col}^{NASE}(J-1),\ \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T \qquad (71)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{NASE} = [(\mathbf{f}_{row}^{NASE})^T\ (\mathbf{f}_{col}^{NASE})^T]^T \qquad (72)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig 28 The row-based modulation spectral feature values: the mean and standard deviation are computed along each row of the MSC and MSV matrices (over the modulation frequency axis)

Fig 29 The column-based modulation spectral feature values: the mean and standard deviation are computed along each column of the MSC and MSV matrices (over the feature dimension axis)

216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

$$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{f}_{c,n} \qquad (73)$$

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector $\hat{\mathbf{f}}_c$:

$$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \qquad 1 \le c \le C \qquad (74)$$

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m) \qquad (75)$$

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
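The per-dimension linear normalization of Eqs. (74)–(75) can be sketched as follows (a minimal sketch; the function names are assumptions):

```python
import numpy as np

def fit_minmax(train_vectors):
    """Per-dimension minimum and maximum over all training vectors, Eq. (75)."""
    x = np.asarray(train_vectors, dtype=float)
    return x.min(axis=0), x.max(axis=0)

def normalize(v, f_min, f_max):
    """Linear normalization of a feature vector into [0, 1], Eq. (74)."""
    return (v - f_min) / (f_max - f_min)
```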

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let $\mathbf{S}_W$ and $\mathbf{S}_B$ denote the within-class scatter matrix and the between-class scatter matrix respectively. The within-class scatter matrix is defined as

$$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T \qquad (76)$$

where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T \qquad (77)$$

where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(\mathbf{A}) = \mathrm{tr}\bigl((\mathbf{A}^T\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^T\mathbf{S}_B\mathbf{A})\bigr) \qquad (78)$$

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $\mathbf{S}_W$ are calculated. Let $\mathbf{\Phi}$ denote the matrix whose columns are the orthonormal eigenvectors of $\mathbf{S}_W$, and $\mathbf{\Lambda}$ the diagonal matrix formed by the corresponding eigenvalues; thus $\mathbf{S}_W\mathbf{\Phi} = \mathbf{\Phi}\mathbf{\Lambda}$. Each training vector $\mathbf{x}$ is then whitening transformed by $\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}$:

$$\mathbf{x}_w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\,\mathbf{x} \qquad (79)$$

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_W^w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T \mathbf{S}_W (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix $\mathbf{I}$. Thus the whitened between-class scatter matrix $\mathbf{S}_B^w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T \mathbf{S}_B (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix $\mathbf{\Psi}$ can be determined by finding the eigenvectors of $\mathbf{S}_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix $\mathbf{\Psi}$. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$$\mathbf{A}_{WLDA} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\,\mathbf{\Psi} \qquad (80)$$

$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let $\mathbf{x}$ denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^T\,\mathbf{x} \qquad (81)$$
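The whitened LDA procedure of Eqs. (76)–(80) can be sketched in NumPy (a minimal sketch, not the thesis implementation; the function name is an assumption, and it presumes $\mathbf{S}_W$ is nonsingular):

```python
import numpy as np

def whitened_lda(x, labels, n_components):
    """Whitened LDA transformation matrix A_WLDA, per Eqs. (76)-(80).

    x : (N, H) training matrix, labels : (N,) class ids
    Returns an (H, n_components) matrix; y = A^T x reduces the dimension.
    """
    mean_all = x.mean(axis=0)
    h = x.shape[1]
    sw = np.zeros((h, h))
    sb = np.zeros((h, h))
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc)                           # Eq. (76)
        sb += len(xc) * np.outer(mc - mean_all, mc - mean_all)  # Eq. (77)
    # Whitening: S_W Phi = Phi Lambda, whitening transform Phi Lambda^{-1/2}.
    evals, phi = np.linalg.eigh(sw)
    white = phi @ np.diag(evals ** -0.5)
    # Eigenvectors of the whitened between-class scatter, largest first.
    evals_b, psi = np.linalg.eigh(white.T @ sb @ white)
    psi = psi[:, np.argsort(evals_b)[::-1][:n_components]]
    return white @ psi                                          # Eq. (80)
```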

23 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let $\mathbf{y}$ denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for

music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{y}_{c,n} \qquad (82)$$

where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to $\mathbf{y}$:

$$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \qquad (83)$$
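The nearest-centroid decision of Eqs. (82)–(83) can be sketched as follows (a minimal sketch; the function names are assumptions):

```python
import numpy as np

def fit_centroids(y, labels):
    """Per-genre centroid of the transformed training vectors, Eq. (82)."""
    classes = np.unique(labels)
    return classes, np.array([y[labels == c].mean(axis=0) for c in classes])

def classify(v, classes, centroids):
    """Return the genre whose centroid has minimum Euclidean distance, Eq. (83)."""
    return classes[np.argmin(np.linalg.norm(centroids - v, axis=1))]
```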

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \qquad (84)$$

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
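Eq. (84) is a prior-weighted average of the per-class accuracies; a minimal sketch (the function name is an assumption):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Weighted overall accuracy CA = sum_c P_c * CA_c, Eq. (84),
    where P_c is the fraction of test tracks belonging to class c."""
    total = sum(class_counts)
    return sum((n / total) * a for a, n in zip(per_class_acc, class_counts))
```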

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set  CA (%)
SMMFCC1  77.50
SMOSC1  79.15
SMASE1  77.78
SMMFCC1+SMOSC1+SMASE1  84.64


Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1

(a)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  275  0  2  0  1  19
Electronic  0  91  0  1  7  6
Jazz  6  0  18  0  0  4
MetalPunk  2  3  0  36  20  4
PopRock  4  12  5  8  70  14
World  33  8  1  0  4  75
Total  320  114  26  45  102  122

(a) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  85.94  0.00  7.69  0.00  0.98  15.57
Electronic  0.00  79.82  0.00  2.22  6.86  4.92
Jazz  1.88  0.00  69.23  0.00  0.00  3.28
MetalPunk  0.63  2.63  0.00  80.00  19.61  3.28
PopRock  1.25  10.53  19.23  17.78  68.63  11.48
World  10.31  7.02  3.85  0.00  3.92  61.48

(b)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  292  1  1  0  2  10
Electronic  1  89  1  2  11  11
Jazz  4  0  19  1  1  6
MetalPunk  0  5  0  32  21  3
PopRock  0  13  3  10  61  8
World  23  6  2  0  6  84
Total  320  114  26  45  102  122

(b) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  91.25  0.88  3.85  0.00  1.96  8.20
Electronic  0.31  78.07  3.85  4.44  10.78  9.02
Jazz  1.25  0.00  73.08  2.22  0.98  4.92
MetalPunk  0.00  4.39  0.00  71.11  20.59  2.46
PopRock  0.00  11.40  11.54  22.22  59.80  6.56
World  7.19  5.26  7.69  0.00  5.88  68.85


(c)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  286  3  1  0  3  18
Electronic  0  87  1  1  9  5
Jazz  5  4  17  0  0  9
MetalPunk  0  4  1  36  18  4
PopRock  1  10  3  7  68  13
World  28  6  3  1  4  73
Total  320  114  26  45  102  122

(c) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  89.38  2.63  3.85  0.00  2.94  14.75
Electronic  0.00  76.32  3.85  2.22  8.82  4.10
Jazz  1.56  3.51  65.38  0.00  0.00  7.38
MetalPunk  0.00  3.51  3.85  80.00  17.65  3.28
PopRock  0.31  8.77  11.54  15.56  66.67  10.66
World  8.75  5.26  11.54  2.22  3.92  59.84

(d)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  300  0  1  0  0  9
Electronic  0  96  1  1  9  9
Jazz  2  1  21  0  0  1
MetalPunk  0  1  0  34  8  1
PopRock  1  9  2  9  80  16
World  17  7  1  1  5  86
Total  320  114  26  45  102  122

(d) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  93.75  0.00  3.85  0.00  0.00  7.38
Electronic  0.00  84.21  3.85  2.22  8.82  7.38
Jazz  0.63  0.88  80.77  0.00  0.00  0.82
MetalPunk  0.00  0.88  0.00  75.56  7.84  0.82
PopRock  0.31  7.89  7.69  20.00  78.43  13.11
World  5.31  6.14  3.85  2.22  4.90  70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector again gets the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA, %) for column-based modulation spectral feature vectors

Feature Set  CA (%)
SMMFCC2  70.64
SMOSC2  68.59
SMASE2  71.74
SMMFCC2+SMOSC2+SMASE2  78.60

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  272  1  1  0  6  22
Electronic  0  84  0  2  8  4
Jazz  13  1  19  1  2  19
MetalPunk  2  7  0  39  30  4
PopRock  0  11  3  3  47  19
World  33  10  3  0  9  54
Total  320  114  26  45  102  122

(a) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  85.00  0.88  3.85  0.00  5.88  18.03
Electronic  0.00  73.68  0.00  4.44  7.84  3.28
Jazz  4.06  0.88  73.08  2.22  1.96  15.57
MetalPunk  0.63  6.14  0.00  86.67  29.41  3.28
PopRock  0.00  9.65  11.54  6.67  46.08  15.57
World  10.31  8.77  11.54  0.00  8.82  44.26

(b)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  262  2  0  0  3  33
Electronic  0  83  0  1  9  6
Jazz  17  1  20  0  6  20
MetalPunk  1  5  0  33  21  2
PopRock  0  17  4  10  51  10
World  40  6  2  1  12  51
Total  320  114  26  45  102  122

(b) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  81.88  1.75  0.00  0.00  2.94  27.05
Electronic  0.00  72.81  0.00  2.22  8.82  4.92
Jazz  5.31  0.88  76.92  0.00  5.88  16.39
MetalPunk  0.31  4.39  0.00  73.33  20.59  1.64
PopRock  0.00  14.91  15.38  22.22  50.00  8.20
World  12.50  5.26  7.69  2.22  11.76  41.80

(c)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  277  0  0  0  2  29
Electronic  0  83  0  1  5  2
Jazz  9  3  17  1  2  15
MetalPunk  1  5  1  35  24  7
PopRock  2  13  1  8  57  15
World  31  10  7  0  12  54
Total  320  114  26  45  102  122

(c) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  86.56  0.00  0.00  0.00  1.96  23.77
Electronic  0.00  72.81  0.00  2.22  4.90  1.64
Jazz  2.81  2.63  65.38  2.22  1.96  12.30
MetalPunk  0.31  4.39  3.85  77.78  23.53  5.74
PopRock  0.63  11.40  3.85  17.78  55.88  12.30
World  9.69  8.77  26.92  0.00  11.76  44.26

(d)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  289  5  0  0  3  18
Electronic  0  89  0  2  4  4
Jazz  2  3  19  0  1  10
MetalPunk  2  2  0  38  21  2
PopRock  0  12  5  4  61  11
World  27  3  2  1  12  77
Total  320  114  26  45  102  122

(d) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  90.31  4.39  0.00  0.00  2.94  14.75
Electronic  0.00  78.07  0.00  4.44  3.92  3.28
Jazz  0.63  2.63  73.08  0.00  0.98  8.20
MetalPunk  0.63  1.75  0.00  84.44  20.59  1.64
PopRock  0.00  10.53  19.23  8.89  59.80  9.02
World  8.44  2.63  7.69  2.22  11.76  63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set  CA (%)
SMMFCC3  80.38
SMOSC3  81.34
SMASE3  81.21
SMMFCC3+SMOSC3+SMASE3  85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  300  2  1  0  3  19
Electronic  0  86  0  1  7  5
Jazz  2  0  18  0  0  3
MetalPunk  1  4  0  35  18  2
PopRock  1  16  4  8  67  13
World  16  6  3  1  7  80
Total  320  114  26  45  102  122

(a) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  93.75  1.75  3.85  0.00  2.94  15.57
Electronic  0.00  75.44  0.00  2.22  6.86  4.10
Jazz  0.63  0.00  69.23  0.00  0.00  2.46
MetalPunk  0.31  3.51  0.00  77.78  17.65  1.64
PopRock  0.31  14.04  15.38  17.78  65.69  10.66
World  5.00  5.26  11.54  2.22  6.86  65.57


(b)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  300  0  0  0  1  13
Electronic  0  90  1  2  9  6
Jazz  0  0  21  0  0  4
MetalPunk  0  2  0  31  21  2
PopRock  0  11  3  10  64  10
World  20  11  1  2  7  87
Total  320  114  26  45  102  122

(b) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  93.75  0.00  0.00  0.00  0.98  10.66
Electronic  0.00  78.95  3.85  4.44  8.82  4.92
Jazz  0.00  0.00  80.77  0.00  0.00  3.28
MetalPunk  0.00  1.75  0.00  68.89  20.59  1.64
PopRock  0.00  9.65  11.54  22.22  62.75  8.20
World  6.25  9.65  3.85  4.44  6.86  71.31

(c)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  296  2  1  0  0  17
Electronic  1  91  0  1  4  3
Jazz  0  2  19  0  0  5
MetalPunk  0  2  1  34  20  8
PopRock  2  13  4  8  71  8
World  21  4  1  2  7  81
Total  320  114  26  45  102  122

(c) (%)  Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic  92.50  1.75  3.85  0.00  0.00  13.93
Electronic  0.31  79.82  0.00  2.22  3.92  2.46
Jazz  0.00  1.75  73.08  0.00  0.00  4.10
MetalPunk  0.00  1.75  3.85  75.56  19.61  6.56
PopRock  0.63  11.40  15.38  17.78  69.61  6.56
World  6.56  3.51  3.85  4.44  6.86  66.39

50

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9

Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11

World 16 6 3 1 6 93 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738

Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902

World 500 526 1154 222 588 7623

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 compares the classification results of these two approaches. It shows that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as feature values

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                              77.50     72.02
SMMFCC2                              70.64     69.82
SMMFCC3                              80.38     79.15
SMOSC1                               79.15     77.50
SMOSC2                               68.59     70.51
SMOSC3                               81.34     80.11
SMASE1                               77.78     76.41
SMASE2                               71.74     71.06
SMASE3                               81.21     79.15
SMMFCC1+SMOSC1+SMASE1                84.64     85.08
SMMFCC2+SMOSC2+SMASE2                78.60     79.01
SMMFCC3+SMOSC3+SMASE3                85.32     85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR 2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR 2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech, and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "'The way it sounds': timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representations, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performances using low-level audio features, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.


Table 2.1 The range of each triangular band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 200]
1                (100, 300]
2                (200, 400]
3                (300, 500]
4                (400, 600]
5                (500, 700]
6                (600, 800]
7                (700, 900]
8                (800, 1000]
9                (900, 1149]
10               (1000, 1320]
11               (1149, 1516]
12               (1320, 1741]
13               (1516, 2000]
14               (1741, 2297]
15               (2000, 2639]
16               (2297, 3031]
17               (2639, 3482]
18               (3031, 4000]
19               (3482, 4595]
20               (4000, 5278]
21               (4595, 6063]
22               (5278, 6964]
23               (6063, 8000]
24               (6964, 9190]

2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components, whereas spectral valleys correspond to non-harmonic components or noise in music signals. Therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N-1 \qquad (9)$$

where B is the number of subbands, and $I_{b_l}$ and $I_{b_h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b_l}$ and $I_{b_h}$ are given as

$$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s/N} \qquad (10)$$

where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Peak/Valley Selection

Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$$\mathrm{Peak}(b) = \log\Bigl(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\Bigr) \qquad (11)$$

$$\mathrm{Valley}(b) = \log\Bigl(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\Bigr) \qquad (12)$$

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$$\mathrm{SC}(b) = \mathrm{Peak}(b) - \mathrm{Valley}(b) \qquad (13)$$

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$$\mathbf{x}_{\mathrm{OSC}} = [\mathrm{Valley}(0), \ldots, \mathrm{Valley}(B-1), \mathrm{SC}(0), \ldots, \mathrm{SC}(B-1)]^T \qquad (14)$$

[Fig. 2.2 The flowchart for computing OSC: Input Signal → Framing → FFT → Octave-Scale Filtering → Peak/Valley Selection → Spectral Contrast → OSC]

Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number    Frequency interval (Hz)
0                [0, 0]
1                (0, 100]
2                (100, 200]
3                (200, 400]
4                (400, 800]
5                (800, 1600]
6                (1600, 3200]
7                (3200, 6400]
8                (6400, 12800]
9                (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame. Each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames. Each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the FFT size. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$$P(k) = \begin{cases} \dfrac{1}{E_w N}\,|X(k)|^2, & k = 0 \\[6pt] \dfrac{2}{E_w N}\,|X(k)|^2, & 0 < k < \dfrac{N}{2} \end{cases} \qquad (15)$$

where $E_w$ is the energy of the Hamming window function w(n) of size $N_w$:

$$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2 \qquad (16)$$

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4 and Table 2.3). The NASE scale filtering operation can be described as follows:

$$\mathrm{ASE}_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B,\ 0 \le k \le N-1 \qquad (17)$$

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, with r the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$$r = 2^j \text{ octaves}, \quad -4 \le j \le 3 \qquad (18)$$

$I_{b_l}$ and $I_{b_h}$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$$I_{b_l} = \frac{f_{b_l}}{f_s/N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s/N} \qquad (19)$$

where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$$\mathrm{ASE}(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1 \qquad (20)$$

Each ASE coefficient is then converted to the decibel scale:

$$\mathrm{ASE}_{dB}(b) = 10 \log_{10}(\mathrm{ASE}(b)), \quad 0 \le b \le B+1 \qquad (21)$$

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$$\mathrm{NASE}(b) = \frac{\mathrm{ASE}_{dB}(b)}{R}, \quad 0 \le b \le B+1 \qquad (22)$$

where the RMS-norm gain value R is defined as

$$R = \sqrt{\sum_{b=0}^{B+1} \bigl(\mathrm{ASE}_{dB}(b)\bigr)^2} \qquad (23)$$

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame can be represented as follows:

$$\mathbf{x}_{\mathrm{NASE}} = [R, \mathrm{NASE}(0), \mathrm{NASE}(1), \ldots, \mathrm{NASE}(B+1)]^T \qquad (24)$$

[Fig. 2.3 The flowchart for computing NASE: Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE]

[Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: one coefficient below loEdge (62.5 Hz), 16 coefficients between loEdge and hiEdge (16 kHz) with band edges at 62.5, 88.4, 125, 176.8, 250, 353.6, 500, 707.1, 1000, 1414.2, 2000, 2828.4, 4000, 5656.9, 8000, 11313.7, and 16000 Hz, and one coefficient above hiEdge]

Table 2.3 The range of each NASE band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 62.5]
1                (62.5, 88.4]
2                (88.4, 125]
3                (125, 176.8]
4                (176.8, 250]
5                (250, 353.6]
6                (353.6, 500]
7                (500, 707.1]
8                (707.1, 1000]
9                (1000, 1414.2]
10               (1414.2, 2000]
11               (2000, 2828.4]
12               (2828.4, 4000]
13               (4000, 5656.9]
14               (5656.9, 8000]
15               (8000, 11313.7]
16               (11313.7, 16000]
17               (16000, 22050]
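The NASE computation of Eqs. (15)-(24) can be sketched as follows in NumPy. The function name, frame length, the epsilon inside the logarithm, and the handling of empty subbands are our own choices for illustration; the band edges follow the half-octave spacing of Table 2.3.

```python
import numpy as np

def nase_frame(frame, sample_rate=44100, lo_edge=62.5, r=0.5):
    """NASE vector [R, NASE(0), ..., NASE(B+1)] of one frame (Eq. 15-24)."""
    n = frame.size
    window = np.hamming(n)
    e_w = np.sum(window ** 2)                      # Eq. (16)
    power = np.abs(np.fft.rfft(frame * window)) ** 2 / (e_w * n)
    power[1:-1] *= 2.0                             # Eq. (15): one-sided spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)

    b_count = int(8 / r)                           # B = 16 for r = 1/2
    edges = [0.0, lo_edge] + [lo_edge * 2 ** (r * (b + 1)) for b in range(b_count)]
    edges.append(sample_rate / 2.0)                # everything above hiEdge
    ase = np.array([power[(freqs > lo) & (freqs <= hi)].sum()
                    for lo, hi in zip(edges[:-1], edges[1:])])
    ase_db = 10.0 * np.log10(ase + 1e-12)          # Eq. (21), eps guards log(0)
    gain = np.sqrt(np.sum(ase_db ** 2))            # Eq. (23), RMS-norm gain R
    return np.concatenate([[gain], ase_db / gain]) # Eq. (22), (24)
```

The returned vector has B + 3 = 19 entries (the gain R plus 18 normalized subband coefficients), matching the NASE dimension used later.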

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we apply modulation spectral analysis to the MFCC, OSC, and NASE trajectories to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $\mathrm{MFCC}_i[l]$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \sum_{n=0}^{W-1} \mathrm{MFCC}_{t \times W + n}[l]\, e^{-j 2\pi n m / W}, \quad 0 \le m < W,\ 0 \le l < L \qquad (25)$$

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{\mathrm{MFCC}}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\ 0 \le l < L \qquad (26)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

$$\mathrm{MSP}^{\mathrm{MFCC}}(j, l) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{\mathrm{MFCC}}(m, l) \qquad (27)$$

$$\mathrm{MSV}^{\mathrm{MFCC}}(j, l) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{\mathrm{MFCC}}(m, l) \qquad (28)$$

where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$\mathrm{MSC}^{\mathrm{MFCC}}(j, l) = \mathrm{MSP}^{\mathrm{MFCC}}(j, l) - \mathrm{MSV}^{\mathrm{MFCC}}(j, l) \qquad (29)$$

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.

Fig. 2.5 The flowchart for extracting MMFCC
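The three steps above apply identically to MFCC, OSC, and NASE trajectories. A minimal NumPy sketch (function and variable names are ours; the MSC/MSV matrices are stored here as J×L arrays) is:

```python
import numpy as np

def modulation_contrast(features, w=512, n_subbands=8):
    """MSC/MSV matrices from an (n_frames, L) feature trajectory (Eq. 25-29)."""
    n_frames, n_feat = features.shape
    hop = w // 2                                        # 50% texture-window overlap
    spectra = []
    for start in range(0, n_frames - w + 1, hop):
        segment = features[start:start + w]             # one texture window
        spectra.append(np.abs(np.fft.fft(segment, axis=0)))
    mean_spec = np.mean(spectra, axis=0)                # Eq. (26), shape (w, L)

    # Logarithmically spaced modulation subbands: [0,2), [2,4), ..., [128,256)
    edges = [0] + [2 ** (j + 1) for j in range(n_subbands)]
    msc = np.empty((n_subbands, n_feat))
    msv = np.empty((n_subbands, n_feat))
    for j in range(n_subbands):
        band = mean_spec[edges[j]:edges[j + 1]]
        msp = band.max(axis=0)                          # Eq. (27)
        msv[j] = band.min(axis=0)                       # Eq. (28)
        msc[j] = msp - msv[j]                           # Eq. (29)
    return msc, msv
```

Feeding this function per-frame MFCC, OSC, or NASE matrices yields the MMFCC, MOSC, and MASE contrast/valley matrices described in this section and the next two.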

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $\mathrm{OSC}_i[d]$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} \mathrm{OSC}_{t \times W + n}[d]\, e^{-j 2\pi n m / W}, \quad 0 \le m < W,\ 0 \le d < D \qquad (30)$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{\mathrm{OSC}}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D \qquad (31)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

$$\mathrm{MSP}^{\mathrm{OSC}}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{\mathrm{OSC}}(m, d) \qquad (32)$$

$$\mathrm{MSV}^{\mathrm{OSC}}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{\mathrm{OSC}}(m, d) \qquad (33)$$

where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$\mathrm{MSC}^{\mathrm{OSC}}(j, d) = \mathrm{MSP}^{\mathrm{OSC}}(j, d) - \mathrm{MSV}^{\mathrm{OSC}}(j, d) \qquad (34)$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $\mathrm{NASE}_i[d]$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} \mathrm{NASE}_{t \times W + n}[d]\, e^{-j 2\pi n m / W}, \quad 0 \le m < W,\ 0 \le d < D \qquad (35)$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{\mathrm{NASE}}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D \qquad (36)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study; see Table 2.4). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

$$\mathrm{MSP}^{\mathrm{NASE}}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{\mathrm{NASE}}(m, d) \qquad (37)$$

$$\mathrm{MSV}^{\mathrm{NASE}}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{\mathrm{NASE}}(m, d) \qquad (38)$$

where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$\mathrm{MSC}^{\mathrm{NASE}}(j, d) = \mathrm{MSP}^{\mathrm{NASE}}(j, d) - \mathrm{MSV}^{\mathrm{NASE}}(j, d) \qquad (39)$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

[Fig. 2.7 The flowchart for extracting MASE: the music signal s_1[n], ..., s_I[n] is framed, NASE_i[d] is extracted from each frame, a DFT is applied along each feature trajectory within overlapped texture windows to give M_{t,d}[m], the magnitude modulation spectrograms are windowed and averaged, and contrast/valley determination yields the MASE features]

Table 2.4 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0, 2)                              [0, 0.33)
1                [2, 4)                              [0.33, 0.66)
2                [4, 8)                              [0.66, 1.32)
3                [8, 16)                             [1.32, 2.64)
4                [16, 32)                            [2.64, 5.28)
5                [32, 64)                            [5.28, 10.56)
6                [64, 128)                           [10.56, 21.12)
7                [128, 256)                          [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$u_{\mathrm{MSC-row}}^{\mathrm{MFCC}}(l) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{\mathrm{MFCC}}(j, l) \qquad (40)$$

$$\sigma_{\mathrm{MSC-row}}^{\mathrm{MFCC}}(l) = \Bigl(\frac{1}{J} \sum_{j=0}^{J-1} \bigl(\mathrm{MSC}^{\mathrm{MFCC}}(j, l) - u_{\mathrm{MSC-row}}^{\mathrm{MFCC}}(l)\bigr)^2\Bigr)^{1/2} \qquad (41)$$

$$u_{\mathrm{MSV-row}}^{\mathrm{MFCC}}(l) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{\mathrm{MFCC}}(j, l) \qquad (42)$$

$$\sigma_{\mathrm{MSV-row}}^{\mathrm{MFCC}}(l) = \Bigl(\frac{1}{J} \sum_{j=0}^{J-1} \bigl(\mathrm{MSV}^{\mathrm{MFCC}}(j, l) - u_{\mathrm{MSV-row}}^{\mathrm{MFCC}}(l)\bigr)^2\Bigr)^{1/2} \qquad (43)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$\mathbf{f}_{\mathrm{row}}^{\mathrm{MFCC}} = [u_{\mathrm{MSC-row}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSC-row}}^{\mathrm{MFCC}}(0), u_{\mathrm{MSV-row}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSV-row}}^{\mathrm{MFCC}}(0), \ldots, u_{\mathrm{MSC-row}}^{\mathrm{MFCC}}(L-1), \sigma_{\mathrm{MSC-row}}^{\mathrm{MFCC}}(L-1), u_{\mathrm{MSV-row}}^{\mathrm{MFCC}}(L-1), \sigma_{\mathrm{MSV-row}}^{\mathrm{MFCC}}(L-1)]^T \qquad (44)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{\mathrm{MSC-col}}^{\mathrm{MFCC}}(j) = \frac{1}{L} \sum_{l=0}^{L-1} \mathrm{MSC}^{\mathrm{MFCC}}(j, l) \qquad (45)$$

$$\sigma_{\mathrm{MSC-col}}^{\mathrm{MFCC}}(j) = \Bigl(\frac{1}{L} \sum_{l=0}^{L-1} \bigl(\mathrm{MSC}^{\mathrm{MFCC}}(j, l) - u_{\mathrm{MSC-col}}^{\mathrm{MFCC}}(j)\bigr)^2\Bigr)^{1/2} \qquad (46)$$

$$u_{\mathrm{MSV-col}}^{\mathrm{MFCC}}(j) = \frac{1}{L} \sum_{l=0}^{L-1} \mathrm{MSV}^{\mathrm{MFCC}}(j, l) \qquad (47)$$

$$\sigma_{\mathrm{MSV-col}}^{\mathrm{MFCC}}(j) = \Bigl(\frac{1}{L} \sum_{l=0}^{L-1} \bigl(\mathrm{MSV}^{\mathrm{MFCC}}(j, l) - u_{\mathrm{MSV-col}}^{\mathrm{MFCC}}(j)\bigr)^2\Bigr)^{1/2} \qquad (48)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{\mathrm{col}}^{\mathrm{MFCC}} = [u_{\mathrm{MSC-col}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSC-col}}^{\mathrm{MFCC}}(0), u_{\mathrm{MSV-col}}^{\mathrm{MFCC}}(0), \sigma_{\mathrm{MSV-col}}^{\mathrm{MFCC}}(0), \ldots, u_{\mathrm{MSC-col}}^{\mathrm{MFCC}}(J-1), \sigma_{\mathrm{MSC-col}}^{\mathrm{MFCC}}(J-1), u_{\mathrm{MSV-col}}^{\mathrm{MFCC}}(J-1), \sigma_{\mathrm{MSV-col}}^{\mathrm{MFCC}}(J-1)]^T \qquad (49)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4L+4J) is obtained:

$$\mathbf{f}^{\mathrm{MFCC}} = [(\mathbf{f}_{\mathrm{row}}^{\mathrm{MFCC}})^T, (\mathbf{f}_{\mathrm{col}}^{\mathrm{MFCC}})^T]^T \qquad (50)$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
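A sketch of this aggregation in NumPy, assuming the MSC/MSV matrices are stored as J×L arrays (the entry ordering below differs from Eq. (44) — statistics are grouped block-wise rather than interleaved per feature — but it carries the same 4L + 4J values):

```python
import numpy as np

def aggregate_modulation(msc, msv):
    """Row/column means and standard deviations of J x L MSC/MSV matrices."""
    # Row-based: statistics across modulation subbands, one per feature dim.
    row = np.concatenate([msc.mean(axis=0), msc.std(axis=0),   # Eq. (40), (41)
                          msv.mean(axis=0), msv.std(axis=0)])  # Eq. (42), (43)
    # Column-based: statistics across feature dims, one per modulation subband.
    col = np.concatenate([msc.mean(axis=1), msc.std(axis=1),   # Eq. (45), (46)
                          msv.mean(axis=1), msv.std(axis=1)])  # Eq. (47), (48)
    return np.concatenate([row, col])                          # Eq. (50)
```

For J = 8 and L = 20 (MFCC), the result has 4×20 + 4×8 = 112 entries, matching the SMMFCC dimension.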

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$u_{\mathrm{MSC-row}}^{\mathrm{OSC}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{\mathrm{OSC}}(j, d) \qquad (51)$$

$$\sigma_{\mathrm{MSC-row}}^{\mathrm{OSC}}(d) = \Bigl(\frac{1}{J} \sum_{j=0}^{J-1} \bigl(\mathrm{MSC}^{\mathrm{OSC}}(j, d) - u_{\mathrm{MSC-row}}^{\mathrm{OSC}}(d)\bigr)^2\Bigr)^{1/2} \qquad (52)$$

$$u_{\mathrm{MSV-row}}^{\mathrm{OSC}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{\mathrm{OSC}}(j, d) \qquad (53)$$

$$\sigma_{\mathrm{MSV-row}}^{\mathrm{OSC}}(d) = \Bigl(\frac{1}{J} \sum_{j=0}^{J-1} \bigl(\mathrm{MSV}^{\mathrm{OSC}}(j, d) - u_{\mathrm{MSV-row}}^{\mathrm{OSC}}(d)\bigr)^2\Bigr)^{1/2} \qquad (54)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{\mathrm{row}}^{\mathrm{OSC}} = [u_{\mathrm{MSC-row}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSC-row}}^{\mathrm{OSC}}(0), u_{\mathrm{MSV-row}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSV-row}}^{\mathrm{OSC}}(0), \ldots, u_{\mathrm{MSC-row}}^{\mathrm{OSC}}(D-1), \sigma_{\mathrm{MSC-row}}^{\mathrm{OSC}}(D-1), u_{\mathrm{MSV-row}}^{\mathrm{OSC}}(D-1), \sigma_{\mathrm{MSV-row}}^{\mathrm{OSC}}(D-1)]^T \qquad (55)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{\mathrm{MSC-col}}^{\mathrm{OSC}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSC}^{\mathrm{OSC}}(j, d) \qquad (56)$$

$$\sigma_{\mathrm{MSC-col}}^{\mathrm{OSC}}(j) = \Bigl(\frac{1}{D} \sum_{d=0}^{D-1} \bigl(\mathrm{MSC}^{\mathrm{OSC}}(j, d) - u_{\mathrm{MSC-col}}^{\mathrm{OSC}}(j)\bigr)^2\Bigr)^{1/2} \qquad (57)$$

$$u_{\mathrm{MSV-col}}^{\mathrm{OSC}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSV}^{\mathrm{OSC}}(j, d) \qquad (58)$$

$$\sigma_{\mathrm{MSV-col}}^{\mathrm{OSC}}(j) = \Bigl(\frac{1}{D} \sum_{d=0}^{D-1} \bigl(\mathrm{MSV}^{\mathrm{OSC}}(j, d) - u_{\mathrm{MSV-col}}^{\mathrm{OSC}}(j)\bigr)^2\Bigr)^{1/2} \qquad (59)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{\mathrm{col}}^{\mathrm{OSC}} = [u_{\mathrm{MSC-col}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSC-col}}^{\mathrm{OSC}}(0), u_{\mathrm{MSV-col}}^{\mathrm{OSC}}(0), \sigma_{\mathrm{MSV-col}}^{\mathrm{OSC}}(0), \ldots, u_{\mathrm{MSC-col}}^{\mathrm{OSC}}(J-1), \sigma_{\mathrm{MSC-col}}^{\mathrm{OSC}}(J-1), u_{\mathrm{MSV-col}}^{\mathrm{OSC}}(J-1), \sigma_{\mathrm{MSV-col}}^{\mathrm{OSC}}(J-1)]^T \qquad (60)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D+4J) is obtained:

$$\mathbf{f}^{\mathrm{OSC}} = [(\mathbf{f}_{\mathrm{row}}^{\mathrm{OSC}})^T, (\mathbf{f}_{\mathrm{col}}^{\mathrm{OSC}})^T]^T \qquad (61)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$u_{\mathrm{MSC-row}}^{\mathrm{NASE}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSC}^{\mathrm{NASE}}(j, d) \qquad (62)$$

$$\sigma_{\mathrm{MSC-row}}^{\mathrm{NASE}}(d) = \Bigl(\frac{1}{J} \sum_{j=0}^{J-1} \bigl(\mathrm{MSC}^{\mathrm{NASE}}(j, d) - u_{\mathrm{MSC-row}}^{\mathrm{NASE}}(d)\bigr)^2\Bigr)^{1/2} \qquad (63)$$

$$u_{\mathrm{MSV-row}}^{\mathrm{NASE}}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \mathrm{MSV}^{\mathrm{NASE}}(j, d) \qquad (64)$$

$$\sigma_{\mathrm{MSV-row}}^{\mathrm{NASE}}(d) = \Bigl(\frac{1}{J} \sum_{j=0}^{J-1} \bigl(\mathrm{MSV}^{\mathrm{NASE}}(j, d) - u_{\mathrm{MSV-row}}^{\mathrm{NASE}}(d)\bigr)^2\Bigr)^{1/2} \qquad (65)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{\mathrm{row}}^{\mathrm{NASE}} = [u_{\mathrm{MSC-row}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSC-row}}^{\mathrm{NASE}}(0), u_{\mathrm{MSV-row}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSV-row}}^{\mathrm{NASE}}(0), \ldots, u_{\mathrm{MSC-row}}^{\mathrm{NASE}}(D-1), \sigma_{\mathrm{MSC-row}}^{\mathrm{NASE}}(D-1), u_{\mathrm{MSV-row}}^{\mathrm{NASE}}(D-1), \sigma_{\mathrm{MSV-row}}^{\mathrm{NASE}}(D-1)]^T \qquad (66)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{\mathrm{MSC-col}}^{\mathrm{NASE}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSC}^{\mathrm{NASE}}(j, d) \qquad (67)$$

$$\sigma_{\mathrm{MSC-col}}^{\mathrm{NASE}}(j) = \Bigl(\frac{1}{D} \sum_{d=0}^{D-1} \bigl(\mathrm{MSC}^{\mathrm{NASE}}(j, d) - u_{\mathrm{MSC-col}}^{\mathrm{NASE}}(j)\bigr)^2\Bigr)^{1/2} \qquad (68)$$

$$u_{\mathrm{MSV-col}}^{\mathrm{NASE}}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \mathrm{MSV}^{\mathrm{NASE}}(j, d) \qquad (69)$$

$$\sigma_{\mathrm{MSV-col}}^{\mathrm{NASE}}(j) = \Bigl(\frac{1}{D} \sum_{d=0}^{D-1} \bigl(\mathrm{MSV}^{\mathrm{NASE}}(j, d) - u_{\mathrm{MSV-col}}^{\mathrm{NASE}}(j)\bigr)^2\Bigr)^{1/2} \qquad (70)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{\mathrm{col}}^{\mathrm{NASE}} = [u_{\mathrm{MSC-col}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSC-col}}^{\mathrm{NASE}}(0), u_{\mathrm{MSV-col}}^{\mathrm{NASE}}(0), \sigma_{\mathrm{MSV-col}}^{\mathrm{NASE}}(0), \ldots, u_{\mathrm{MSC-col}}^{\mathrm{NASE}}(J-1), \sigma_{\mathrm{MSC-col}}^{\mathrm{NASE}}(J-1), u_{\mathrm{MSV-col}}^{\mathrm{NASE}}(J-1), \sigma_{\mathrm{MSV-col}}^{\mathrm{NASE}}(J-1)]^T \qquad (71)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D+4J) is obtained:

$$\mathbf{f}^{\mathrm{NASE}} = [(\mathbf{f}_{\mathrm{row}}^{\mathrm{NASE}})^T, (\mathbf{f}_{\mathrm{col}}^{\mathrm{NASE}})^T]^T \qquad (72)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

37

Fig 28 The row-based modulation spectral feature vector: each row (feature dimension d) of the MSC/MSV matrices is aggregated along the modulation-frequency axis into μ_d^row and σ_d^row.

Fig 29 The column-based modulation spectral feature vector: each column (modulation subband j) of the MSC/MSV matrices is aggregated along the feature-dimension axis into μ_j^col and σ_j^col.


216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre

is derived by averaging the feature vectors of the whole set of training music signals

of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n} (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th

music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c

is the number of training music signals belonging to the c-th music genre. Since the

dynamic ranges of different feature values may differ, a linear normalization is

applied to get the normalized feature vector f̂_c:

f̂_c(m) = ( f̄_c(m) − f_min(m) ) / ( f_max(m) − f_min(m) ), 1 ≤ c ≤ C (74)

where C is the number of classes, f̄_c(m) denotes the m-th feature value of the c-th

representative feature vector, and f_max(m) and f_min(m) denote respectively the

maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)

f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m) (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece

belonging to the c-th music genre.
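A minimal sketch of the min-max normalization of Eqs. (74)-(75), assuming NumPy (the function name and the divide-by-zero guard for constant dimensions are our additions):

```python
import numpy as np

def linear_normalize(feats):
    """Min-max normalization over all training vectors, Eqs. (74)-(75)."""
    f_min = feats.min(axis=0)                            # per-dimension minimum
    f_max = feats.max(axis=0)                            # per-dimension maximum
    scale = np.where(f_max > f_min, f_max - f_min, 1.0)  # guard constant dimensions
    return (feats - f_min) / scale, f_min, scale
```

The returned f_min and scale would be reused to normalize test vectors with the same mapping.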

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification

accuracy in a lower-dimensional feature space. LDA deals with the

discrimination between various classes rather than the representation of all classes.

The objective of LDA is to minimize the within-class distance while maximizing the

between-class distance. In LDA, an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in

order to provide higher discriminability among various music classes.

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class

c, C is the total number of music classes, and N_c is the number of training vectors

labeled as class c. The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T (77)

where x̄ is the mean vector of all training vectors. The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) ) (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space. In this study, a whitening procedure is integrated with the LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the


orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the

corresponding eigenvalues, so that S_W Φ = ΦΛ. Each training vector x is then

whitening-transformed by ΦΛ^{−1/2}:

x_w = (ΦΛ^{−1/2})^T x (79)

It can be shown that the whitened within-class scatter matrix

S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}), derived from all the whitened training vectors,

becomes an identity matrix I. Thus, the whitened between-class scatter matrix

S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A

transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w.

Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors

corresponding to the (C−1) largest eigenvalues will form the column vectors of the

transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix

A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower

h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced

h-dimensional feature vector can be computed by

y = A_WLDA^T x (81)
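As a rough sketch of the whitening-plus-LDA pipeline above (Eqs. (76)-(81)), assuming NumPy is available (the function name, the eigenvalue floor, and the default h = C−1 are our choices, not from the thesis):

```python
import numpy as np

def whitened_lda(X, labels, h=None):
    """Whitened LDA transform, Eqs. (76)-(81): returns A_WLDA of shape (H, h)."""
    classes = np.unique(labels)
    C = len(classes)
    H = X.shape[1]
    xbar = X.mean(axis=0)                               # global mean vector
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                   # within-class scatter, Eq. (76)
        Sb += len(Xc) * np.outer(mc - xbar, mc - xbar)  # between-class scatter, Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                       # Sw = Phi diag(lam) Phi^T
    lam = np.clip(lam, 1e-10, None)                     # guard against zero eigenvalues
    W = Phi @ np.diag(lam ** -0.5)                      # whitening matrix Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                                 # whitened between-class scatter
    vals, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(vals)[::-1]                      # eigenvalues in decreasing order
    h = C - 1 if h is None else h
    return W @ Psi[:, order[:h]]                        # A_WLDA = Phi Lambda^{-1/2} Psi, Eq. (80)
```

A feature vector x is then reduced with y = A.T @ x, as in Eq. (81).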

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA

transformed feature vector. In this study, the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n} (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music

track labeled as the c-th music genre, ȳ_c is the representative feature vector of the

c-th music genre, and N_c is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance. Thus, the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has the minimum Euclidean

distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c) (83)
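The nearest-centroid rule of Eqs. (82)-(83) can be sketched as follows, assuming NumPy (function names are illustrative):

```python
import numpy as np

def fit_centroids(Y, labels):
    """Representative vector per genre: mean of transformed vectors, Eq. (82)."""
    return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(y, centroids):
    """Nearest-centroid decision, Eq. (83): genre with minimum Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(y - centroids[c]))
```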

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison. The database consists of 1458 music tracks, in

which 729 music tracks are used for training and the other 729 tracks for testing. The

audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this

study, each MP3 audio file is first converted into raw digital audio before

classification. These music tracks are classified into six classes (that is, C = 6):

Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the

music tracks used for training/testing include 320/320 tracks of Classical, 115/114

tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102

tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows:

CA = Σ_{1≤c≤C} P_c · CA_c (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the

classification accuracy for the c-th music genre.
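Eq. (84) amounts to a prior-weighted average of the per-genre accuracies, with P_c estimated from the per-genre track counts; a one-line sketch (function name ours):

```python
def overall_accuracy(per_genre_acc, genre_counts):
    """CA = sum_c P_c * CA_c (Eq. 84); P_c is estimated from the per-genre counts."""
    total = sum(genre_counts)
    return sum(ca * n / total for ca, n in zip(per_genre_acc, genre_counts))

# e.g. overall_accuracy([1.0, 0.5], [3, 1]) -> 0.875
```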

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1

denote respectively the row-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,

and the combined feature vector performs the best. Table 32 shows the corresponding

confusion matrices.

Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64


Table 32 Confusion matrices of row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix lists track counts (columns: true genre; rows: classified genre) and the second the corresponding percentages.

(a) SMMFCC1 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19
Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4
MetalPunk 2 3 0 36 20 4
PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75
Total 320 114 26 45 102 122

(a) SMMFCC1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.94 0.00 7.69 0.00 0.98 15.57
Electronic 0.00 79.82 0.00 2.22 6.86 4.92
Jazz 1.88 0.00 69.23 0.00 0.00 3.28
MetalPunk 0.63 2.63 0.00 80.00 19.61 3.28
PopRock 1.25 10.53 19.23 17.78 68.63 11.48
World 10.31 7.02 3.85 0.00 3.92 61.48

(b) SMOSC1 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10
Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6
MetalPunk 0 5 0 32 21 3
PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84
Total 320 114 26 45 102 122

(b) SMOSC1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 91.25 0.88 3.85 0.00 1.96 8.20
Electronic 0.31 78.07 3.85 4.44 10.78 9.02
Jazz 1.25 0.00 73.08 2.22 0.98 4.92
MetalPunk 0.00 4.39 0.00 71.11 20.59 2.46
PopRock 0.00 11.40 11.54 22.22 59.80 6.56
World 7.19 5.26 7.69 0.00 5.88 68.85

(c) SMASE1 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18
Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9
MetalPunk 0 4 1 36 18 4
PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73
Total 320 114 26 45 102 122

(c) SMASE1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 89.38 2.63 3.85 0.00 2.94 14.75
Electronic 0.00 76.32 3.85 2.22 8.82 4.10
Jazz 1.56 3.51 65.38 0.00 0.00 7.38
MetalPunk 0.00 3.51 3.85 80.00 17.65 3.28
PopRock 0.31 8.77 11.54 15.56 66.67 10.66
World 8.75 5.26 11.54 2.22 3.92 59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9
Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1
MetalPunk 0 1 0 34 8 1
PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86
Total 320 114 26 45 102 122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 3.85 0.00 0.00 7.38
Electronic 0.00 84.21 3.85 2.22 8.82 7.38
Jazz 0.63 0.88 80.77 0.00 0.00 0.82
MetalPunk 0.00 0.88 0.00 75.56 7.84 0.82
PopRock 0.31 7.89 7.69 20.00 78.43 13.11
World 5.31 6.14 3.85 2.22 4.90 70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2

denote respectively the column-based modulation spectral feature vectors derived from

modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see

that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2,

which differs from the row-based case. As in the row-based case, the combined feature

vector again achieves the best performance. Table 34 shows the corresponding confusion

matrices.

Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60

Table 34 Confusion matrices of column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix lists track counts (columns: true genre; rows: classified genre) and the second the corresponding percentages.

(a) SMMFCC2 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22
Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19
MetalPunk 2 7 0 39 30 4
PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54
Total 320 114 26 45 102 122

(a) SMMFCC2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26

(b) SMOSC2 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33
Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20
MetalPunk 1 5 0 33 21 2
PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51
Total 320 114 26 45 102 122

(b) SMOSC2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80

(c) SMASE2 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122

(c) SMASE2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of

row-based and column-based modulation spectral feature vectors. SMMFCC3,

SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC,

OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that

the combined feature vector gets better classification performance than each

individual row-based or column-based feature vector. In particular, the proposed

method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of

85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix lists track counts (columns: true genre; rows: classified genre) and the second the corresponding percentages.

(a) SMMFCC3 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) SMMFCC3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57

(b) SMOSC3 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) SMOSC3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) SMASE3 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) SMASE3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23

Conventional methods use the energy of each modulation subband as the

feature value. In contrast, we use the modulation spectral contrasts (MSCs) and

modulation spectral valleys (MSVs) computed from each modulation subband as

the feature values. Table 37 shows the classification results of these two

approaches. From Table 37, we can see that using MSCs and MSVs gives

better performance than the conventional method when the row-based and

column-based modulation spectral feature vectors are combined. In this table,

SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based,

column-based, and combined feature vectors derived from modulation spectral

analysis of MFCC.


Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set | MSCs & MSVs | MSE
SMMFCC1 | 77.50 | 72.02
SMMFCC2 | 70.64 | 69.82
SMMFCC3 | 80.38 | 79.15
SMOSC1 | 79.15 | 77.50
SMOSC2 | 68.59 | 70.51
SMOSC3 | 81.34 | 80.11
SMASE1 | 77.78 | 76.41
SMASE2 | 71.74 | 71.06
SMASE3 | 81.21 | 79.15
SMMFCC1+SMOSC1+SMASE1 | 84.64 | 85.08
SMMFCC2+SMOSC2+SMASE2 | 78.60 | 79.01
SMMFCC3+SMOSC3+SMASE3 | 85.32 | 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre

Classification Contest


References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox "Features and classifiers for the automatic classification of

musical audio signals" Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664


[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical

genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of

IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200

March 2005

[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"

Journal of new musical research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in


Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and

Symbolic Music Information Retrieval" in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis

model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using

the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for

content identification" IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content


indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and

classification using local discriminant bases" IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Y Freund and R E Schapire "A decision-theoretic generalization of

on-line learning and an application to boosting" Journal of Computer and System

Sciences 55 (1) (1997) 119-139



Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames, and the FFT is then applied to obtain the corresponding spectrum of each frame.

Step 2 Octave Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave

scale filters shown in Table 22 The octave scale filtering operation can be

described as follows

E_i(b) = Σ_{k=I_b^l}^{I_b^h} A_i[k], 0 ≤ b < B, 0 ≤ k ≤ N/2 − 1 (9)

where B is the number of subbands, and I_b^l and I_b^h denote respectively the

low-frequency index and high-frequency index of the b-th band-pass filter.

A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|².

I_b^l and I_b^h are given as

I_b^l = (f_b^l / f_s) · N, I_b^h = (f_b^h / f_s) · N (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high

frequency of the b-th band-pass filter.

Step 3 Peak Valley Selection

Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th

subband, where N_b is the number of FFT frequency bins in the b-th subband.

Without loss of generality, let the magnitude spectrum be sorted in a

decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and

spectral valley in the b-th subband are then estimated as follows:


Peak(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,i} ) (11)

Valley(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b−i+1} ) (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral

contrast is given by the difference between the spectral peak and the spectral

valley:

SC(b) = Peak(b) − Valley(b) (13)

The feature vector of an audio frame consists of the spectral contrasts and the

spectral valleys of all subbands Thus the OSC feature vector of an audio frame can

be represented as follows

x_OSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T (14)

Fig 22 The flowchart for computing OSC (framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast)


Table 22 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number | Frequency interval (Hz)
0 | [0, 0]
1 | (0, 100]
2 | (100, 200]
3 | (200, 400]
4 | (400, 800]
5 | (800, 1600]
6 | (1600, 3200]
7 | (3200, 6400]
8 | (6400, 12800]
9 | (12800, 22050)
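The per-frame OSC computation of Steps 1-3 (Eqs. (9)-(14)) can be sketched as follows, assuming NumPy; the band edges come from Table 22, and the small epsilon inside the logarithm is our guard (the frame must be long enough that every band holds at least one FFT bin):

```python
import numpy as np

# Octave-scale filter edges in Hz (Table 22); filter 0 is the DC bin alone
BAND_EDGES = [0, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 22050]

def osc_frame(frame, fs=44100, alpha=0.2):
    """OSC feature vector of one frame, Eqs. (9)-(14)."""
    N = len(frame)
    mag = np.abs(np.fft.rfft(frame))                 # magnitude spectrum
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    masks = [freqs == 0.0]                           # filter 0: [0, 0]
    masks += [(freqs > lo) & (freqs <= hi)
              for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:])]
    valleys, contrasts = [], []
    for mask in masks:
        band = np.sort(mag[mask])[::-1]              # decreasing order
        nb = max(1, int(alpha * band.size))          # top/bottom alpha fraction
        peak = np.log(1e-12 + band[:nb].mean())      # Eq. (11)
        valley = np.log(1e-12 + band[-nb:].mean())   # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)              # Eq. (13)
    return np.array(valleys + contrasts)             # x_OSC, Eq. (14)
```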

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification The NASE descriptor

provides a representation of the power spectrum of each audio frame Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband Fig 23 shows the block diagram for extracting the

NASE feature For a given music piece the main steps for computing NASE are

described as follow

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames and each audio frame is multiplied by a Hamming window function

and analyzed using FFT to derive its spectrum, denoted X(k), 0 ≤ k < N,

where N is the size of the FFT. The power spectrum is defined as the normalized

squared magnitude of the DFT spectrum X(k):

P(k) = (1/(N·E_w)) |X(k)|² for k = 0 and k = N/2,
P(k) = (2/(N·E_w)) |X(k)|² for 0 < k < N/2 (15)


where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|² (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands

spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a

spectrum of an 8-octave interval (see Fig 24). The NASE scale filtering

operation can be described as follows (see Table 23):

ASE_i(b) = Σ_{k=I_b^l}^{I_b^h} P_i(k), 0 ≤ b < B, 0 ≤ k ≤ N/2 − 1 (17)

where B is the number of logarithmic subbands within the frequency range

[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of

the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16

and r = 1/2 in this study):

r = 2^j octaves, −4 ≤ j ≤ 3 (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th

band-pass filter, given as

I_b^l = (f_b^l / f_s) · N, I_b^h = (f_b^h / f_s) · N (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and

high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power

spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k), 0 ≤ b ≤ B + 1 (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 log_10(ASE(b)), 0 ≤ b ≤ B + 1 (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R, 0 ≤ b ≤ B + 1 (22)

where the RMS-norm gain value R is defined as

R = √( Σ_{b=0}^{B+1} (ASE_dB(b))² ) (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing the power

between 0 Hz and loEdge, a series of coefficients representing the power in

logarithmically spaced bands between loEdge and hiEdge, a coefficient representing

the power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension

of NASE is B+3, and the NASE feature vector of an audio frame can be

represented as follows:

x_NASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T (24)


Fig 23 The flowchart for computing NASE (framing → windowing → FFT → subband decomposition → normalized audio spectral envelope)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 logarithmically spaced coefficients between 62.5 Hz and 16 kHz, and one coefficient above hiEdge = 16 kHz)


Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number | Frequency interval (Hz)
0 | (0, 62]
1 | (62, 88]
2 | (88, 125]
3 | (125, 176]
4 | (176, 250]
5 | (250, 353]
6 | (353, 500]
7 | (500, 707]
8 | (707, 1000]
9 | (1000, 1414]
10 | (1414, 2000]
11 | (2000, 2828]
12 | (2828, 4000]
13 | (4000, 5656]
14 | (5656, 8000]
15 | (8000, 11313]
16 | (11313, 16000]
17 | (16000, 22050]
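A rough per-frame sketch of Steps 1-3 (Eqs. (15)-(24)), assuming NumPy; the band edges are taken from Table 23, and the epsilon guard inside the logarithm is our addition:

```python
import numpy as np

# Subband edges in Hz (Table 23): 18 bands between 0 Hz and 22050 Hz
NASE_EDGES = [0, 62, 88, 125, 176, 250, 353, 500, 707, 1000, 1414, 2000,
              2828, 4000, 5656, 8000, 11313, 16000, 22050]

def nase_frame(frame, fs=44100):
    """NASE vector of one frame: [R, NASE(0), ..., NASE(B+1)], Eq. (24)."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                          # window energy, Eq. (16)
    X = np.fft.rfft(frame * w)
    P = (2.0 / (N * Ew)) * np.abs(X) ** 2        # power spectrum, Eq. (15)
    P[0] /= 2.0                                  # DC and Nyquist bins are not doubled
    P[-1] /= 2.0
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    ase = np.array([P[(freqs > lo) & (freqs <= hi)].sum()      # Eq. (20)
                    for lo, hi in zip(NASE_EDGES[:-1], NASE_EDGES[1:])])
    ase_db = 10.0 * np.log10(ase + 1e-12)        # decibel scale, Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))             # RMS-norm gain, Eq. (23)
    return np.concatenate(([R], ase_db / R))     # Eqs. (22) and (24)
```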

214 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of

audio signals. In order to capture the time-varying behavior of the music signals, we

employ modulation spectral analysis on MFCC, OSC, and NASE to observe the

variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis


Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.

The modulation spectrogram is obtained by applying FFT independently on

each feature value along the time trajectory within a texture window of

length W:

M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{t×W+n}[l] · e^{−j2πnm/W}, 0 ≤ m < W, 0 ≤ l < L (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and l is the MFCC coefficient index. In

this study, W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows. The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows:

M̄^MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|, 0 ≤ m < W, 0 ≤ l < L (26)

where T is the total number of texture windows in the music track

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

MSP^{MFCC}(j, l) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^{MFCC}(m, l)   (27)

MSV^{MFCC}(j, l) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^{MFCC}(m, l)   (28)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) − MSV^{MFCC}(j, l)   (29)

As a result, all MSCs (or MSVs) form an L×J matrix that contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
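As a concrete illustration, the three steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the implementation used in this thesis: the helper name `modulation_msc_msv` is hypothetical, the per-frame feature matrix `feat` (one row per frame, one column per coefficient) is assumed to come from Step 1, and the default subband index ranges follow Table 2.4.

```python
import numpy as np

def modulation_msc_msv(feat, W=512,
                       subbands=((0, 2), (2, 4), (4, 8), (8, 16),
                                 (16, 32), (32, 64), (64, 128), (128, 256))):
    """feat: (num_frames, L) per-frame feature matrix (e.g. MFCC).
    Returns the MSC and MSV matrices, each of shape (J, L).
    Assumes the track spans at least one texture window of W frames."""
    hop = W // 2                              # 50% overlap between texture windows
    T = (feat.shape[0] - W) // hop + 1        # number of texture windows
    M = np.zeros((W, feat.shape[1]))
    for t in range(T):
        win = feat[t * hop : t * hop + W, :]  # one texture window
        M += np.abs(np.fft.fft(win, axis=0))  # Eq. (25): FFT along each trajectory
    M /= T                                    # Eq. (26): time-average over windows
    # Eqs. (27)-(29): peak, valley, and contrast within each modulation subband
    MSP = np.array([M[lo:hi, :].max(axis=0) for lo, hi in subbands])
    MSV = np.array([M[lo:hi, :].min(axis=0) for lo, hi in subbands])
    return MSP - MSV, MSV                     # MSC, MSV
```

Applied to a 20-coefficient MFCC matrix, this yields two 8×20 matrices, i.e. the 2×20×8 = 320 MMFCC values.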

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let be the d-th OSC of the i-th frame The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W

][dOSCi Dd ltle0

0 0 )()(1

0

2

)2( DdWmedOSCdmMW

n

mWnj

nWtt ltleltle= summinus

=

minus

+times

π (30)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the OSC coefficient index In the

study W is 512 which is about 6 seconds with 50 overlap between two

successive texture windows The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

0 0 )(1)(1

DdWmdmMT

dmMT

tt

OSC ltleltle= sum=

(31)

where T is the total number of texture windows in the music track

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

MSP^{OSC}(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^{OSC}(m, d)   (32)

MSV^{OSC}(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^{OSC}(m, d)   (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) − MSV^{OSC}(j, d)   (34)

As a result, all MSCs (or MSVs) form a D×J matrix that contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = | Σ_{n=0}^{W−1} NASE_{t·W/2+n}[d] e^{−j2πnm/W} |,  0 ≤ m < W, 0 ≤ d < D   (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M̄^{NASE}(m, d) = (1/T) Σ_{t=1}^{T} M_t(m, d),  0 ≤ m < W, 0 ≤ d < D   (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated:

MSP^{NASE}(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^{NASE}(m, d)   (37)

MSV^{NASE}(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M̄^{NASE}(m, d)   (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) − MSV^{NASE}(j, d)   (39)

As a result, all MSCs (or MSVs) form a D×J matrix that contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE (framing → NASE extraction → DFT along each feature trajectory → windowing/averaging of the modulation spectra → contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband for different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u_{MSC-row}^{MFCC}(l) = (1/J) Σ_{j=0}^{J−1} MSC^{MFCC}(j, l)   (40)

σ_{MSC-row}^{MFCC}(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^{MFCC}(j, l) − u_{MSC-row}^{MFCC}(l) )² ]^{1/2}   (41)

u_{MSV-row}^{MFCC}(l) = (1/J) Σ_{j=0}^{J−1} MSV^{MFCC}(j, l)   (42)

σ_{MSV-row}^{MFCC}(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^{MFCC}(j, l) − u_{MSV-row}^{MFCC}(l) )² ]^{1/2}   (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_row^{MFCC} = [ u_{MSC-row}^{MFCC}(0), σ_{MSC-row}^{MFCC}(0), u_{MSV-row}^{MFCC}(0), σ_{MSV-row}^{MFCC}(0), …, u_{MSC-row}^{MFCC}(L−1), σ_{MSC-row}^{MFCC}(L−1), u_{MSV-row}^{MFCC}(L−1), σ_{MSV-row}^{MFCC}(L−1) ]^T   (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{MFCC}(j) = (1/L) Σ_{l=0}^{L−1} MSC^{MFCC}(j, l)   (45)

σ_{MSC-col}^{MFCC}(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSC^{MFCC}(j, l) − u_{MSC-col}^{MFCC}(j) )² ]^{1/2}   (46)

u_{MSV-col}^{MFCC}(j) = (1/L) Σ_{l=0}^{L−1} MSV^{MFCC}(j, l)   (47)

σ_{MSV-col}^{MFCC}(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSV^{MFCC}(j, l) − u_{MSV-col}^{MFCC}(j) )² ]^{1/2}   (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^{MFCC} = [ u_{MSC-col}^{MFCC}(0), σ_{MSC-col}^{MFCC}(0), u_{MSV-col}^{MFCC}(0), σ_{MSV-col}^{MFCC}(0), …, u_{MSC-col}^{MFCC}(J−1), σ_{MSC-col}^{MFCC}(J−1), u_{MSV-col}^{MFCC}(J−1), σ_{MSV-col}^{MFCC}(J−1) ]^T   (49)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [ (f_row^{MFCC})^T, (f_col^{MFCC})^T ]^T   (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
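The aggregation of Eqs. (40)-(50) can be sketched as follows. This is an illustrative helper, not the thesis's code; `aggregate` is a hypothetical name, and it takes the J×L MSC and MSV matrices in the orientation produced by the sketch in Section 2.1.4 (subbands along axis 0, feature dimensions along axis 1).

```python
import numpy as np

def aggregate(msc, msv):
    """msc, msv: (J, L) matrices. Returns the (4L + 4J)-dimensional feature vector."""
    def stats(mat, axis):
        # mean and population standard deviation, Eqs. (40)-(43) / (45)-(48)
        return np.mean(mat, axis=axis), np.std(mat, axis=axis)
    # row-based: statistics over the J subbands for each of the L coefficients -> 4L values
    u_c, s_c = stats(msc, axis=0)
    u_v, s_v = stats(msv, axis=0)
    f_row = np.column_stack([u_c, s_c, u_v, s_v]).ravel()    # Eq. (44)
    # column-based: statistics over the L coefficients for each of the J subbands -> 4J values
    u_c, s_c = stats(msc, axis=1)
    u_v, s_v = stats(msv, axis=1)
    f_col = np.column_stack([u_c, s_c, u_v, s_v]).ravel()    # Eq. (49)
    return np.concatenate([f_row, f_col])                    # Eq. (50)
```

With J = 8 and L = 20 this returns a 112-dimensional vector, matching the SMMFCC dimension above.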

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u_{MSC-row}^{OSC}(d) = (1/J) Σ_{j=0}^{J−1} MSC^{OSC}(j, d)   (51)

σ_{MSC-row}^{OSC}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^{OSC}(j, d) − u_{MSC-row}^{OSC}(d) )² ]^{1/2}   (52)

u_{MSV-row}^{OSC}(d) = (1/J) Σ_{j=0}^{J−1} MSV^{OSC}(j, d)   (53)

σ_{MSV-row}^{OSC}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^{OSC}(j, d) − u_{MSV-row}^{OSC}(d) )² ]^{1/2}   (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_row^{OSC} = [ u_{MSC-row}^{OSC}(0), σ_{MSC-row}^{OSC}(0), u_{MSV-row}^{OSC}(0), σ_{MSV-row}^{OSC}(0), …, u_{MSC-row}^{OSC}(D−1), σ_{MSC-row}^{OSC}(D−1), u_{MSV-row}^{OSC}(D−1), σ_{MSV-row}^{OSC}(D−1) ]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{OSC}(j) = (1/D) Σ_{d=0}^{D−1} MSC^{OSC}(j, d)   (56)

σ_{MSC-col}^{OSC}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSC^{OSC}(j, d) − u_{MSC-col}^{OSC}(j) )² ]^{1/2}   (57)

u_{MSV-col}^{OSC}(j) = (1/D) Σ_{d=0}^{D−1} MSV^{OSC}(j, d)   (58)

σ_{MSV-col}^{OSC}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSV^{OSC}(j, d) − u_{MSV-col}^{OSC}(j) )² ]^{1/2}   (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^{OSC} = [ u_{MSC-col}^{OSC}(0), σ_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), σ_{MSV-col}^{OSC}(0), …, u_{MSC-col}^{OSC}(J−1), σ_{MSC-col}^{OSC}(J−1), u_{MSV-col}^{OSC}(J−1), σ_{MSV-col}^{OSC}(J−1) ]^T   (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [ (f_row^{OSC})^T, (f_col^{OSC})^T ]^T   (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_{MSC-row}^{NASE}(d) = (1/J) Σ_{j=0}^{J−1} MSC^{NASE}(j, d)   (62)

σ_{MSC-row}^{NASE}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^{NASE}(j, d) − u_{MSC-row}^{NASE}(d) )² ]^{1/2}   (63)

u_{MSV-row}^{NASE}(d) = (1/J) Σ_{j=0}^{J−1} MSV^{NASE}(j, d)   (64)

σ_{MSV-row}^{NASE}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^{NASE}(j, d) − u_{MSV-row}^{NASE}(d) )² ]^{1/2}   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_row^{NASE} = [ u_{MSC-row}^{NASE}(0), σ_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), σ_{MSV-row}^{NASE}(0), …, u_{MSC-row}^{NASE}(D−1), σ_{MSC-row}^{NASE}(D−1), u_{MSV-row}^{NASE}(D−1), σ_{MSV-row}^{NASE}(D−1) ]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{NASE}(j) = (1/D) Σ_{d=0}^{D−1} MSC^{NASE}(j, d)   (67)

σ_{MSC-col}^{NASE}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSC^{NASE}(j, d) − u_{MSC-col}^{NASE}(j) )² ]^{1/2}   (68)

u_{MSV-col}^{NASE}(j) = (1/D) Σ_{d=0}^{D−1} MSV^{NASE}(j, d)   (69)

σ_{MSV-col}^{NASE}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSV^{NASE}(j, d) − u_{MSV-col}^{NASE}(j) )² ]^{1/2}   (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_col^{NASE} = [ u_{MSC-col}^{NASE}(0), σ_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), σ_{MSV-col}^{NASE}(0), …, u_{MSC-col}^{NASE}(J−1), σ_{MSC-col}^{NASE}(J−1), u_{MSV-col}^{NASE}(J−1), σ_{MSV-col}^{NASE}(J−1) ]^T   (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [ (f_row^{NASE})^T, (f_col^{NASE})^T ]^T   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


Fig. 2.8 The row-based modulation spectral feature values: the mean μ_row and standard deviation σ_row are computed along each row (fixed feature dimension, across modulation frequency) of the MSC and MSV matrices

Fig. 2.9 The column-based modulation spectral feature values: the mean μ_col and standard deviation σ_col are computed along each column (fixed modulation subband, across feature dimensions) of the MSC and MSV matrices

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = ( f̄_c(m) − f_min(m) ) / ( f_max(m) − f_min(m) ),  1 ≤ c ≤ C   (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)

f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
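A minimal sketch of this min-max normalization is shown below. The helper name `minmax_normalize` is hypothetical, and the guard against constant features (zero range) is an added assumption not stated in the text.

```python
import numpy as np

def minmax_normalize(train_vectors):
    """train_vectors: (N, M) feature vectors of all training pieces.
    Returns normalized vectors plus (f_min, f_max) for normalizing test data."""
    f_min = train_vectors.min(axis=0)                    # Eq. (75)
    f_max = train_vectors.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)   # guard constant features
    normalized = (train_vectors - f_min) / span          # Eq. (74), per feature value
    return normalized, f_min, f_max
```

The returned `f_min` and `f_max` are reused in the classification phase so that test vectors are scaled exactly as the training vectors were.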

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T   (77)

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) )   (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{−1/2}:

x_w = (ΦΛ^{−1/2})^T x   (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ   (80)

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x   (81)
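The whitened LDA procedure of Eqs. (76)-(81) can be sketched with NumPy's symmetric eigendecomposition. This is an illustrative sketch, not the thesis's implementation; the function name is hypothetical, and the small eigenvalue floor is an added numerical guard.

```python
import numpy as np

def whitened_lda(X, y, n_components):
    """X: (N, H) training vectors, y: (N,) integer class labels.
    Returns the H x h whitened LDA transformation matrix A_WLDA of Eq. (80)."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                            # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)   # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)              # S_W Phi = Phi Lambda
    lam = np.maximum(lam, 1e-10)               # numerical guard (added assumption)
    white = Phi @ np.diag(lam ** -0.5)         # Phi Lambda^{-1/2}, Eq. (79)
    Sb_w = white.T @ Sb @ white                # whitened between-class scatter
    vals, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(vals)[::-1][:n_components]]  # top (C-1) eigenvectors
    return white @ Psi                         # Eq. (80)
```

Each feature vector is then reduced with `y_low = A.T @ x`, matching Eq. (81); in the study n_components would be C−1 = 5.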

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c)   (83)
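The nearest centroid decision of Eqs. (82)-(83) amounts to a few lines; this hypothetical helper assumes the class centroids have already been computed from the transformed training vectors:

```python
import numpy as np

def classify(y_vec, centroids):
    """y_vec: transformed feature vector of a test track; centroids: (C, h) matrix
    whose c-th row is the class centroid of Eq. (82).
    Returns the index of the identified genre, Eq. (83)."""
    d = np.linalg.norm(centroids - y_vec, axis=1)  # Euclidean distance to each centroid
    return int(np.argmin(d))
```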

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{c=1}^{C} P_c · CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
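Eq. (84) is a prior-weighted average; a small sketch (with a hypothetical helper name, and class priors estimated as P_c = N_c / N):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): overall accuracy as the class-prior-weighted sum of
    per-class accuracies, with P_c estimated as N_c / N."""
    total = sum(class_counts)
    return sum(a * n / total for a, n in zip(per_class_acc, class_counts))
```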

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64


Table 3.2 Confusion matrices of row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each part, the upper matrix gives classification counts (columns: true genre; rows: classified genre) and the lower matrix the corresponding percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       275        0        2        0          1        19
Electronic      0       91        0        1          7         6
Jazz            6        0       18        0          0         4
Metal/Punk      2        3        0       36         20         4
Rock/Pop        4       12        5        8         70        14
World          33        8        1        0          4        75
Total         320      114       26       45        102       122

(a) SMMFCC1 (%)
Classic      85.94     0.00     7.69     0.00       0.98     15.57
Electronic    0.00    79.82     0.00     2.22       6.86      4.92
Jazz          1.88     0.00    69.23     0.00       0.00      3.28
Metal/Punk    0.63     2.63     0.00    80.00      19.61      3.28
Rock/Pop      1.25    10.53    19.23    17.78      68.63     11.48
World        10.31     7.02     3.85     0.00       3.92     61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       292        1        1        0          2        10
Electronic      1       89        1        2         11        11
Jazz            4        0       19        1          1         6
Metal/Punk      0        5        0       32         21         3
Rock/Pop        0       13        3       10         61         8
World          23        6        2        0          6        84
Total         320      114       26       45        102       122

(b) SMOSC1 (%)
Classic      91.25     0.88     3.85     0.00       1.96      8.20
Electronic    0.31    78.07     3.85     4.44      10.78      9.02
Jazz          1.25     0.00    73.08     2.22       0.98      4.92
Metal/Punk    0.00     4.39     0.00    71.11      20.59      2.46
Rock/Pop      0.00    11.40    11.54    22.22      59.80      6.56
World         7.19     5.26     7.69     0.00       5.88     68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       286        3        1        0          3        18
Electronic      0       87        1        1          9         5
Jazz            5        4       17        0          0         9
Metal/Punk      0        4        1       36         18         4
Rock/Pop        1       10        3        7         68        13
World          28        6        3        1          4        73
Total         320      114       26       45        102       122

(c) SMASE1 (%)
Classic      89.38     2.63     3.85     0.00       2.94     14.75
Electronic    0.00    76.32     3.85     2.22       8.82      4.10
Jazz          1.56     3.51    65.38     0.00       0.00      7.38
Metal/Punk    0.00     3.51     3.85    80.00      17.65      3.28
Rock/Pop      0.31     8.77    11.54    15.56      66.67     10.66
World         8.75     5.26    11.54     2.22       3.92     59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       300        0        1        0          0         9
Electronic      0       96        1        1          9         9
Jazz            2        1       21        0          0         1
Metal/Punk      0        1        0       34          8         1
Rock/Pop        1        9        2        9         80        16
World          17        7        1        1          5        86
Total         320      114       26       45        102       122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
Classic      93.75     0.00     3.85     0.00       0.00      7.38
Electronic    0.00    84.21     3.85     2.22       8.82      7.38
Jazz          0.63     0.88    80.77     0.00       0.00      0.82
Metal/Punk    0.00     0.88     0.00    75.56       7.84      0.82
Rock/Pop      0.31     7.89     7.69    20.00      78.43     13.11
World         5.31     6.14     3.85     2.22       4.90     70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 3.4 Confusion matrices of column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each part, the upper matrix gives classification counts (columns: true genre; rows: classified genre) and the lower matrix the corresponding percentages.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       272        1        1        0          6        22
Electronic      0       84        0        2          8         4
Jazz           13        1       19        1          2        19
Metal/Punk      2        7        0       39         30         4
Rock/Pop        0       11        3        3         47        19
World          33       10        3        0          9        54
Total         320      114       26       45        102       122

(a) SMMFCC2 (%)
Classic      85.00     0.88     3.85     0.00       5.88     18.03
Electronic    0.00    73.68     0.00     4.44       7.84      3.28
Jazz          4.06     0.88    73.08     2.22       1.96     15.57
Metal/Punk    0.63     6.14     0.00    86.67      29.41      3.28
Rock/Pop      0.00     9.65    11.54     6.67      46.08     15.57
World        10.31     8.77    11.54     0.00       8.82     44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       262        2        0        0          3        33
Electronic      0       83        0        1          9         6
Jazz           17        1       20        0          6        20
Metal/Punk      1        5        0       33         21         2
Rock/Pop        0       17        4       10         51        10
World          40        6        2        1         12        51
Total         320      114       26       45        102       122

(b) SMOSC2 (%)
Classic      81.88     1.75     0.00     0.00       2.94     27.05
Electronic    0.00    72.81     0.00     2.22       8.82      4.92
Jazz          5.31     0.88    76.92     0.00       5.88     16.39
Metal/Punk    0.31     4.39     0.00    73.33      20.59      1.64
Rock/Pop      0.00    14.91    15.38    22.22      50.00      8.20
World        12.50     5.26     7.69     2.22      11.76     41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       277        0        0        0          2        29
Electronic      0       83        0        1          5         2
Jazz            9        3       17        1          2        15
Metal/Punk      1        5        1       35         24         7
Rock/Pop        2       13        1        8         57        15
World          31       10        7        0         12        54
Total         320      114       26       45        102       122

(c) SMASE2 (%)
Classic      86.56     0.00     0.00     0.00       1.96     23.77
Electronic    0.00    72.81     0.00     2.22       4.90      1.64
Jazz          2.81     2.63    65.38     2.22       1.96     12.30
Metal/Punk    0.31     4.39     3.85    77.78      23.53      5.74
Rock/Pop      0.63    11.40     3.85    17.78      55.88     12.30
World         9.69     8.77    26.92     0.00      11.76     44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       289        5        0        0          3        18
Electronic      0       89        0        2          4         4
Jazz            2        3       19        0          1        10
Metal/Punk      2        2        0       38         21         2
Rock/Pop        0       12        5        4         61        11
World          27        3        2        1         12        77
Total         320      114       26       45        102       122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
Classic      90.31     4.39     0.00     0.00       2.94     14.75
Electronic    0.00    78.07     0.00     4.44       3.92      3.28
Jazz          0.63     2.63    73.08     0.00       0.98      8.20
Metal/Punk    0.63     1.75     0.00    84.44      20.59      1.64
Rock/Pop      0.00    10.53    19.23     8.89      59.80      9.02
World         8.44     2.63     7.69     2.22      11.76     63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector gets a better classification performance than the individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each part, the upper matrix gives classification counts (columns: true genre; rows: classified genre) and the lower matrix the corresponding percentages.

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       300        2        1        0          3        19
Electronic      0       86        0        1          7         5
Jazz            2        0       18        0          0         3
Metal/Punk      1        4        0       35         18         2
Rock/Pop        1       16        4        8         67        13
World          16        6        3        1          7        80
Total         320      114       26       45        102       122

(a) SMMFCC3 (%)
Classic      93.75     1.75     3.85     0.00       2.94     15.57
Electronic    0.00    75.44     0.00     2.22       6.86      4.10
Jazz          0.63     0.00    69.23     0.00       0.00      2.46
Metal/Punk    0.31     3.51     0.00    77.78      17.65      1.64
Rock/Pop      0.31    14.04    15.38    17.78      65.69     10.66
World         5.00     5.26    11.54     2.22       6.86     65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       300        0        0        0          1        13
Electronic      0       90        1        2          9         6
Jazz            0        0       21        0          0         4
Metal/Punk      0        2        0       31         21         2
Rock/Pop        0       11        3       10         64        10
World          20       11        1        2          7        87
Total         320      114       26       45        102       122

(b) SMOSC3 (%)
Classic      93.75     0.00     0.00     0.00       0.98     10.66
Electronic    0.00    78.95     3.85     4.44       8.82      4.92
Jazz          0.00     0.00    80.77     0.00       0.00      3.28
Metal/Punk    0.00     1.75     0.00    68.89      20.59      1.64
Rock/Pop      0.00     9.65    11.54    22.22      62.75      8.20
World         6.25     9.65     3.85     4.44       6.86     71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       296        2        1        0          0        17
Electronic      1       91        0        1          4         3
Jazz            0        2       19        0          0         5
Metal/Punk      0        2        1       34         20         8
Rock/Pop        2       13        4        8         71         8
World          21        4        1        2          7        81
Total         320      114       26       45        102       122

(c) SMASE3 (%)
Classic      92.50     1.75     3.85     0.00       0.00     13.93
Electronic    0.31    79.82     0.00     2.22       3.92      2.46
Jazz          0.00     1.75    73.08     0.00       0.00      4.10
Metal/Punk    0.00     1.75     3.85    75.56      19.61      6.56
Rock/Pop      0.63    11.40    15.38    17.78      69.61      6.56
World         6.56     3.51     3.85     4.44       6.86     66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic       300        2        0        0          0         8
Electronic      2       95        0        2          7         9
Jazz            1        1       20        0          0         0
Metal/Punk      0        0        0       35         10         1
Rock/Pop        1       10        3        7         79        11
World          16        6        3        1          6        93
Total         320      114       26       45        102       122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
Classic      93.75     1.75     0.00     0.00       0.00      6.56
Electronic    0.63    83.33     0.00     4.44       6.86      7.38
Jazz          0.31     0.88    76.92     0.00       0.00      0.00
Metal/Punk    0.00     0.00     0.00    77.78       9.80      0.82
Rock/Pop      0.31     8.77    11.54    15.56      77.45      9.02
World         5.00     5.26    11.54     2.22       5.88     76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                             77.50      72.02
SMMFCC2                             70.64      69.82
SMMFCC3                             80.38      79.15
SMOSC1                              79.15      77.50
SMOSC2                              68.59      70.51
SMOSC3                              81.34      80.11
SMASE1                              77.78      76.41
SMASE2                              71.74      71.06
SMASE3                              81.21      79.15
SMMFCC1+SMOSC1+SMASE1               84.64      85.08
SMMFCC2+SMOSC2+SMASE2               78.60      79.01
SMMFCC3+SMOSC3+SMASE3               85.32      85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC, OSC and NASE are combined together, the classification

accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre

Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and Adaboost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.


Peak(b) = log( (1/(α·N_b)) · Σ_{i=1}^{α·N_b} M_{b,i} ) (11)

Valley(b) = log( (1/(α·N_b)) · Σ_{i=1}^{α·N_b} M_{b,N_b−i+1} ) (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral

contrast is given by the difference between the spectral peak and the spectral

valley:

SC(b) = Peak(b) − Valley(b) (13)

The feature vector of an audio frame consists of the spectral contrasts and the

spectral valleys of all subbands Thus the OSC feature vector of an audio frame can

be represented as follows

xOSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T (14)
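For concreteness, the peak/valley computation of Eqs. (11)-(14) can be sketched in NumPy; the function name, the (lo, hi) bin-index representation of the subbands, and the descending sort of the subband spectrum are illustrative assumptions of this sketch, not part of the thesis:

```python
import numpy as np

def osc_features(power_spectrum, band_edges, alpha=0.2):
    """Octave-based spectral contrast (OSC) for one frame.

    power_spectrum : FFT magnitude values of one audio frame.
    band_edges     : list of (lo, hi) FFT-bin index pairs, one per subband.
    Returns [Valley(0..B-1), SC(0..B-1)] as in Eq. (14)."""
    valleys, contrasts = [], []
    for lo, hi in band_edges:
        band = np.sort(power_spectrum[lo:hi])[::-1]  # descending order
        nb = max(1, int(round(alpha * len(band))))   # alpha * N_b neighborhood
        peak = np.log(band[:nb].mean())              # Eq. (11), largest values
        valley = np.log(band[-nb:].mean())           # Eq. (12), smallest values
        valleys.append(valley)
        contrasts.append(peak - valley)              # Eq. (13)
    return np.array(valleys + contrasts)
```

A zero-valued subband would make the logarithm diverge; in practice a small floor on the spectrum is advisable.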

Fig 22 The flowchart for computing OSC (framing, FFT, octave-scale filtering, peak/valley selection, spectral contrast)

Table 22 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)

213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification The NASE descriptor

provides a representation of the power spectrum of each audio frame Each

component of the NASE feature vector represents the normalized magnitude of a

particular frequency subband Fig 23 shows the block diagram for extracting the

NASE feature. For a given music piece, the main steps for computing NASE are

described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped

frames, and each audio frame is multiplied by a Hamming window function

and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N,

where N is the size of FFT. The power spectrum is defined as the normalized

squared magnitude of the DFT spectrum X(k):

P(k) = (1/(E_w·N)) · |X(k)|²,  k = 0, N/2
P(k) = (2/(E_w·N)) · |X(k)|²,  0 < k < N/2 (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|² (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands

spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a

spectrum of an 8-octave interval (see Fig 24). The NASE scale filtering

operation can be described as follows (see Table 23):

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k),  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1 (17)

where B is the number of logarithmic subbands within the frequency range

[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of

the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,

r = 1/2 in this study):

r = 2^j octaves,  −4 ≤ j ≤ 3 (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th

band-pass filter, given as

I_b^l = (f_b^l / f_s) · N,  I_b^h = (f_b^h / f_s) · N (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and

high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power

spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k),  0 ≤ b ≤ B + 1 (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 · log10(ASE(b)),  0 ≤ b ≤ B + 1 (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE

coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R,  0 ≤ b ≤ B + 1 (22)

where the RMS-norm gain value R is defined as

R = ( Σ_{b=0}^{B+1} (ASE_dB(b))² )^(1/2) (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power

between 0 Hz and loEdge, a series of coefficients representing power in

logarithmically spaced bands between loEdge and hiEdge, a coefficient representing

power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension

of NASE is B + 3. Thus the NASE feature vector of an audio frame will be

represented as follows:

xNASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T (24)
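Eqs. (21)-(24) amount to a dB conversion followed by RMS normalization; a sketch, where the function name and the guard against log of zero are assumptions of this sketch:

```python
import numpy as np

def nase_vector(ase):
    """Convert the B+2 ASE subband powers of Eq. (20) into the NASE
    vector [R, NASE(0), ..., NASE(B+1)]^T of Eq. (24)."""
    ase_db = 10.0 * np.log10(np.maximum(ase, 1e-12))  # Eq. (21), guarded
    r = np.sqrt(np.sum(ase_db ** 2))                  # RMS-norm gain, Eq. (23)
    nase = ase_db / r                                 # Eq. (22)
    return np.concatenate(([r], nase))                # length B+3
```

By construction the NASE part of the vector has unit Euclidean norm, with the overall level carried by R.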

Fig 23 The flowchart for computing NASE (framing, windowing, FFT, subband decomposition, normalization)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; 1 coefficient below loEdge, 16 coefficients in between, 1 coefficient above)

Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]

214 Modulation Spectral Analysis

MFCC, OSC and NASE capture only short-term frame-based characteristics of

audio signals. In order to capture the time-varying behavior of music signals, we

employ modulation spectral analysis on MFCC, OSC and NASE to observe the

variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.

The modulation spectrogram is obtained by applying FFT independently on

each feature value along the time trajectory within a texture window of

length W:

M_t(m, l) = | Σ_{n=0}^{W−1} MFCC_{t·W+n}[l] · e^{−j2πnm/W} |,  0 ≤ m < W, 0 ≤ l < L (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and l is the MFCC coefficient index. In

this study W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows. The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

M̄^MFCC(m, l) = (1/T) · Σ_{t=1}^{T} M_t(m, l),  0 ≤ m < W, 0 ≤ l < L

where T is the total number of texture windows in the music track
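The texture-window processing of Eqs. (25)-(26) can be sketched as follows, assuming the per-frame feature values are stacked in a (frames × L) matrix; the function name and hop handling are illustrative assumptions:

```python
import numpy as np

def modulation_spectrogram(features, w=512):
    """Average magnitude modulation spectrogram of feature trajectories.

    features : (num_frames, L) matrix of per-frame feature values.
    Returns a (w, L) matrix: |FFT| along the time axis, Eq. (25),
    averaged over 50%-overlapped texture windows, Eq. (26)."""
    hop = w // 2
    num_frames = features.shape[0]
    starts = range(0, num_frames - w + 1, hop)
    mags = [np.abs(np.fft.fft(features[s:s + w], axis=0)) for s in starts]
    return np.mean(mags, axis=0)
```

A constant feature trajectory concentrates all its energy at modulation frequency index m = 0, while periodic beat-related variations show up at nonzero m.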

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^MFCC(j, l) = max_{Φ_j^l ≤ m < Φ_j^h} M̄^MFCC(m, l) (27)

MSV^MFCC(j, l) = min_{Φ_j^l ≤ m < Φ_j^h} M̄^MFCC(m, l) (28)

where Φjl and Φjh are respectively the low modulation frequency index and

26

high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^MFCC(j, l) = MSP^MFCC(j, l) − MSV^MFCC(j, l) (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the

modulation spectral contrast information. Therefore the feature dimension of

MMFCC is 2×20×8 = 320.
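The per-subband peak/valley search of Eqs. (27)-(29) reduces to max/min operations over modulation-frequency index ranges; a sketch, with the subband edges passed as (lo, hi) index pairs (an assumption of this sketch):

```python
import numpy as np

def msc_msv(mod_spec, subband_edges):
    """Modulation spectral contrast and valley per subband, Eqs. (27)-(29).

    mod_spec      : (W, L) averaged modulation spectrogram.
    subband_edges : list of (lo, hi) modulation-frequency index pairs.
    Returns (MSC, MSV), each of shape (J, L)."""
    msp = np.array([mod_spec[lo:hi].max(axis=0) for lo, hi in subband_edges])
    msv = np.array([mod_spec[lo:hi].min(axis=0) for lo, hi in subband_edges])
    return msp - msv, msv   # contrast = peak - valley, Eq. (29)
```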

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC the same modulation spectrum

analysis is applied to the OSC feature values Fig 26 shows the flowchart for

extracting MOSC and the detailed steps will be described below


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W:

M_t(m, d) = | Σ_{n=0}^{W−1} OSC_{t·W+n}[d] · e^{−j2πnm/W} |,  0 ≤ m < W, 0 ≤ d < D (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and d is the OSC coefficient index. In this

study W is 512, which is about 6 seconds, with 50% overlap between two

successive texture windows. The representative modulation spectrogram of a

music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

M̄^OSC(m, d) = (1/T) · Σ_{t=1}^{T} M_t(m, d),  0 ≤ m < W, 0 ≤ d < D

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands In the study

the number of modulation subbands is 8 (J = 8) The frequency interval of

each modulation subband is shown in Table 24 For each feature value the

modulation spectral peak (MSP) and modulation spectral valley (MSV)

within each modulation subband are then evaluated

MSP^OSC(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M̄^OSC(m, d) (32)

MSV^OSC(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M̄^OSC(m, d) (33)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d) (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the

modulation spectral contrast information. Therefore the feature dimension of

MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W:

M_t(m, d) = | Σ_{n=0}^{W−1} NASE_{t·W+n}[d] · e^{−j2πnm/W} |,  0 ≤ m < W, 0 ≤ d < D (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m

is the modulation frequency index, and d is the NASE coefficient index. In

this study W is 512, which is about 6 seconds, with 50% overlap between

two successive texture windows. The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

M̄^NASE(m, d) = (1/T) · Σ_{t=1}^{T} M_t(m, d),  0 ≤ m < W, 0 ≤ d < D

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands(See Table2

30

In the study the number of modulation subbands is 8 (J = 8) The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

MSP^NASE(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M̄^NASE(m, d) (37)

MSV^NASE(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M̄^NASE(m, d) (38)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d) (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the

modulation spectral contrast information. Therefore the feature dimension of

MASE is 2×19×8 = 304.

Fig 27 The flowchart for extracting MASE (framing, NASE extraction, DFT along each feature trajectory, windowing/averaging, contrast/valley determination)

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral

feature value at different modulation frequencies, which reflects the beat interval of a

music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to

the same modulation subband across different spectral/cepstral feature values (see Fig 29).

To reduce the dimension of the feature space, the mean and standard deviation along

each row (and each column) of the MSC and MSV matrices will be computed as the

feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l lt L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

u_MSC-row^MFCC(l) = (1/J) · Σ_{j=0}^{J−1} MSC^MFCC(j, l) (40)

σ_MSC-row^MFCC(l) = ( (1/J) · Σ_{j=0}^{J−1} ( MSC^MFCC(j, l) − u_MSC-row^MFCC(l) )² )^(1/2) (41)

u_MSV-row^MFCC(l) = (1/J) · Σ_{j=0}^{J−1} MSV^MFCC(j, l) (42)

σ_MSV-row^MFCC(l) = ( (1/J) · Σ_{j=0}^{J−1} ( MSV^MFCC(j, l) − u_MSV-row^MFCC(l) )² )^(1/2) (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L

and can be represented as

f_row^MFCC = [u_MSC-row^MFCC(0), σ_MSC-row^MFCC(0), u_MSV-row^MFCC(0), σ_MSV-row^MFCC(0), …, u_MSC-row^MFCC(L−1), σ_MSC-row^MFCC(L−1), u_MSV-row^MFCC(L−1), σ_MSV-row^MFCC(L−1)]^T (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)

column of the MSC and MSV matrices can be computed as follows:

u_MSC-col^MFCC(j) = (1/L) · Σ_{l=0}^{L−1} MSC^MFCC(j, l) (45)

σ_MSC-col^MFCC(j) = ( (1/L) · Σ_{l=0}^{L−1} ( MSC^MFCC(j, l) − u_MSC-col^MFCC(j) )² )^(1/2) (46)

u_MSV-col^MFCC(j) = (1/L) · Σ_{l=0}^{L−1} MSV^MFCC(j, l) (47)

σ_MSV-col^MFCC(j) = ( (1/L) · Σ_{l=0}^{L−1} ( MSV^MFCC(j, l) − u_MSV-col^MFCC(j) )² )^(1/2) (48)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_col^MFCC = [u_MSC-col^MFCC(0), σ_MSC-col^MFCC(0), u_MSV-col^MFCC(0), σ_MSV-col^MFCC(0), …, u_MSC-col^MFCC(J−1), σ_MSC-col^MFCC(J−1), u_MSV-col^MFCC(J−1), σ_MSV-col^MFCC(J−1)]^T (49)

If the row-based and column-based modulation spectral feature vectors are combined

together, a larger feature vector of size (4L + 4J) can be obtained:

f^MFCC = [(f_row^MFCC)^T, (f_col^MFCC)^T]^T (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the

column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the

row-based and column-based modulation spectral feature vectors results in a

feature vector of length 4L + 4J. That is, the overall

feature dimension of SMMFCC is 80 + 32 = 112.
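The row- and column-wise statistics of Eqs. (40)-(49) can be sketched as below; the output ordering here groups statistics by matrix rather than interleaving them per index, which is a simplification of this sketch:

```python
import numpy as np

def aggregate(msc, msv):
    """Row/column means and standard deviations of the (J, L) MSC and
    MSV matrices, Eqs. (40)-(49); returns a vector of length 4L + 4J."""
    feats = []
    for m in (msc, msv):
        feats.append(m.mean(axis=0))   # row-based mean, one per feature l
        feats.append(m.std(axis=0))    # row-based std
        feats.append(m.mean(axis=1))   # column-based mean, one per subband j
        feats.append(m.std(axis=1))    # column-based std
    return np.concatenate(feats)
```

For MMFCC (L = 20, J = 8) this yields the 4×20 + 4×8 = 112 SMMFCC values.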

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 le d lt D) row of

the MSC and MSV matrices of MOSC can be computed as follows

u_MSC-row^OSC(d) = (1/J) · Σ_{j=0}^{J−1} MSC^OSC(j, d) (51)

σ_MSC-row^OSC(d) = ( (1/J) · Σ_{j=0}^{J−1} ( MSC^OSC(j, d) − u_MSC-row^OSC(d) )² )^(1/2) (52)

u_MSV-row^OSC(d) = (1/J) · Σ_{j=0}^{J−1} MSV^OSC(j, d) (53)

σ_MSV-row^OSC(d) = ( (1/J) · Σ_{j=0}^{J−1} ( MSV^OSC(j, d) − u_MSV-row^OSC(d) )² )^(1/2) (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

f_row^OSC = [u_MSC-row^OSC(0), σ_MSC-row^OSC(0), u_MSV-row^OSC(0), σ_MSV-row^OSC(0), …, u_MSC-row^OSC(D−1), σ_MSC-row^OSC(D−1), u_MSV-row^OSC(D−1), σ_MSV-row^OSC(D−1)]^T (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)

column of the MSC and MSV matrices can be computed as follows:

u_MSC-col^OSC(j) = (1/D) · Σ_{d=0}^{D−1} MSC^OSC(j, d) (56)

σ_MSC-col^OSC(j) = ( (1/D) · Σ_{d=0}^{D−1} ( MSC^OSC(j, d) − u_MSC-col^OSC(j) )² )^(1/2) (57)

u_MSV-col^OSC(j) = (1/D) · Σ_{d=0}^{D−1} MSV^OSC(j, d) (58)

σ_MSV-col^OSC(j) = ( (1/D) · Σ_{d=0}^{D−1} ( MSV^OSC(j, d) − u_MSV-col^OSC(j) )² )^(1/2) (59)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_col^OSC = [u_MSC-col^OSC(0), σ_MSC-col^OSC(0), u_MSV-col^OSC(0), σ_MSV-col^OSC(0), …, u_MSC-col^OSC(J−1), σ_MSC-col^OSC(J−1), u_MSV-col^OSC(J−1), σ_MSV-col^OSC(J−1)]^T (60)

If the row-based and column-based modulation spectral feature vectors are combined

together, a larger feature vector of size (4D + 4J) can be obtained:

f^OSC = [(f_row^OSC)^T, (f_col^OSC)^T]^T (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the

column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the

row-based and column-based modulation spectral feature vectors results in a

feature vector of length 4D + 4J. That is, the overall

feature dimension of SMOSC is 80 + 32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of

the MSC and MSV matrices of MASE can be computed as follows:

u_MSC-row^NASE(d) = (1/J) · Σ_{j=0}^{J−1} MSC^NASE(j, d) (62)

σ_MSC-row^NASE(d) = ( (1/J) · Σ_{j=0}^{J−1} ( MSC^NASE(j, d) − u_MSC-row^NASE(d) )² )^(1/2) (63)

u_MSV-row^NASE(d) = (1/J) · Σ_{j=0}^{J−1} MSV^NASE(j, d) (64)

σ_MSV-row^NASE(d) = ( (1/J) · Σ_{j=0}^{J−1} ( MSV^NASE(j, d) − u_MSV-row^NASE(d) )² )^(1/2) (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

f_row^NASE = [u_MSC-row^NASE(0), σ_MSC-row^NASE(0), u_MSV-row^NASE(0), σ_MSV-row^NASE(0), …, u_MSC-row^NASE(D−1), σ_MSC-row^NASE(D−1), u_MSV-row^NASE(D−1), σ_MSV-row^NASE(D−1)]^T (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)

column of the MSC and MSV matrices can be computed as follows:

u_MSC-col^NASE(j) = (1/D) · Σ_{d=0}^{D−1} MSC^NASE(j, d) (67)

σ_MSC-col^NASE(j) = ( (1/D) · Σ_{d=0}^{D−1} ( MSC^NASE(j, d) − u_MSC-col^NASE(j) )² )^(1/2) (68)

u_MSV-col^NASE(j) = (1/D) · Σ_{d=0}^{D−1} MSV^NASE(j, d) (69)

σ_MSV-col^NASE(j) = ( (1/D) · Σ_{d=0}^{D−1} ( MSV^NASE(j, d) − u_MSV-col^NASE(j) )² )^(1/2) (70)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f_col^NASE = [u_MSC-col^NASE(0), σ_MSC-col^NASE(0), u_MSV-col^NASE(0), σ_MSV-col^NASE(0), …, u_MSC-col^NASE(J−1), σ_MSC-col^NASE(J−1), u_MSV-col^NASE(J−1), σ_MSV-col^NASE(J−1)]^T (71)

If the row-based and column-based modulation spectral feature vectors are combined

together, a larger feature vector of size (4D + 4J) can be obtained:

f^NASE = [(f_row^NASE)^T, (f_col^NASE)^T]^T (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the

column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the

row-based and column-based modulation spectral feature vectors results in a

feature vector of length 4D + 4J. That is, the overall

feature dimension of SMASE is 76 + 32 = 108.

Fig 28 The row-based modulation spectral feature values (mean and standard deviation computed along each row of the MSC/MSV matrices, over the modulation frequency axis)

Fig 29 The column-based modulation spectral feature values (mean and standard deviation computed along each column of the MSC/MSV matrices, over the feature dimension axis)


216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

f̄_c = (1/N_c) · Σ_{n=1}^{N_c} f_{c,n} (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th

music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c

is the number of training music signals belonging to the c-th music genre. Since the

dynamic ranges of different feature values may differ, a linear normalization is

applied to get the normalized feature vector f̂_c:

f̂_c(m) = ( f_c(m) − f_min(m) ) / ( f_max(m) − f_min(m) ),  1 ≤ c ≤ C (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th

representative feature vector, and f_max(m) and f_min(m) denote respectively the

maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m) (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece

belonging to the c-th music genre.
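The linear normalization of Eqs. (74)-(75) is a per-dimension min-max scaling over the training set; a sketch (the divide-by-zero guard for constant feature dimensions is an addition of this sketch):

```python
import numpy as np

def linear_normalize(train_feats):
    """Min-max normalization of Eqs. (74)-(75).

    train_feats : (num_tracks, M) matrix of training feature vectors.
    Returns the normalized matrix plus (f_min, f_max) so the same
    mapping can be applied to test vectors."""
    f_min = train_feats.min(axis=0)                      # Eq. (75)
    f_max = train_feats.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)   # guard constant dims
    return (train_feats - f_min) / span, f_min, f_max    # Eq. (74)
```

At test time the stored f_min and f_max are reused, so test features are mapped with the training-set statistics.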

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification

accuracy in a lower-dimensional feature vector space. LDA deals with the

discrimination between various classes rather than the representation of all classes.

The objective of LDA is to minimize the within-class distance while maximizing the

between-class distance. In LDA, an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in

order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter

matrix, respectively. The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class

c, C is the total number of music classes, and N_c is the number of training vectors

labeled as class c. The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T (77)

where x̄ is the mean vector of all training vectors. The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) ) (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space In this study a whitening procedure is intergrated with LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the

40

orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the

corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then

whitening transformed by ΦΛ-12

(79) )( T21 xΦΛx minus=w

It can be shown that the whitened within-class scatter matrix

derived from all the whitened training vectors will

become an identity matrix I Thus the whitened between-class scatter matrix

contains all the discriminative information A

transformation matrix Ψ can be determined by finding the eigenvectors of

Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors

corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the

transformation matrix Ψ Finally the optimal whitened LDA transformation matrix

A

)()( 21T21 minusminus= ΦΛSΦΛS WWw

)()( 21T21 minusminus= ΦΛSΦΛS BBw

wBS

WLDA is defined as

(80) WLDA ΨΦΛA 21minus=

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector is computed by

$$\mathbf{y} = \mathbf{A}_{\mathrm{WLDA}}^{\mathrm{T}}\,\mathbf{x} \tag{81}$$
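The whitening-plus-LDA procedure of Eqs. (77)-(80) can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation; the function name `whitened_lda` and the synthetic data in the usage example are assumptions.

```python
import numpy as np

def whitened_lda(X, labels, h):
    """Sketch of the whitened LDA transform (Eqs. 77-80).
    X: (n_samples, H) training matrix; labels: class index per sample.
    Returns the H x h matrix A_WLDA of Eq. (80)."""
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)             # within-class scatter
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                 # between-class scatter, Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                 # Sw = Phi Lambda Phi^T
    W = Phi @ np.diag(lam ** -0.5)                # whitening matrix Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                           # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:h]]     # eigenvectors of the h largest eigenvalues
    return W @ Psi                                # A_WLDA = Phi Lambda^{-1/2} Psi, Eq. (80)
```

Each feature vector is then reduced with `y = A.T @ x`, as in Eq. (81). Note that this sketch assumes S_W is nonsingular, i.e. there are more training vectors than feature dimensions.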

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n} \tag{82}$$

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector with the minimum Euclidean distance to y:

$$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \tag{83}$$
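Eqs. (82)-(83) amount to a standard nearest-centroid classifier; a minimal sketch (function names are illustrative, and the transformed vectors are assumed to be computed already):

```python
import numpy as np

def train_centroids(Y, labels):
    """Per-genre centroid of the transformed training vectors, Eq. (82)."""
    return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(y, centroids):
    """Identified genre = centroid with minimum Euclidean distance, Eq. (83)."""
    return min(centroids, key=lambda c: np.linalg.norm(y - centroids[c]))
```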

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \tag{84}$$

where P_c is the probability of appearance of the c-th music genre, and CA_c is the classification accuracy for the c-th music genre.
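Eq. (84) is simply a class-prior-weighted average of the per-class accuracies; a one-function sketch (the function name is illustrative):

```python
def overall_accuracy(per_class_acc, class_counts):
    """CA = sum_c P_c * CA_c (Eq. 84), where P_c is the share of test
    tracks belonging to class c."""
    total = sum(class_counts)
    return sum(n / total * ca for ca, n in zip(per_class_acc, class_counts))
```

With the test-set class counts above (320, 114, 26, 45, 102, 122) and the SMMFCC1 per-class accuracies of Table 3.2(a), this yields approximately 77.5%.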

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote, respectively, the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA%) for row-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC1                        77.50
SMOSC1                         79.15
SMASE1                         77.78
SMMFCC1+SMOSC1+SMASE1          84.64

Table 3.2 Confusion matrices of row-based modulation spectral feature vectors (columns: actual genre; rows: classified genre): (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Pop/Rock          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Pop/Rock          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Pop/Rock          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Pop/Rock          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote, respectively, the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. Consistent with the row-based results, however, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA%) for column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                        70.64
SMOSC2                         68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2          78.60

Table 3.4 Confusion matrices of column-based modulation spectral feature vectors (columns: actual genre; rows: classified genre): (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1     1           0         6     22
Electronic        0          84     0           2         8      4
Jazz             13           1    19           1         2     19
Metal/Punk        2           7     0          39        30      4
Pop/Rock          0          11     3           3        47     19
World            33          10     3           0         9     54
Total           320         114    26          45       102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2     0           0         3     33
Electronic        0          83     0           1         9      6
Jazz             17           1    20           0         6     20
Metal/Punk        1           5     0          33        21      2
Pop/Rock          0          17     4          10        51     10
World            40           6     2           1        12     51
Total           320         114    26          45       102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0     0           0         2     29
Electronic        0          83     0           1         5      2
Jazz              9           3    17           1         2     15
Metal/Punk        1           5     1          35        24      7
Pop/Rock          2          13     1           8        57     15
World            31          10     7           0        12     54
Total           320         114    26          45       102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5     0           0         3     18
Electronic        0          89     0           2         4      4
Jazz              2           3    19           0         1     10
Metal/Punk        2           2     0          38        21      2
Pop/Rock          0          12     5           4        61     11
World            27           3     2           1        12     77
Total           320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote, respectively, the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA%) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (columns: actual genre; rows: classified genre): (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     1           0         3     19
Electronic        0          86     0           1         7      5
Jazz              2           0    18           0         0      3
Metal/Punk        1           4     0          35        18      2
Pop/Rock          1          16     4           8        67     13
World            16           6     3           1         7     80
Total           320         114    26          45       102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     0           0         1     13
Electronic        0          90     1           2         9      6
Jazz              0           0    21           0         0      4
Metal/Punk        0           2     0          31        21      2
Pop/Rock          0          11     3          10        64     10
World            20          11     1           2         7     87
Total           320         114    26          45       102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2     1           0         0     17
Electronic        1          91     0           1         4      3
Jazz              0           2    19           0         0      5
Metal/Punk        0           2     1          34        20      8
Pop/Rock          2          13     4           8        71      8
World            21           4     1           2         7     81
Total           320         114    26          45       102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     0           0         0      8
Electronic        2          95     0           2         7      9
Jazz              1           1    20           0         0      0
Metal/Punk        0           0     0          35        10      1
Pop/Rock          1          10     3           7        79     11
World            16           6     3           1         6     93
Total           320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of MSCs & MSVs versus the modulation subband energy (MSE) for each feature value

Feature Set                    MSCs & MSVs    MSE
SMMFCC1                        77.50          72.02
SMMFCC2                        70.64          69.82
SMMFCC3                        80.38          79.15
SMOSC1                         79.15          77.50
SMOSC2                         68.59          70.51
SMOSC3                         81.34          80.11
SMASE1                         77.78          76.41
SMASE2                         71.74          71.06
SMASE3                         81.21          79.15
SMMFCC1+SMOSC1+SMASE1          84.64          85.08
SMMFCC2+SMOSC2+SMASE2          78.60          79.01
SMMFCC3+SMOSC3+SMASE3          85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10(3) (2002) 293-302.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7(2) (2005) 308-315.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32(1) (2003) 83-93.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14(8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15(5) (2007) 1654-1664.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "'The way it sounds': timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, 7(6) (2005) 1028-1035.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, September 2003, pp. 8-11.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007 (2006) 1-12.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, March 2005, pp. 197-200.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, 32(1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14(5) (2004) 716-725.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13(12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, 8(6) (2000) 708-716.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, 102(3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, 23(2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, 25(1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, 52(10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), July 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, 13(3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, May 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, 15(4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65(2-3) (2006) 473-484.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1) (1997) 119-139.


Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number    Frequency interval (Hz)
0                [0, 0]
1                (0, 100]
2                (100, 200]
3                (200, 400]
4                (400, 800]
5                (800, 1600]
6                (1600, 3200]
7                (3200, 6400]
8                (6400, 12800]
9                (12800, 22050)

2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$$P(k) = \begin{cases} \dfrac{1}{E_w \cdot N}\,|X(k)|^2, & k = 0,\ N/2 \\[6pt] \dfrac{2}{E_w \cdot N}\,|X(k)|^2, & 0 < k < N/2 \end{cases} \tag{15}$$

where E_w is the energy of the Hamming window function w(n) of size N_w:

$$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2 \tag{16}$$

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE scale filtering operation can be described as follows (see Table 2.3):

$$ASE_i(b) = \sum_{k=I_{bl}}^{I_{bh}} P_i(k), \quad 0 \le b < B,\ 0 \le k \le N/2 - 1 \tag{17}$$

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$$r = 2^{j} \text{ octaves}, \quad -4 \le j \le 3 \tag{18}$$

I_{bl} and I_{bh} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$$I_{bl} = \frac{f_{bl}}{f_s}\,N, \qquad I_{bh} = \frac{f_{bh}}{f_s}\,N \tag{19}$$

where f_s is the sampling frequency, and f_{bl} and f_{bh} are the low frequency and high frequency of the b-th band-pass filter.
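Eq. (19) maps each filter's frequency range to FFT bin indices; a one-function sketch (the truncation to integer indices is an assumption, since the rounding rule is not spelled out here):

```python
def subband_indices(f_low, f_high, fs, n_fft):
    """FFT bin indices of a band-pass filter's low and high edges, per Eq. (19).
    Integer truncation is assumed as the rounding rule."""
    return int(f_low / fs * n_fft), int(f_high / fs * n_fft)
```

For example, the (100, 200] Hz octave filter of Table 2.2 at fs = 44.1 kHz with a 1024-point FFT spans bins 2 to 4.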

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$$ASE(b) = \sum_{k=I_{bl}}^{I_{bh}} P(k), \quad 0 \le b \le B+1 \tag{20}$$

Each ASE coefficient is then converted to the decibel scale:

$$ASE_{dB}(b) = 10 \log_{10}\big(ASE(b)\big), \quad 0 \le b \le B+1 \tag{21}$$

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 \tag{22}$$

where the RMS-norm gain value R is defined as

$$R = \sqrt{\sum_{b=0}^{B+1} \big(ASE_{dB}(b)\big)^2} \tag{23}$$

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, one coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame is represented as follows:

$$\mathbf{x}_{NASE} = [R,\ NASE(0),\ NASE(1),\ \ldots,\ NASE(B+1)]^{\mathrm{T}} \tag{24}$$
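Steps 2-3 (Eqs. 17-24) can be sketched as below, assuming the power spectrum of Eq. (15) is already computed. The band-edge list and the small floor added before the logarithm are illustrative assumptions, not part of the MPEG-7 definition.

```python
import numpy as np

def nase(P, band_edges_hz, fs, n_fft):
    """NASE sketch (Eqs. 17-24): sum power within each subband, convert to dB,
    and normalize by the RMS-norm gain R. band_edges_hz lists the subband
    boundaries from 0 Hz up to fs/2 (illustrative)."""
    idx = np.floor(np.asarray(band_edges_hz) / fs * n_fft).astype(int)  # Eq. (19)
    ase = np.array([P[idx[b]: max(idx[b + 1], idx[b] + 1)].sum()        # Eq. (20)
                    for b in range(len(idx) - 1)])
    ase_db = 10.0 * np.log10(ase + 1e-12)       # Eq. (21), small floor for the log
    R = np.sqrt(np.sum(ase_db ** 2))            # RMS-norm gain, Eq. (23)
    return np.concatenate(([R], ase_db / R))    # [R, NASE(0), ..., NASE(B+1)], Eq. (24)
```

By construction the NASE part of the output vector has unit Euclidean norm, which is what the normalization by R in Eq. (22) achieves.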

Fig. 2.3 The flowchart for computing NASE (framing → windowing → FFT → subband decomposition → normalization)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (16 coefficients between loEdge = 62.5 Hz and hiEdge = 16 kHz, plus one coefficient below loEdge and one above hiEdge)

Table 2.3 The range of each NASE band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 62]
1                (62, 88]
2                (88, 125]
3                (125, 176]
4                (176, 250]
5                (250, 353]
6                (353, 500]
7                (500, 707]
8                (707, 1000]
9                (1000, 1414]
10               (1414, 2000]
11               (2000, 2828]
12               (2828, 4000]
13               (4000, 5656]
14               (5656, 8000]
15               (8000, 11313]
16               (11313, 16000]
17               (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of the music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC; the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W + n}[l]\; e^{-j2\pi n m / W}, \quad 0 \le m < W,\ 0 \le l < L \tag{25}$$

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\ 0 \le l < L \tag{26}$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{MFCC}(j, l) = \max_{\Phi_{jl} \le m < \Phi_{jh}} \bar{M}^{MFCC}(m, l) \tag{27}$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_{jl} \le m < \Phi_{jh}} \bar{M}^{MFCC}(m, l) \tag{28}$$

where Φ_{jl} and Φ_{jh} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \tag{29}$$

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
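Steps 2-3 above (Eqs. 25-29) can be sketched as follows. The function name `msc_msv` is an assumption; the input is any per-frame feature trajectory (MFCC here, OSC and NASE identically), and the default subband edges follow Table 2.4.

```python
import numpy as np

def msc_msv(feature_traj, W=512, subband_edges=(0, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Modulation spectral contrast/valley sketch (Eqs. 25-29).
    feature_traj: (n_frames, L) matrix of per-frame feature values."""
    n_frames, L = feature_traj.shape
    hop = W // 2                                       # 50% texture-window overlap
    mags = []
    for start in range(0, n_frames - W + 1, hop):
        seg = feature_traj[start:start + W]            # one texture window
        mags.append(np.abs(np.fft.fft(seg, axis=0)))   # FFT along time, Eq. (25)
    M = np.mean(mags, axis=0)                          # averaged spectrogram, Eq. (26)
    J = len(subband_edges) - 1
    MSC = np.zeros((J, L))
    MSV = np.zeros((J, L))
    for j in range(J):
        band = M[subband_edges[j]: subband_edges[j + 1]]
        MSV[j] = band.min(axis=0)                      # Eq. (28)
        MSC[j] = band.max(axis=0) - MSV[j]             # Eqs. (27) and (29)
    return MSC, MSV
```

A feature value that oscillates with a strong periodicity produces a large contrast in the modulation subband containing that rate, while a flat trajectory produces near-zero contrast, which is exactly the rhythmic/non-rhythmic distinction described above.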

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC; the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W + n}[d]\; e^{-j2\pi n m / W}, \quad 0 \le m < W,\ 0 \le d < D \tag{30}$$

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D \tag{31}$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{OSC}(j, d) = \max_{\Phi_{jl} \le m < \Phi_{jh}} \bar{M}^{OSC}(m, d) \tag{32}$$

$$MSV^{OSC}(j, d) = \min_{\Phi_{jl} \le m < \Phi_{jh}} \bar{M}^{OSC}(m, d) \tag{33}$$

where Φ_{jl} and Φ_{jh} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \tag{34}$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC

2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE; the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W + n}[d]\; e^{-j2\pi n m / W}, \quad 0 \le m < W,\ 0 \le d < D \tag{35}$$

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D \tag{36}$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{NASE}(j, d) = \max_{\Phi_{jl} \le m < \Phi_{jh}} \bar{M}^{NASE}(m, d) \tag{37}$$

$$MSV^{NASE}(j, d) = \min_{\Phi_{jl} \le m < \Phi_{jh}} \bar{M}^{NASE}(m, d) \tag{38}$$

where Φ_{jl} and Φ_{jh} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \tag{39}$$

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.

Fig. 2.7 The flowchart for extracting MASE (framing → NASE extraction → DFT along each feature trajectory → windowed averaging of the modulation spectra → contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0, 2)                              [0, 0.33)
1                [2, 4)                              [0.33, 0.66)
2                [4, 8)                              [0.66, 1.32)
3                [8, 16)                             [1.32, 2.64)
4                [16, 32)                            [2.64, 5.28)
5                [32, 64)                            [5.28, 10.56)
6                [64, 128)                           [10.56, 21.12)
7                [128, 256)                          [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at various modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 \le l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u^{MFCC}_{MSC,row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j,l)    (40)

\sigma^{MFCC}_{MSC,row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j,l) - u^{MFCC}_{MSC,row}(l) \right)^2 \right)^{1/2}    (41)

u^{MFCC}_{MSV,row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j,l)    (42)

\sigma^{MFCC}_{MSV,row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j,l) - u^{MFCC}_{MSV,row}(l) \right)^2 \right)^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L

and can be represented as

f^{MFCC}_{row} = [u^{MFCC}_{MSC,row}(0), \sigma^{MFCC}_{MSC,row}(0), u^{MFCC}_{MSV,row}(0), \sigma^{MFCC}_{MSV,row}(0), \ldots, u^{MFCC}_{MSC,row}(L-1), \sigma^{MFCC}_{MSC,row}(L-1), u^{MFCC}_{MSV,row}(L-1), \sigma^{MFCC}_{MSV,row}(L-1)]^T    (44)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

u^{MFCC}_{MSC,col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j,l)    (45)

\sigma^{MFCC}_{MSC,col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j,l) - u^{MFCC}_{MSC,col}(j) \right)^2 \right)^{1/2}    (46)

u^{MFCC}_{MSV,col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j,l)    (47)

\sigma^{MFCC}_{MSV,col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j,l) - u^{MFCC}_{MSV,col}(j) \right)^2 \right)^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

f^{MFCC}_{col} = [u^{MFCC}_{MSC,col}(0), \sigma^{MFCC}_{MSC,col}(0), u^{MFCC}_{MSV,col}(0), \sigma^{MFCC}_{MSV,col}(0), \ldots, u^{MFCC}_{MSC,col}(J-1), \sigma^{MFCC}_{MSC,col}(J-1), u^{MFCC}_{MSV,col}(J-1), \sigma^{MFCC}_{MSV,col}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4 \times 20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4 \times 8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J; that is, the overall feature dimension of SMMFCC is 80+32 = 112.
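The row/column aggregation of Eqs. (40)-(50) amounts to taking means and standard deviations along both axes of the L x J matrices. A compact numpy sketch (the concatenation order is simplified relative to the interleaving of Eq. (44), and the function name is ours):

```python
import numpy as np

def aggregate(msc, msv):
    """Statistical aggregation of an L x J MSC matrix and MSV matrix
    into a 4L + 4J feature vector (112 for L = 20, J = 8).
    np.std with the default ddof=0 matches the 1/J (or 1/L)
    definition of Eqs. (41) and (46)."""
    parts = []
    for m in (msc, msv):                 # row-based stats, Eqs. (40)-(43)
        parts += [m.mean(axis=1), m.std(axis=1)]
    for m in (msc, msv):                 # column-based stats, Eqs. (45)-(48)
        parts += [m.mean(axis=0), m.std(axis=0)]
    return np.concatenate(parts)         # f = [f_row^T, f_col^T]^T, Eq. (50)
```

The same routine serves SMOSC and SMASE by passing in their own MSC/MSV matrices (D = 20 and D = 19 rows, respectively).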

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u^{OSC}_{MSC,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j,d)    (51)

\sigma^{OSC}_{MSC,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j,d) - u^{OSC}_{MSC,row}(d) \right)^2 \right)^{1/2}    (52)

u^{OSC}_{MSV,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j,d)    (53)

\sigma^{OSC}_{MSV,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j,d) - u^{OSC}_{MSV,row}(d) \right)^2 \right)^{1/2}    (54)


Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{OSC}_{row} = [u^{OSC}_{MSC,row}(0), \sigma^{OSC}_{MSC,row}(0), u^{OSC}_{MSV,row}(0), \sigma^{OSC}_{MSV,row}(0), \ldots, u^{OSC}_{MSC,row}(D-1), \sigma^{OSC}_{MSC,row}(D-1), u^{OSC}_{MSV,row}(D-1), \sigma^{OSC}_{MSV,row}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u^{OSC}_{MSC,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j,d)    (56)

\sigma^{OSC}_{MSC,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j,d) - u^{OSC}_{MSC,col}(j) \right)^2 \right)^{1/2}    (57)

u^{OSC}_{MSV,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j,d)    (58)

\sigma^{OSC}_{MSV,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j,d) - u^{OSC}_{MSV,col}(j) \right)^2 \right)^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{OSC}_{col} = [u^{OSC}_{MSC,col}(0), \sigma^{OSC}_{MSC,col}(0), u^{OSC}_{MSV,col}(0), \sigma^{OSC}_{MSV,col}(0), \ldots, u^{OSC}_{MSC,col}(J-1), \sigma^{OSC}_{MSC,col}(J-1), u^{OSC}_{MSV,col}(J-1), \sigma^{OSC}_{MSV,col}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4 \times 20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4 \times 8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u^{NASE}_{MSC,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j,d)    (62)

\sigma^{NASE}_{MSC,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j,d) - u^{NASE}_{MSC,row}(d) \right)^2 \right)^{1/2}    (63)

u^{NASE}_{MSV,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j,d)    (64)

\sigma^{NASE}_{MSV,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j,d) - u^{NASE}_{MSV,row}(d) \right)^2 \right)^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{NASE}_{row} = [u^{NASE}_{MSC,row}(0), \sigma^{NASE}_{MSC,row}(0), u^{NASE}_{MSV,row}(0), \sigma^{NASE}_{MSV,row}(0), \ldots, u^{NASE}_{MSC,row}(D-1), \sigma^{NASE}_{MSC,row}(D-1), u^{NASE}_{MSV,row}(D-1), \sigma^{NASE}_{MSV,row}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u^{NASE}_{MSC,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j,d)    (67)

\sigma^{NASE}_{MSC,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j,d) - u^{NASE}_{MSC,col}(j) \right)^2 \right)^{1/2}    (68)

u^{NASE}_{MSV,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j,d)    (69)

\sigma^{NASE}_{MSV,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j,d) - u^{NASE}_{MSV,col}(j) \right)^2 \right)^{1/2}    (70)


Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{NASE}_{col} = [u^{NASE}_{MSC,col}(0), \sigma^{NASE}_{MSC,col}(0), u^{NASE}_{MSV,col}(0), \sigma^{NASE}_{MSV,col}(0), \ldots, u^{NASE}_{MSC,col}(J-1), \sigma^{NASE}_{MSC,col}(J-1), u^{NASE}_{MSV,col}(J-1), \sigma^{NASE}_{MSV,col}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f^{NASE}_{row})^T, (f^{NASE}_{col})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4 \times 19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4 \times 8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 2.8 The row-based modulation spectral feature values: for each feature dimension, the mean \mu_{row} and standard deviation \sigma_{row} are computed over the modulation-frequency axis of the MSC/MSV matrices (one texture window).

Fig. 2.9 The column-based modulation spectral feature values: for each modulation subband, the mean \mu_{col} and standard deviation \sigma_{col} are computed over the feature-dimension axis of the MSC/MSV matrices (one texture window).

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, f_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \quad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
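The min-max normalization of Eqs. (74)-(75) can be sketched as follows (assumed array layout; the guard against a zero range is our addition, not part of the thesis):

```python
import numpy as np

def linear_normalize(feats):
    """Min-max normalize each feature value over all training signals,
    Eqs. (74)-(75). feats: (N, M) array, one row per training signal."""
    f_min = feats.min(axis=0)
    f_max = feats.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # avoid divide-by-zero
    return (feats - f_min) / span, f_min, f_max
```

At test time the stored f_min and f_max from training are reused to normalize each input feature vector in the same way.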

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h \le H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr((A^T S_W A)^{-1} (A^T S_B A))    (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the corresponding eigenvalues; thus S_W \Phi = \Phi \Lambda. Each training vector x is then whitening-transformed by \Phi \Lambda^{-1/2}:

x_w = (\Phi \Lambda^{-1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues form the column vectors of the transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi    (80)

A_{WLDA} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x    (81)
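The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched with numpy's symmetric eigensolver (a minimal sketch under our own variable names; the small-eigenvalue floor is a numerical safeguard not mentioned in the text):

```python
import numpy as np

def whitened_lda(X, y, h):
    """Whitened LDA transform A_WLDA = Phi Lambda^{-1/2} Psi, Eqs. (76)-(80).
    X: (N, H) training vectors, y: (N,) integer class labels, h <= C-1."""
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter, Eq. (76)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)              # between-class scatter, Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)              # Sw Phi = Phi Lambda
    W = Phi / np.sqrt(np.maximum(lam, 1e-12))  # whitening: Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                        # whitened between-class scatter
    ev, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(ev)[::-1][:h]]     # top-h eigenvectors of Sb_w
    return W @ Psi                             # A_WLDA; project with y = A^T x
```

In the thesis H is the dimension of the combined feature vector and h \le C-1 = 5 for the six music genres.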

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study the nearest centroid classifier is used for music genre classification. For the c-th (1 \le c \le C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector with the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
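The classification rule of Eqs. (82)-(83) is a nearest-centroid decision; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def genre_centroids(Y, labels, C):
    """Eq. (82): per-genre mean of the WLDA-transformed training vectors."""
    return np.stack([Y[labels == c].mean(axis=0) for c in range(C)])

def classify(y_vec, centroids):
    """Eq. (83): pick the genre whose centroid is nearest in Euclidean
    distance to the transformed test vector."""
    return int(np.argmin(np.linalg.norm(centroids - y_vec, axis=1)))
```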

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall classification accuracy is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
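Eq. (84) weights each per-genre accuracy by that genre's share of the test set. A small sketch (function name is ours; P_c is estimated from the test-set class counts):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): CA = sum_c P_c * CA_c, with P_c estimated as the
    fraction of test tracks belonging to class c."""
    total = sum(class_counts)
    return sum(a * n / total for a, n in zip(per_class_acc, class_counts))
```

With the test-set counts above (320, 114, 26, 45, 102, 122), a class that holds 320 of the 729 test tracks contributes nearly half of the overall accuracy.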

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set               | CA (%)
SMMFCC1                   | 77.50
SMOSC1                    | 79.15
SMASE1                    | 77.78
SMMFCC1+SMOSC1+SMASE1     | 84.64


Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Rows are the classified genres and columns the actual genres; each entry gives the number of tracks, with the percentage of the column total in parentheses. Column totals: 320 Classic, 114 Electronic, 26 Jazz, 45 Metal/Punk, 102 Pop/Rock, 122 World.

(a) SMMFCC1
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     275 (85.94)    0 (0.00)      2 (7.69)      0 (0.00)      1 (0.98)      19 (15.57)
Electronic  0 (0.00)       91 (79.82)    0 (0.00)      1 (2.22)      7 (6.86)      6 (4.92)
Jazz        6 (1.88)       0 (0.00)      18 (69.23)    0 (0.00)      0 (0.00)      4 (3.28)
Metal/Punk  2 (0.63)       3 (2.63)      0 (0.00)      36 (80.00)    20 (19.61)    4 (3.28)
Pop/Rock    4 (1.25)       12 (10.53)    5 (19.23)     8 (17.78)     70 (68.63)    14 (11.48)
World       33 (10.31)     8 (7.02)      1 (3.85)      0 (0.00)      4 (3.92)      75 (61.48)

(b) SMOSC1
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     292 (91.25)    1 (0.88)      1 (3.85)      0 (0.00)      2 (1.96)      10 (8.20)
Electronic  1 (0.31)       89 (78.07)    1 (3.85)      2 (4.44)      11 (10.78)    11 (9.02)
Jazz        4 (1.25)       0 (0.00)      19 (73.08)    1 (2.22)      1 (0.98)      6 (4.92)
Metal/Punk  0 (0.00)       5 (4.39)      0 (0.00)      32 (71.11)    21 (20.59)    3 (2.46)
Pop/Rock    0 (0.00)       13 (11.40)    3 (11.54)     10 (22.22)    61 (59.80)    8 (6.56)
World       23 (7.19)      6 (5.26)      2 (7.69)      0 (0.00)      6 (5.88)      84 (68.85)

(c) SMASE1
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     286 (89.38)    3 (2.63)      1 (3.85)      0 (0.00)      3 (2.94)      18 (14.75)
Electronic  0 (0.00)       87 (76.32)    1 (3.85)      1 (2.22)      9 (8.82)      5 (4.10)
Jazz        5 (1.56)       4 (3.51)      17 (65.38)    0 (0.00)      0 (0.00)      9 (7.38)
Metal/Punk  0 (0.00)       4 (3.51)      1 (3.85)      36 (80.00)    18 (17.65)    4 (3.28)
Pop/Rock    1 (0.31)       10 (8.77)     3 (11.54)     7 (15.56)     68 (66.67)    13 (10.66)
World       28 (8.75)      6 (5.26)      3 (11.54)     1 (2.22)      4 (3.92)      73 (59.84)

(d) SMMFCC1+SMOSC1+SMASE1
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     300 (93.75)    0 (0.00)      1 (3.85)      0 (0.00)      0 (0.00)      9 (7.38)
Electronic  0 (0.00)       96 (84.21)    1 (3.85)      1 (2.22)      9 (8.82)      9 (7.38)
Jazz        2 (0.63)       1 (0.88)      21 (80.77)    0 (0.00)      0 (0.00)      1 (0.82)
Metal/Punk  0 (0.00)       1 (0.88)      0 (0.00)      34 (75.56)    8 (7.84)      1 (0.82)
Pop/Rock    1 (0.31)       9 (7.89)      2 (7.69)      9 (20.00)     80 (78.43)    16 (13.11)
World       17 (5.31)      7 (6.14)      1 (3.85)      1 (2.22)      5 (4.90)      86 (70.49)


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based result. As with the row-based features, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set               | CA (%)
SMMFCC2                   | 70.64
SMOSC2                    | 68.59
SMASE2                    | 71.74
SMMFCC2+SMOSC2+SMASE2     | 78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Rows are the classified genres and columns the actual genres; each entry gives the number of tracks, with the percentage of the column total in parentheses. Column totals: 320 Classic, 114 Electronic, 26 Jazz, 45 Metal/Punk, 102 Pop/Rock, 122 World.

(a) SMMFCC2
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     272 (85.00)    1 (0.88)      1 (3.85)      0 (0.00)      6 (5.88)      22 (18.03)
Electronic  0 (0.00)       84 (73.68)    0 (0.00)      2 (4.44)      8 (7.84)      4 (3.28)
Jazz        13 (4.06)      1 (0.88)      19 (73.08)    1 (2.22)      2 (1.96)      19 (15.57)
Metal/Punk  2 (0.63)       7 (6.14)      0 (0.00)      39 (86.67)    30 (29.41)    4 (3.28)
Pop/Rock    0 (0.00)       11 (9.65)     3 (11.54)     3 (6.67)      47 (46.08)    19 (15.57)
World       33 (10.31)     10 (8.77)     3 (11.54)     0 (0.00)      9 (8.82)      54 (44.26)

(b) SMOSC2
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     262 (81.88)    2 (1.75)      0 (0.00)      0 (0.00)      3 (2.94)      33 (27.05)
Electronic  0 (0.00)       83 (72.81)    0 (0.00)      1 (2.22)      9 (8.82)      6 (4.92)
Jazz        17 (5.31)      1 (0.88)      20 (76.92)    0 (0.00)      6 (5.88)      20 (16.39)
Metal/Punk  1 (0.31)       5 (4.39)      0 (0.00)      33 (73.33)    21 (20.59)    2 (1.64)
Pop/Rock    0 (0.00)       17 (14.91)    4 (15.38)     10 (22.22)    51 (50.00)    10 (8.20)
World       40 (12.50)     6 (5.26)      2 (7.69)      1 (2.22)      12 (11.76)    51 (41.80)

(c) SMASE2
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     277 (86.56)    0 (0.00)      0 (0.00)      0 (0.00)      2 (1.96)      29 (23.77)
Electronic  0 (0.00)       83 (72.81)    0 (0.00)      1 (2.22)      5 (4.90)      2 (1.64)
Jazz        9 (2.81)       3 (2.63)      17 (65.38)    1 (2.22)      2 (1.96)      15 (12.30)
Metal/Punk  1 (0.31)       5 (4.39)      1 (3.85)      35 (77.78)    24 (23.53)    7 (5.74)
Pop/Rock    2 (0.63)       13 (11.40)    1 (3.85)      8 (17.78)     57 (55.88)    15 (12.30)
World       31 (9.69)      10 (8.77)     7 (26.92)     0 (0.00)      12 (11.76)    54 (44.26)

(d) SMMFCC2+SMOSC2+SMASE2
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     289 (90.31)    5 (4.39)      0 (0.00)      0 (0.00)      3 (2.94)      18 (14.75)
Electronic  0 (0.00)       89 (78.07)    0 (0.00)      2 (4.44)      4 (3.92)      4 (3.28)
Jazz        2 (0.63)       3 (2.63)      19 (73.08)    0 (0.00)      1 (0.98)      10 (8.20)
Metal/Punk  2 (0.63)       2 (1.75)      0 (0.00)      38 (84.44)    21 (20.59)    2 (1.64)
Pop/Rock    0 (0.00)       12 (10.53)    5 (19.23)     4 (8.89)      61 (59.80)    11 (9.02)
World       27 (8.44)      3 (2.63)      2 (7.69)      1 (2.22)      12 (11.76)    77 (63.11)

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vector gets better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set               | CA (%)
SMMFCC3                   | 80.38
SMOSC3                    | 81.34
SMASE3                    | 81.21
SMMFCC3+SMOSC3+SMASE3     | 85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Rows are the classified genres and columns the actual genres; each entry gives the number of tracks, with the percentage of the column total in parentheses. Column totals: 320 Classic, 114 Electronic, 26 Jazz, 45 Metal/Punk, 102 Pop/Rock, 122 World.

(a) SMMFCC3
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     300 (93.75)    2 (1.75)      1 (3.85)      0 (0.00)      3 (2.94)      19 (15.57)
Electronic  0 (0.00)       86 (75.44)    0 (0.00)      1 (2.22)      7 (6.86)      5 (4.10)
Jazz        2 (0.63)       0 (0.00)      18 (69.23)    0 (0.00)      0 (0.00)      3 (2.46)
Metal/Punk  1 (0.31)       4 (3.51)      0 (0.00)      35 (77.78)    18 (17.65)    2 (1.64)
Pop/Rock    1 (0.31)       16 (14.04)    4 (15.38)     8 (17.78)     67 (65.69)    13 (10.66)
World       16 (5.00)      6 (5.26)      3 (11.54)     1 (2.22)      7 (6.86)      80 (65.57)

(b) SMOSC3
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     300 (93.75)    0 (0.00)      0 (0.00)      0 (0.00)      1 (0.98)      13 (10.66)
Electronic  0 (0.00)       90 (78.95)    1 (3.85)      2 (4.44)      9 (8.82)      6 (4.92)
Jazz        0 (0.00)       0 (0.00)      21 (80.77)    0 (0.00)      0 (0.00)      4 (3.28)
Metal/Punk  0 (0.00)       2 (1.75)      0 (0.00)      31 (68.89)    21 (20.59)    2 (1.64)
Pop/Rock    0 (0.00)       11 (9.65)     3 (11.54)     10 (22.22)    64 (62.75)    10 (8.20)
World       20 (6.25)      11 (9.65)     1 (3.85)      2 (4.44)      7 (6.86)      87 (71.31)

(c) SMASE3
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     296 (92.50)    2 (1.75)      1 (3.85)      0 (0.00)      0 (0.00)      17 (13.93)
Electronic  1 (0.31)       91 (79.82)    0 (0.00)      1 (2.22)      4 (3.92)      3 (2.46)
Jazz        0 (0.00)       2 (1.75)      19 (73.08)    0 (0.00)      0 (0.00)      5 (4.10)
Metal/Punk  0 (0.00)       2 (1.75)      1 (3.85)      34 (75.56)    20 (19.61)    8 (6.56)
Pop/Rock    2 (0.63)       13 (11.40)    4 (15.38)     8 (17.78)     71 (69.61)    8 (6.56)
World       21 (6.56)      4 (3.51)      1 (3.85)      2 (4.44)      7 (6.86)      81 (66.39)

(d) SMMFCC3+SMOSC3+SMASE3
            Classic        Electronic    Jazz          Metal/Punk    Pop/Rock      World
Classic     300 (93.75)    2 (1.75)      0 (0.00)      0 (0.00)      0 (0.00)      8 (6.56)
Electronic  2 (0.63)       95 (83.33)    0 (0.00)      2 (4.44)      7 (6.86)      9 (7.38)
Jazz        1 (0.31)       1 (0.88)      20 (76.92)    0 (0.00)      0 (0.00)      0 (0.00)
Metal/Punk  0 (0.00)       0 (0.00)      0 (0.00)      35 (77.78)    10 (9.80)     1 (0.82)
Pop/Rock    1 (0.31)       10 (8.77)     3 (11.54)     7 (15.56)     79 (77.45)    11 (9.02)
World       16 (5.00)      6 (5.26)      3 (11.54)     1 (2.22)      6 (5.88)      93 (76.23)

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) obtained using MSCs & MSVs versus the subband energy (MSE) as feature values

Feature Set               | MSCs & MSVs | MSE
SMMFCC1                   | 77.50       | 72.02
SMMFCC2                   | 70.64       | 69.82
SMMFCC3                   | 80.38       | 79.15
SMOSC1                    | 79.15       | 77.50
SMOSC2                    | 68.59       | 70.51
SMOSC3                    | 81.34       | 80.11
SMASE1                    | 77.78       | 76.41
SMASE2                    | 71.74       | 71.06
SMASE3                    | 81.21       | 79.15
SMMFCC1+SMOSC1+SMASE1     | 84.64       | 85.08
SMMFCC2+SMOSC2+SMASE2     | 78.60       | 79.01
SMMFCC3+SMOSC3+SMASE3     | 85.32       | 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio features," IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.

Page 27: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

21

where Ew is the energy of the Hamming window function w(n) of size Nw

|)(|1

0

2summinus

=

=wN

nw nwE (16)

Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge"), covering a spectrum of 8 octaves (see Fig. 2.4). The NASE subband filtering operation can be described as follows (see Table 2.3):

$ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k)$, $0 \le b < B$, $0 \le k \le N/2-1$  (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$r = 2^j$ octaves, $-4 \le j \le 3$  (18)

$I_b^l$ and $I_b^h$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$I_b^l = f_b^l N / f_s$,  $I_b^h = f_b^h N / f_s$  (19)

where $f_s$ is the sampling frequency, and $f_b^l$ and $f_b^h$ are the low frequency and
high frequency of the b-th band-pass filter
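As a rough sketch of this subband decomposition (not the thesis code; the function name and the defaults `fs = 22050` and `n_fft = 1024` are illustrative assumptions), the logarithmic band edges and the FFT bin indices of Eq. (19) could be computed as:

```python
import numpy as np

def subband_indices(lo_edge=62.5, hi_edge=16000.0, r=0.5, fs=22050, n_fft=1024):
    """Return the band edges and (low, high) FFT bin indices of each
    logarithmic subband between lo_edge and hi_edge (B = 8/r subbands)."""
    B = int(8 / r)                                   # 16 subbands for r = 1/2
    # each subband spans r octaves, so edges grow by a factor of 2**r
    edges = lo_edge * (2.0 ** (r * np.arange(B + 1)))
    lo = np.floor(edges[:-1] * n_fft / fs).astype(int)  # I_b^l = f_b^l * N / f_s
    hi = np.floor(edges[1:] * n_fft / fs).astype(int)   # I_b^h = f_b^h * N / f_s
    return edges, lo, hi

edges, lo, hi = subband_indices()
```

With r = 1/2 the last edge lands exactly on hiEdge, since 62.5 Hz x 2^8 = 16 kHz.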

Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of power spectrum coefficients within this subband:

$ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k)$, $0 \le b \le B+1$  (20)

Each ASE coefficient is then converted to the decibel scale:

$ASE_{dB}(b) = 10 \log_{10}(ASE(b))$, $0 \le b \le B+1$  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$NASE(b) = ASE_{dB}(b) / R$, $0 \le b \le B+1$  (22)

where the RMS-norm gain value R is defined as

$R = \left( \sum_{b=0}^{B+1} (ASE_{dB}(b))^2 \right)^{1/2}$  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame is represented as

$x_{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T$  (24)
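Eqs. (20)-(24) can be sketched as follows (a minimal illustration, not the thesis implementation; it assumes a precomputed per-frame power spectrum and precomputed subband bin indices, and adds a small floor before the logarithm to avoid log(0)):

```python
import numpy as np

def nase(power_spectrum, lo, hi):
    """Sketch of Eqs. (20)-(24): ASE -> dB -> RMS-normalized NASE.

    power_spectrum : power spectrum P(k) of one frame
    lo, hi         : low/high FFT bin index of each band-pass filter
    """
    # Eq. (20): sum the power spectrum coefficients within each subband
    ase = np.array([power_spectrum[l:h + 1].sum() for l, h in zip(lo, hi)])
    # Eq. (21): convert to the decibel scale (tiny floor guards log10(0))
    ase_db = 10.0 * np.log10(ase + 1e-12)
    # Eq. (23): RMS-norm gain value R
    R = np.sqrt(np.sum(ase_db ** 2))
    # Eq. (22) + Eq. (24): normalize and prepend R to form the feature vector
    return np.concatenate(([R], ase_db / R))
```

By construction, the tail of the returned vector (everything after R) has unit Euclidean norm.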

Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope)

Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: 16 subband coefficients between loEdge = 62.5 Hz and hiEdge = 16 kHz, plus one coefficient below loEdge and one above hiEdge

Table 2.3 The frequency range of each Normalized Audio Spectral Envelope band-pass filter

Filter number | Frequency interval (Hz)
0 | (0, 62]
1 | (62, 88]
2 | (88, 125]
3 | (125, 176]
4 | (176, 250]
5 | (250, 353]
6 | (353, 500]
7 | (500, 707]
8 | (707, 1000]
9 | (1000, 1414]
10 | (1414, 2000]
11 | (2000, 2828]
12 | (2828, 4000]
13 | (4000, 5656]
14 | (5656, 8000]
15 | (8000, 11313]
16 | (11313, 16000]
17 | (16000, 22050]

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $MFCC_i[l]$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, l) = \left| \sum_{n=0}^{W-1} MFCC_{t \times W/2 + n}[l] \, e^{-j 2 \pi n m / W} \right|$, $0 \le m < W$, $0 \le l < L$  (25)

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, l)$, $0 \le m < W$, $0 \le l < L$  (26)

where T is the total number of texture windows in the music track.
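The two steps above can be sketched generically for any frame-based feature matrix (a simplified illustration under the stated assumptions, W = 512 frames and 50% window overlap; the function name is ours, and any trailing frames that do not fill a full texture window are simply dropped):

```python
import numpy as np

def modulation_spectrogram(feats, W=512):
    """Sketch of Eqs. (25)-(26).

    feats : (num_frames, L) matrix of per-frame feature values
    W     : texture-window length in frames (~6 s, 50% overlap)
    Returns the (W, L) time-averaged magnitude modulation spectrogram.
    """
    num_frames, L = feats.shape
    hop = W // 2                                      # 50% overlap
    mags = []
    for start in range(0, num_frames - W + 1, hop):
        seg = feats[start:start + W]                  # one texture window
        # Eq. (25): FFT along the time axis, independently per feature value
        mags.append(np.abs(np.fft.fft(seg, axis=0)))
    # Eq. (26): average the magnitude spectrograms over all T windows
    return np.mean(mags, axis=0)
```
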

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l)$  (27)

$MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l)$  (28)

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)$  (29)

As a result, all MSCs (or MSVs) form an L×J matrix containing the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
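Eqs. (27)-(29) can be sketched as follows (an illustrative fragment, not the thesis code; the modulation subband index ranges are those of Table 2.4, and the returned matrices are indexed (j, l) as in the equations):

```python
import numpy as np

def modulation_contrast(M, band_lo, band_hi):
    """Sketch of Eqs. (27)-(29): per-subband peak, valley, and contrast.

    M                : (W, L) averaged modulation spectrogram
    band_lo, band_hi : modulation frequency index range of each subband
    Returns (MSP, MSV, MSC), each a (J, L) matrix.
    """
    msp = np.array([M[lo:hi].max(axis=0) for lo, hi in zip(band_lo, band_hi)])
    msv = np.array([M[lo:hi].min(axis=0) for lo, hi in zip(band_lo, band_hi)])
    return msp, msv, msp - msv          # Eq. (29): contrast = peak - valley

# the J = 8 logarithmically spaced modulation subbands of Table 2.4
band_lo = [0, 2, 4, 8, 16, 32, 64, 128]
band_hi = [2, 4, 8, 16, 32, 64, 128, 256]
```

Since the peak and valley are taken over the same index range, every contrast value is non-negative.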

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $OSC_i[d]$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \left| \sum_{n=0}^{W-1} OSC_{t \times W/2 + n}[d] \, e^{-j 2 \pi n m / W} \right|$, $0 \le m < W$, $0 \le d < D$  (30)

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d)$, $0 \le m < W$, $0 \le d < D$  (31)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d)$  (32)

$MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d)$  (33)

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)$  (34)

As a result, all MSCs (or MSVs) form a D×J matrix containing the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC

2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $NASE_i[d]$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \left| \sum_{n=0}^{W-1} NASE_{t \times W/2 + n}[d] \, e^{-j 2 \pi n m / W} \right|$, $0 \le m < W$, $0 \le d < D$  (35)

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d)$, $0 \le m < W$, $0 \le d < D$  (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d)$  (37)

$MSV^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d)$  (38)

where $\Phi_j^l$ and $\Phi_j^h$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)$  (39)

As a result, all MSCs (or MSVs) form a D×J matrix containing the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → windowed averaging → contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2) | [0, 0.33)
1 | [2, 4) | [0.33, 0.66)
2 | [4, 8) | [0.66, 1.32)
3 | [8, 16) | [1.32, 2.64)
4 | [16, 32) | [2.64, 5.28)
5 | [32, 64) | [5.28, 10.56)
6 | [64, 128) | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$\mu_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)$  (40)

$\sigma_{MSC-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSC^{MFCC}(j, l) - \mu_{MSC-row}^{MFCC}(l))^2 \right)^{1/2}$  (41)

$\mu_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)$  (42)

$\sigma_{MSV-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSV^{MFCC}(j, l) - \mu_{MSV-row}^{MFCC}(l))^2 \right)^{1/2}$  (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$f_{row}^{MFCC} = [\mu_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), \mu_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), \ldots, \mu_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), \mu_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T$  (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)$  (45)

$\sigma_{MSC-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} (MSC^{MFCC}(j, l) - \mu_{MSC-col}^{MFCC}(j))^2 \right)^{1/2}$  (46)

$\mu_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)$  (47)

$\sigma_{MSV-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} (MSV^{MFCC}(j, l) - \mu_{MSV-col}^{MFCC}(j))^2 \right)^{1/2}$  (48)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f_{col}^{MFCC} = [\mu_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), \mu_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), \ldots, \mu_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), \mu_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T$  (49)

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4L+4J) is obtained:

$f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T$  (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J; that is, the overall feature dimension of SMMFCC is 80+32 = 112.
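The aggregation of Eqs. (40)-(50) can be sketched as follows (an illustrative fragment, not the thesis code; here the MSC/MSV matrices are stored as (L, J) arrays with rows indexed by feature value, and the concatenation order is simplified relative to the interleaved ordering of Eqs. (44) and (49)):

```python
import numpy as np

def aggregate(msc, msv):
    """Sketch of Eqs. (40)-(50) for (L, J) MSC and MSV matrices
    (rows = feature values, columns = modulation subbands)."""
    parts = []
    for M in (msc, msv):
        # row-based statistics, Eqs. (40)-(43): one mean and one std per
        # feature value l, averaged over the J subbands -> 2L values each
        parts += [M.mean(axis=1), M.std(axis=1)]
    for M in (msc, msv):
        # column-based statistics, Eqs. (45)-(48): one mean and one std per
        # subband j, averaged over the L feature values -> 2J values each
        parts += [M.mean(axis=0), M.std(axis=0)]
    return np.concatenate(parts)       # total length 4L + 4J

# for MMFCC: L = 20 MFCC values, J = 8 subbands -> 4*20 + 4*8 = 112 values
```
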

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$\mu_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)$  (51)

$\sigma_{MSC-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSC^{OSC}(j, d) - \mu_{MSC-row}^{OSC}(d))^2 \right)^{1/2}$  (52)

$\mu_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)$  (53)

$\sigma_{MSV-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSV^{OSC}(j, d) - \mu_{MSV-row}^{OSC}(d))^2 \right)^{1/2}$  (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$f_{row}^{OSC} = [\mu_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), \mu_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), \ldots, \mu_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), \mu_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T$  (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)$  (56)

$\sigma_{MSC-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSC^{OSC}(j, d) - \mu_{MSC-col}^{OSC}(j))^2 \right)^{1/2}$  (57)

$\mu_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)$  (58)

$\sigma_{MSV-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSV^{OSC}(j, d) - \mu_{MSV-col}^{OSC}(j))^2 \right)^{1/2}$  (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f_{col}^{OSC} = [\mu_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), \mu_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots, \mu_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), \mu_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T$  (60)

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D+4J) is obtained:

$f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T$  (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$\mu_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)$  (62)

$\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d))^2 \right)^{1/2}$  (63)

$\mu_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)$  (64)

$\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d))^2 \right)^{1/2}$  (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$f_{row}^{NASE} = [\mu_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), \mu_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, \mu_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), \mu_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T$  (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$\mu_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)$  (67)

$\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j))^2 \right)^{1/2}$  (68)

$\mu_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)$  (69)

$\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j))^2 \right)^{1/2}$  (70)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f_{col}^{NASE} = [\mu_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), \mu_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, \mu_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), \mu_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T$  (71)

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D+4J) is obtained:

$f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T$  (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 2.8 The row-based modulation spectral feature values: for each feature dimension, the mean and standard deviation of the MSC and MSV entries are computed along the modulation frequency axis of the texture window

Fig. 2.9 The column-based modulation spectral feature values: for each modulation subband, the mean and standard deviation of the MSC and MSV entries are computed along the feature dimension axis of the texture window

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}$  (73)

where $f_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{f}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{f}_c$:

$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}$, $1 \le c \le C$  (74)

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th normalized representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$f_{max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)$,  $f_{min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)$  (75)

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
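The min-max normalization of Eqs. (74)-(75) can be sketched as follows (an illustrative fragment, not the thesis code; the guard for constant feature dimensions is our addition to keep the division well-defined):

```python
import numpy as np

def fit_minmax(train_feats):
    """Sketch of Eqs. (74)-(75): per-dimension linear normalization using
    the extremes observed over all training feature vectors.

    train_feats : (num_tracks, num_features) matrix
    Returns (normalized features, f_min, scale) so that test vectors can be
    normalized with the same statistics: (x - f_min) / scale.
    """
    f_min = train_feats.min(axis=0)                  # Eq. (75), minimum
    f_max = train_feats.max(axis=0)                  # Eq. (75), maximum
    # guard against constant dimensions (f_max == f_min) -> divide by 1
    scale = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (train_feats - f_min) / scale, f_min, scale
```
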

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes: its objective is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among the music classes.

Let $S_W$ and $S_B$ denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T$  (76)

where $x_{c,n}$ is the n-th feature vector labeled as class c, $\bar{x}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T$  (77)

where $\bar{x}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$J_F(A) = tr((A^T S_W A)^{-1}(A^T S_B A))$  (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $S_W$ are calculated. Let $\Phi$ denote the matrix whose columns are the orthonormal eigenvectors of $S_W$, and $\Lambda$ the diagonal matrix formed by the corresponding eigenvalues; thus $S_W \Phi = \Phi \Lambda$. Each training vector x is then whitening transformed by $\Phi \Lambda^{-1/2}$:

$x_w = (\Phi \Lambda^{-1/2})^T x$  (79)

It can be shown that the whitened within-class scatter matrix $S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix $S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2})$ contains all the discriminative information. A transformation matrix $\Psi$ can be determined by finding the eigenvectors of $S_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix $\Psi$. Finally, the optimal whitened LDA transformation matrix $A_{WLDA}$ is defined as

$A_{WLDA} = \Phi \Lambda^{-1/2} \Psi$  (80)

$A_{WLDA}$ is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$y = A_{WLDA}^T x$  (81)
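The whitened LDA procedure of Eqs. (76)-(81) can be sketched as follows (a minimal illustration under the stated assumptions, not the thesis code; a small regularizer is added before the inverse square root to keep the whitening numerically stable):

```python
import numpy as np

def whitened_lda(X, y, h):
    """Sketch of Eqs. (76)-(81): whiten by S_W, then take the top-h
    eigenvectors of the whitened between-class scatter (h <= C - 1).

    X : (num_samples, d) training vectors, y : class labels
    Returns the (d, h) projection matrix A_WLDA.
    """
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                          # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all) # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                # S_W = Phi diag(lam) Phi^T
    white = Phi @ np.diag(1.0 / np.sqrt(lam + 1e-12))   # Phi Lambda^{-1/2}
    Sb_w = white.T @ Sb @ white                  # whitened between-class scatter
    evals, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(evals)[::-1][:h]]    # top-h eigenvectors
    return white @ Psi                           # Eq. (80): A_WLDA
```

Each feature vector is then projected with `y_low = x @ A` per Eq. (81).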

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix $A_{WLDA}$. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}$  (82)

where $y_{c,n}$ denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{y}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to y:

$s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)$  (83)
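The nearest centroid decision of Eqs. (82)-(83) can be sketched as follows (an illustrative fragment, not the thesis code; the function name is ours):

```python
import numpy as np

def nearest_centroid(train, labels, query):
    """Sketch of Eqs. (82)-(83): per-class centroids of the transformed
    training vectors, then minimum Euclidean distance decides the genre."""
    classes = np.unique(labels)
    # Eq. (82): representative feature vector of each class
    centroids = np.array([train[labels == c].mean(axis=0) for c in classes])
    # Eq. (83): pick the class whose centroid is closest to the query
    dists = np.linalg.norm(centroids - query, axis=1)
    return classes[np.argmin(dists)]
```
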

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. The music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$CA = \sum_{1 \le c \le C} P_c \cdot CA_c$  (84)

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
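Eq. (84) is a weighted average of the per-class accuracies, with each class weighted by its share of the test set. A minimal sketch (our function name; the per-genre counts are those of the ISMIR2004 test split quoted above):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Sketch of Eq. (84): CA = sum_c P_c * CA_c, where P_c is the fraction
    of test tracks belonging to class c."""
    total = sum(class_counts)
    return sum(acc * n / total for acc, n in zip(per_class_acc, class_counts))

# e.g. the ISMIR2004 test split: 320, 114, 26, 45, 102, 122 tracks per genre
```
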

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and that the combined feature vector performs best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64

Table 3.2 Confusion matrices of row-based modulation spectral feature vectors (rows: classified genre; columns: actual genre): (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 275 | 0 | 2 | 0 | 1 | 19
Electronic | 0 | 91 | 0 | 1 | 7 | 6
Jazz | 6 | 0 | 18 | 0 | 0 | 4
Metal/Punk | 2 | 3 | 0 | 36 | 20 | 4
Pop/Rock | 4 | 12 | 5 | 8 | 70 | 14
World | 33 | 8 | 1 | 0 | 4 | 75
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) SMMFCC1, percentages
Classic | 85.94 | 0.00 | 7.69 | 0.00 | 0.98 | 15.57
Electronic | 0.00 | 79.82 | 0.00 | 2.22 | 6.86 | 4.92
Jazz | 1.88 | 0.00 | 69.23 | 0.00 | 0.00 | 3.28
Metal/Punk | 0.63 | 2.63 | 0.00 | 80.00 | 19.61 | 3.28
Pop/Rock | 1.25 | 10.53 | 19.23 | 17.78 | 68.63 | 11.48
World | 10.31 | 7.02 | 3.85 | 0.00 | 3.92 | 61.48

(b) SMOSC1, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 292 | 1 | 1 | 0 | 2 | 10
Electronic | 1 | 89 | 1 | 2 | 11 | 11
Jazz | 4 | 0 | 19 | 1 | 1 | 6
Metal/Punk | 0 | 5 | 0 | 32 | 21 | 3
Pop/Rock | 0 | 13 | 3 | 10 | 61 | 8
World | 23 | 6 | 2 | 0 | 6 | 84
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) SMOSC1, percentages
Classic | 91.25 | 0.88 | 3.85 | 0.00 | 1.96 | 8.20
Electronic | 0.31 | 78.07 | 3.85 | 4.44 | 10.78 | 9.02
Jazz | 1.25 | 0.00 | 73.08 | 2.22 | 0.98 | 4.92
Metal/Punk | 0.00 | 4.39 | 0.00 | 71.11 | 20.59 | 2.46
Pop/Rock | 0.00 | 11.40 | 11.54 | 22.22 | 59.80 | 6.56
World | 7.19 | 5.26 | 7.69 | 0.00 | 5.88 | 68.85

(c) SMASE1, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 286 | 3 | 1 | 0 | 3 | 18
Electronic | 0 | 87 | 1 | 1 | 9 | 5
Jazz | 5 | 4 | 17 | 0 | 0 | 9
Metal/Punk | 0 | 4 | 1 | 36 | 18 | 4
Pop/Rock | 1 | 10 | 3 | 7 | 68 | 13
World | 28 | 6 | 3 | 1 | 4 | 73
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) SMASE1, percentages
Classic | 89.38 | 2.63 | 3.85 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 76.32 | 3.85 | 2.22 | 8.82 | 4.10
Jazz | 1.56 | 3.51 | 65.38 | 0.00 | 0.00 | 7.38
Metal/Punk | 0.00 | 3.51 | 3.85 | 80.00 | 17.65 | 3.28
Pop/Rock | 0.31 | 8.77 | 11.54 | 15.56 | 66.67 | 10.66
World | 8.75 | 5.26 | 11.54 | 2.22 | 3.92 | 59.84

(d) SMMFCC1+SMOSC1+SMASE1, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 0 | 1 | 0 | 0 | 9
Electronic | 0 | 96 | 1 | 1 | 9 | 9
Jazz | 2 | 1 | 21 | 0 | 0 | 1
Metal/Punk | 0 | 1 | 0 | 34 | 8 | 1
Pop/Rock | 1 | 9 | 2 | 9 | 80 | 16
World | 17 | 7 | 1 | 1 | 5 | 86
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) SMMFCC1+SMOSC1+SMASE1, percentages
Classic | 93.75 | 0.00 | 3.85 | 0.00 | 0.00 | 7.38
Electronic | 0.00 | 84.21 | 3.85 | 2.22 | 8.82 | 7.38
Jazz | 0.63 | 0.88 | 80.77 | 0.00 | 0.00 | 0.82
Metal/Punk | 0.00 | 0.88 | 0.00 | 75.56 | 7.84 | 0.82
Pop/Rock | 0.31 | 7.89 | 7.69 | 20.00 | 78.43 | 13.11
World | 5.31 | 6.14 | 3.85 | 2.22 | 4.90 | 70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60

Table 3.4 Confusion matrices of column-based modulation spectral feature vectors (rows: classified genre; columns: actual genre): (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 272 | 1 | 1 | 0 | 6 | 22
Electronic | 0 | 84 | 0 | 2 | 8 | 4
Jazz | 13 | 1 | 19 | 1 | 2 | 19
Metal/Punk | 2 | 7 | 0 | 39 | 30 | 4
Pop/Rock | 0 | 11 | 3 | 3 | 47 | 19
World | 33 | 10 | 3 | 0 | 9 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) SMMFCC2, percentages
Classic | 85.00 | 0.88 | 3.85 | 0.00 | 5.88 | 18.03
Electronic | 0.00 | 73.68 | 0.00 | 4.44 | 7.84 | 3.28
Jazz | 4.06 | 0.88 | 73.08 | 2.22 | 1.96 | 15.57
Metal/Punk | 0.63 | 6.14 | 0.00 | 86.67 | 29.41 | 3.28
Pop/Rock | 0.00 | 9.65 | 11.54 | 6.67 | 46.08 | 15.57
World | 10.31 | 8.77 | 11.54 | 0.00 | 8.82 | 44.26

(b) SMOSC2, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 262 | 2 | 0 | 0 | 3 | 33
Electronic | 0 | 83 | 0 | 1 | 9 | 6
Jazz | 17 | 1 | 20 | 0 | 6 | 20
Metal/Punk | 1 | 5 | 0 | 33 | 21 | 2
Pop/Rock | 0 | 17 | 4 | 10 | 51 | 10
World | 40 | 6 | 2 | 1 | 12 | 51
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) SMOSC2, percentages
Classic | 81.88 | 1.75 | 0.00 | 0.00 | 2.94 | 27.05
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 8.82 | 4.92
Jazz | 5.31 | 0.88 | 76.92 | 0.00 | 5.88 | 16.39
Metal/Punk | 0.31 | 4.39 | 0.00 | 73.33 | 20.59 | 1.64
Pop/Rock | 0.00 | 14.91 | 15.38 | 22.22 | 50.00 | 8.20
World | 12.50 | 5.26 | 7.69 | 2.22 | 11.76 | 41.80

(c) SMASE2, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 277 | 0 | 0 | 0 | 2 | 29
Electronic | 0 | 83 | 0 | 1 | 5 | 2
Jazz | 9 | 3 | 17 | 1 | 2 | 15
Metal/Punk | 1 | 5 | 1 | 35 | 24 | 7
Pop/Rock | 2 | 13 | 1 | 8 | 57 | 15
World | 31 | 10 | 7 | 0 | 12 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) SMASE2, percentages
Classic | 86.56 | 0.00 | 0.00 | 0.00 | 1.96 | 23.77
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 4.90 | 1.64
Jazz | 2.81 | 2.63 | 65.38 | 2.22 | 1.96 | 12.30
Metal/Punk | 0.31 | 4.39 | 3.85 | 77.78 | 23.53 | 5.74
Pop/Rock | 0.63 | 11.40 | 3.85 | 17.78 | 55.88 | 12.30
World | 9.69 | 8.77 | 26.92 | 0.00 | 11.76 | 44.26

(d) SMMFCC2+SMOSC2+SMASE2, counts
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 289 | 5 | 0 | 0 | 3 | 18
Electronic | 0 | 89 | 0 | 2 | 4 | 4
Jazz | 2 | 3 | 19 | 0 | 1 | 10
Metal/Punk | 2 | 2 | 0 | 38 | 21 | 2
Pop/Rock | 0 | 12 | 5 | 4 | 61 | 11
World | 27 | 3 | 2 | 1 | 12 | 77
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) SMMFCC2+SMOSC2+SMASE2, percentages
Classic | 90.31 | 4.39 | 0.00 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 78.07 | 0.00 | 4.44 | 3.92 | 3.28
Jazz | 0.63 | 2.63 | 73.08 | 0.00 | 0.98 | 8.20
Metal/Punk | 0.63 | 1.75 | 0.00 | 84.44 | 20.59 | 1.64
Pop/Rock | 0.00 | 10.53 | 19.23 | 8.89 | 59.80 | 9.02
World | 8.44 | 2.63 | 7.69 | 2.22 | 11.76 | 63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector achieves better classification performance than its individual row-based or column-based counterpart. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a)        Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      300        2         1       0         3       19
Electronic     0       86         0       1         7        5
Jazz           2        0        18       0         0        3
MetalPunk      1        4         0      35        18        2
PopRock        1       16         4       8        67       13
World         16        6         3       1         7       80
Total        320      114        26      45       102      122

(a)        Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     1.75      3.85     0.00      2.94    15.57
Electronic    0.00    75.44      0.00     2.22      6.86     4.10
Jazz          0.63     0.00     69.23     0.00      0.00     2.46
MetalPunk     0.31     3.51      0.00    77.78     17.65     1.64
PopRock       0.31    14.04     15.38    17.78     65.69    10.66
World         5.00     5.26     11.54     2.22      6.86    65.57

(b)        Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      300        0         0       0         1       13
Electronic     0       90         1       2         9        6
Jazz           0        0        21       0         0        4
MetalPunk      0        2         0      31        21        2
PopRock        0       11         3      10        64       10
World         20       11         1       2         7       87
Total        320      114        26      45       102      122

(b)        Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      0.00     0.00      0.98    10.66
Electronic    0.00    78.95      3.85     4.44      8.82     4.92
Jazz          0.00     0.00     80.77     0.00      0.00     3.28
MetalPunk     0.00     1.75      0.00    68.89     20.59     1.64
PopRock       0.00     9.65     11.54    22.22     62.75     8.20
World         6.25     9.65      3.85     4.44      6.86    71.31

(c)        Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      296        2         1       0         0       17
Electronic     1       91         0       1         4        3
Jazz           0        2        19       0         0        5
MetalPunk      0        2         1      34        20        8
PopRock        2       13         4       8        71        8
World         21        4         1       2         7       81
Total        320      114        26      45       102      122

(c)        Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      92.50     1.75      3.85     0.00      0.00    13.93
Electronic    0.31    79.82      0.00     2.22      3.92     2.46
Jazz          0.00     1.75     73.08     0.00      0.00     4.10
MetalPunk     0.00     1.75      3.85    75.56     19.61     6.56
PopRock       0.63    11.40     15.38    17.78     69.61     6.56
World         6.56     3.51      3.85     4.44      6.86    66.39

(d)        Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      300        2         0       0         0        8
Electronic     2       95         0       2         7        9
Jazz           1        1        20       0         0        0
MetalPunk      0        0         0      35        10        1
PopRock        1       10         3       7        79       11
World         16        6         3       1         6       93
Total        320      114        26      45       102      122

(d)        Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     1.75      0.00     0.00      0.00     6.56
Electronic    0.63    83.33      0.00     4.44      6.86     7.38
Jazz          0.31     0.88     76.92     0.00      0.00     0.00
MetalPunk     0.00     0.00      0.00    77.78      9.80     0.82
PopRock       0.31     8.77     11.54    15.56     77.45     9.02
World         5.00     5.26     11.54     2.22      5.88    76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set                  MSCs & MSVs    MSE
SMMFCC1                         77.50      72.02
SMMFCC2                         70.64      69.82
SMMFCC3                         80.38      79.15
SMOSC1                          79.15      77.50
SMOSC2                          68.59      70.51
SMOSC3                          81.34      80.11
SMASE1                          77.78      76.41
SMASE2                          71.74      71.06
SMASE3                          81.21      79.15
SMMFCC1+SMOSC1+SMASE1           84.64      85.08
SMMFCC2+SMOSC2+SMASE2           78.60      79.01
SMMFCC3+SMOSC3+SMASE3           85.32      85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. of the Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_l^b}^{I_h^b} P(k),  0 \le b \le B+1   (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10 \log_{10}(ASE(b)),  0 \le b \le B+1   (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_{dB}(b) / R,  0 \le b \le B+1   (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1} (ASE_{dB}(b))^2}   (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame is represented as

x_{NASE} = [R, NASE(0), NASE(1), …, NASE(B+1)]^T   (24)
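As a rough illustration of Eqs. (20)-(24), the NASE computation can be sketched in Python; the power spectrum `power_spec` and the subband edge list `band_edges` are placeholder inputs, not the exact MPEG-7 parameters:

```python
import numpy as np

def nase(power_spec, band_edges):
    """Sketch of NASE (Eqs. 20-24): sum power per subband, convert to dB,
    then normalize by the RMS-norm gain value R."""
    # Eq. (20): ASE(b) = sum of power-spectrum coefficients in subband b
    ase = np.array([power_spec[lo:hi].sum() for lo, hi in band_edges])
    # Eq. (21): convert to the decibel scale (epsilon guards log of zero)
    ase_db = 10.0 * np.log10(ase + 1e-12)
    # Eq. (23): RMS-norm gain value R
    r = np.sqrt(np.sum(ase_db ** 2))
    # Eq. (22): normalize each dB-scale coefficient by R
    nase_coeffs = ase_db / r
    # Eq. (24): feature vector [R, NASE(0), ..., NASE(B+1)]
    return np.concatenate(([r], nase_coeffs))

# toy example: 8 spectrum bins split into 3 subbands
spec = np.abs(np.fft.rfft(np.random.randn(14))) ** 2
feat = nase(spec, [(0, 2), (2, 5), (5, 8)])
```

Multiplying the normalized coefficients back by `feat[0]` (the gain R) recovers the dB-scale envelope, which is what makes the representation gain-invariant.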

[Fig. 2.3 The flowchart for computing NASE: framing of the input signal, windowing, FFT, subband decomposition, and normalized audio spectral envelope computation]

[Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: one coefficient below loEdge = 62.5 Hz, 16 coefficients in the logarithmically spaced bands (62.5, 88.4, 125, …, 11313.7, 16000 Hz), and one coefficient above hiEdge = 16 kHz]

Table 2.3 The frequency range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62.5]
1               (62.5, 88.4]
2               (88.4, 125]
3               (125, 176.8]
4               (176.8, 250]
5               (250, 353.6]
6               (353.6, 500]
7               (500, 707.1]
8               (707.1, 1000]
9               (1000, 1414.2]
10              (1414.2, 2000]
11              (2000, 2828.4]
12              (2828.4, 4000]
13              (4000, 5656.9]
14              (5656.9, 8000]
15              (8000, 11313.7]
16              (11313.7, 16000]
17              (16000, 22050]
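The logarithmically spaced edges between loEdge and hiEdge in Table 2.3 follow the usual octave-spacing rule edge(b) = loEdge · 2^(r·b); a small illustrative computation (not the thesis code) reproduces them:

```python
import math

# half-octave resolution between loEdge and hiEdge, as in MPEG-7 ASE
lo, hi, r = 62.5, 16000.0, 0.5
n_bands = round(math.log2(hi / lo) / r)        # 8 octaves / 0.5 = 16 bands
edges = [lo * 2 ** (r * b) for b in range(n_bands + 1)]
# edges[0] = 62.5, edges[1] ~ 88.4, ..., edges[16] = 16000.0
```

The inner 16 bands of Table 2.3 are exactly the intervals between successive `edges` values, with the first and last filters covering the ranges below loEdge and above hiEdge.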

2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W + n}[l] \, e^{-j 2\pi m n / W},  0 \le m < W, 0 \le l < L   (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|,  0 \le m < W, 0 \le l < L   (26)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)   (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)   (28)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)   (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.

Fig. 2.5 The flowchart for extracting MMFCC
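The three steps above can be sketched as follows; the feature trajectory `feat_traj` (frames × L) and the subband index ranges are illustrative placeholders, and texture windows are taken without overlap for brevity:

```python
import numpy as np

def modulation_msc_msv(feat_traj, W=8, subbands=((0, 2), (2, 4), (4, 8))):
    """Sketch of Eqs. (25)-(29): FFT along each feature trajectory per
    texture window, time-average the magnitudes, then take per-subband
    peak (MSP), valley (MSV), and contrast (MSC = MSP - MSV)."""
    n_frames, L = feat_traj.shape
    T = n_frames // W                       # number of texture windows
    # Eq. (25): modulation spectrogram magnitude of each texture window
    mags = [np.abs(np.fft.fft(feat_traj[t*W:(t+1)*W], axis=0)) for t in range(T)]
    # Eq. (26): time-average of the magnitude modulation spectrograms (W x L)
    M_bar = np.mean(mags, axis=0)
    J = len(subbands)
    msc = np.zeros((J, L))
    msv = np.zeros((J, L))
    for j, (lo, hi) in enumerate(subbands):
        msp = M_bar[lo:hi].max(axis=0)      # Eq. (27): subband peak
        msv[j] = M_bar[lo:hi].min(axis=0)   # Eq. (28): subband valley
        msc[j] = msp - msv[j]               # Eq. (29): contrast
    return msc, msv

traj = np.random.randn(32, 5)               # 32 frames, L = 5 feature values
msc, msv = modulation_msc_msv(traj)
```

The same routine applies unchanged to OSC and NASE trajectories, since only the per-frame feature vector differs.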

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let OSC_i[d], 0 \le d < D, be the d-th OSC of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W + n}[d] \, e^{-j 2\pi m n / W},  0 \le m < W, 0 \le d < D   (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W, 0 \le d < D   (31)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as

MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)   (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)   (33)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)   (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC

2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 \le d < D, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W + n}[d] \, e^{-j 2\pi m n / W},  0 \le m < W, 0 \le d < D   (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W, 0 \le d < D   (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)   (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)   (38)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)   (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

[Fig. 2.7 The flowchart for extracting MASE: framing of the music signal, NASE extraction per frame, DFT of each feature trajectory within a texture window, windowing/averaging of the modulation spectra, and contrast/valley determination]

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 \le l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)   (40)

\sigma_{MSC-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSC^{MFCC}(j, l) - u_{MSC-row}^{MFCC}(l))^2 \right)^{1/2}   (41)

u_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)   (42)

\sigma_{MSV-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSV^{MFCC}(j, l) - u_{MSV-row}^{MFCC}(l))^2 \right)^{1/2}   (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [u_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), u_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), …, u_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), u_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T   (44)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)   (45)

\sigma_{MSC-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} (MSC^{MFCC}(j, l) - u_{MSC-col}^{MFCC}(j))^2 \right)^{1/2}   (46)

u_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)   (47)

\sigma_{MSV-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} (MSV^{MFCC}(j, l) - u_{MSV-col}^{MFCC}(j))^2 \right)^{1/2}   (48)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [u_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), u_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), …, u_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), u_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T   (49)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T   (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
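As an illustrative sketch (not the thesis code), the row- and column-based aggregation of Eqs. (40)-(50) amounts to means and standard deviations of the MSC and MSV matrices along each axis:

```python
import numpy as np

def aggregate(msc, msv):
    """Sketch of Eqs. (40)-(50): concatenate per-row and per-column
    mean/std of the MSC and MSV matrices (each of shape J x L)."""
    def stats(mat, axis):
        # mean and population std along the given axis, paired per element
        return np.column_stack([mat.mean(axis=axis), mat.std(axis=axis)])
    # row-based: statistics over the J modulation subbands -> 4L values
    f_row = np.hstack([stats(msc, 0), stats(msv, 0)]).ravel()
    # column-based: statistics over the L feature values -> 4J values
    f_col = np.hstack([stats(msc, 1), stats(msv, 1)]).ravel()
    # Eq. (50): combined row-based and column-based feature vector
    return np.concatenate([f_row, f_col])

J, L = 8, 20
f = aggregate(np.random.rand(J, L), np.random.rand(J, L))
```

With J = 8 subbands and L = 20 MFCC coefficients this yields the 4L + 4J = 112 values quoted for SMMFCC; only the exact interleaving order within the vector is an assumption here.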

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)

\sigma_{MSC-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSC^{OSC}(j, d) - u_{MSC-row}^{OSC}(d))^2 \right)^{1/2}   (52)

u_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)

\sigma_{MSV-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSV^{OSC}(j, d) - u_{MSV-row}^{OSC}(d))^2 \right)^{1/2}   (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [u_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), u_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), …, u_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), u_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)

\sigma_{MSC-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSC^{OSC}(j, d) - u_{MSC-col}^{OSC}(j))^2 \right)^{1/2}   (57)

u_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)

\sigma_{MSV-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSV^{OSC}(j, d) - u_{MSV-col}^{OSC}(j))^2 \right)^{1/2}   (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [u_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), …, u_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), u_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T   (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T   (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSC^{NASE}(j, d) - u_{MSC-row}^{NASE}(d))^2 \right)^{1/2}   (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} (MSV^{NASE}(j, d) - u_{MSV-row}^{NASE}(d))^2 \right)^{1/2}   (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [u_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), …, u_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), u_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSC^{NASE}(j, d) - u_{MSC-col}^{NASE}(j))^2 \right)^{1/2}   (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} (MSV^{NASE}(j, d) - u_{MSV-col}^{NASE}(j))^2 \right)^{1/2}   (70)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [u_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), …, u_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), u_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T   (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

[Fig. 2.8 The row-based modulation spectral aggregation: for each feature dimension, the mean and standard deviation of the MSC (or MSV) values across all modulation frequency subbands of the texture window are computed]

[Fig. 2.9 The column-based modulation spectral aggregation: for each modulation subband, the mean and standard deviation of the MSC (or MSV) values across all feature dimensions of the texture window are computed]

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},  1 \le c \le C   (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m)

f_{min}(m) = \min_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
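A minimal numpy sketch of the min-max normalization in Eqs. (74)-(75); the guard against constant dimensions is an added safety, not part of the thesis formulation:

```python
import numpy as np

def minmax_normalize(train_feats):
    """Sketch of Eqs. (74)-(75): per-dimension min-max normalization
    using the extrema over all training feature vectors."""
    f_min = train_feats.min(axis=0)     # Eq. (75): per-dimension minimum
    f_max = train_feats.max(axis=0)     # Eq. (75): per-dimension maximum
    # avoid division by zero for dimensions that are constant in training
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    normalize = lambda f: (f - f_min) / span   # Eq. (74)
    return normalize, f_min, f_max

X = np.random.rand(50, 112) * 10.0      # 50 training vectors, 112 dimensions
normalize, f_min, f_max = minmax_normalize(X)
Xn = normalize(X)
```

The same `normalize` closure (with the training extrema frozen) is what would be applied to each test feature vector in the classification phase.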

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification

39

accuracy at a lower dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximize the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

)()(1

T

1sumsum

= =

minusminus=C

ccnc

N

ncnc

c

xxxxSW (76)

where xcn is the n-th feature vector labeled as class c cx is the mean vector of class

c C is the total number of music classes and Nc is the number of training vectors

labeled as class c The between-class scatter matrix is given by

))((1

Tsum=

minusminus=C

ccccN xxxxSB (77)

where x is the mean vector of all training vectors The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion JF defined as the ratio of between-class scatter to within-class scatter

(78) ))()(()( T1T ASAASAA BWminus= trJ F

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of SW are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of SW, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus SWΦ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

x_w = (\Phi \Lambda^{-1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_{W_w} = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, will become an identity matrix I. Thus the whitened between-class scatter matrix S_{B_w} = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B_w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi    (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x    (81)
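The whitened LDA construction of Eqs. (76)-(80) can be sketched with NumPy's symmetric eigendecomposition. This is a minimal illustration, not the thesis implementation; the small floor on the eigenvalues is an added assumption to keep the whitening numerically stable:

```python
import numpy as np

def whitened_lda(X, labels):
    """Whitened LDA (Eqs. (76)-(80)): returns A_WLDA of shape (H, C-1).
    X: (N, H) training vectors; labels: (N,) integer class ids."""
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    SW = np.zeros((H, H))
    SB = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)              # Eq. (76)
        d = (mc - mean_all)[:, None]
        SB += len(Xc) * (d @ d.T)                  # Eq. (77)
    # Whitening: SW = Phi Lambda Phi^T, so W = Phi Lambda^{-1/2}.
    lam, Phi = np.linalg.eigh(SW)
    W = Phi @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-12)))
    SBw = W.T @ SB @ W                             # whitened between-class scatter
    ev, Psi = np.linalg.eigh(SBw)                  # eigenvalues in ascending order
    Psi = Psi[:, ::-1][:, :len(classes) - 1]       # top C-1 eigenvectors
    return W @ Psi                                 # Eq. (80): A_WLDA
```

Each feature vector is then projected with `y = A.T @ x` as in Eq. (81).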

23 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 le c le C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{cn}    (82)

where y_{cn} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
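Eqs. (82)-(83) amount to a nearest centroid rule in the transformed space. A minimal sketch; the function names are illustrative:

```python
import numpy as np

def class_centroids(Y, labels, num_classes):
    """Eq. (82): mean transformed vector per class.
    Y: (N, h) whitened-LDA-transformed training vectors."""
    return np.vstack([Y[labels == c].mean(axis=0) for c in range(num_classes)])

def nearest_centroid_classify(y, centroids):
    """Eq. (83): index of the centroid with minimum Euclidean distance to y."""
    d = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(d))
```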

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
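Eq. (84) is a class-probability-weighted average of the per-class accuracies. A sketch, assuming P_c is estimated as the fraction N_c/N of test tracks belonging to class c:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): weight each class accuracy CA_c by its appearance
    probability P_c = N_c / N estimated from the class counts."""
    total = sum(class_counts)
    return sum((n / total) * ca for ca, n in zip(per_class_acc, class_counts))
```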

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA%) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64


Table 32 Confusion matrices of row-based modulation spectral feature vectors (each column corresponds to the actual genre, each row to the classified genre): (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Pop/Rock          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Pop/Rock          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Pop/Rock          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Pop/Rock          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49

32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again yields the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA%) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 34 Confusion matrices of column-based modulation spectral feature vectors (each column corresponds to the actual genre, each row to the classified genre): (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1     1           0         6     22
Electronic        0          84     0           2         8      4
Jazz             13           1    19           1         2     19
Metal/Punk        2           7     0          39        30      4
Pop/Rock          0          11     3           3        47     19
World            33          10     3           0         9     54
Total           320         114    26          45       102    122

(a) SMMFCC2, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2     0           0         3     33
Electronic        0          83     0           1         9      6
Jazz             17           1    20           0         6     20
Metal/Punk        1           5     0          33        21      2
Pop/Rock          0          17     4          10        51     10
World            40           6     2           1        12     51
Total           320         114    26          45       102    122

(b) SMOSC2, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0     0           0         2     29
Electronic        0          83     0           1         5      2
Jazz              9           3    17           1         2     15
Metal/Punk        1           5     1          35        24      7
Pop/Rock          2          13     1           8        57     15
World            31          10     7           0        12     54
Total           320         114    26          45       102    122

(c) SMASE2, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5     0           0         3     18
Electronic        0          89     0           2         4      4
Jazz              2           3    19           0         1     10
Metal/Punk        2           2     0          38        21      2
Pop/Rock          0          12     5           4        61     11
World            27           3     2           1        12     77
Total           320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

33 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA%) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (each column corresponds to the actual genre, each row to the classified genre): (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     1           0         3     19
Electronic        0          86     0           1         7      5
Jazz              2           0    18           0         0      3
Metal/Punk        1           4     0          35        18      2
Pop/Rock          1          16     4           8        67     13
World            16           6     3           1         7     80
Total           320         114    26          45       102    122

(a) SMMFCC3, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     0           0         1     13
Electronic        0          90     1           2         9      6
Jazz              0           0    21           0         0      4
Metal/Punk        0           2     0          31        21      2
Pop/Rock          0          11     3          10        64     10
World            20          11     1           2         7     87
Total           320         114    26          45       102    122

(b) SMOSC3, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2     1           0         0     17
Electronic        1          91     0           1         4      3
Jazz              0           2    19           0         0      5
Metal/Punk        0           2     1          34        20      8
Pop/Rock          2          13     4           8        71      8
World            21           4     1           2         7     81
Total           320         114    26          45       102    122

(c) SMASE3, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3, number of tracks
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     0           0         0      8
Electronic        2          95     0           2         7      9
Jazz              1           1    20           0         0      0
Metal/Punk        0           0     0          35        10      1
Pop/Rock          1          10     3           7        79     11
World            16           6     3           1         6     93
Total           320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3, accuracy (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) for each feature value

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.
[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, September 2003, pp. 8-11.
[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007 (2007) 1-12.
[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, 2005, pp. 197-200.
[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histogram in audio and symbolic music information retrieval, Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech and Language Processing 15 (4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.


Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 coefficients between the edges, and one coefficient above hiEdge)

Table 23 The frequency range of each normalized audio spectral envelope band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 62.5]
1                (62.5, 88.4]
2                (88.4, 125]
3                (125, 176.8]
4                (176.8, 250]
5                (250, 353.6]
6                (353.6, 500]
7                (500, 707.1]
8                (707.1, 1000]
9                (1000, 1414.2]
10               (1414.2, 2000]
11               (2000, 2828.4]
12               (2828.4, 4000]
13               (4000, 5656.9]
14               (5656.9, 8000]
15               (8000, 11313.7]
16               (11313.7, 16000]
17               (16000, 22050]
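The half-octave band edges in Table 23 follow a 2^(1/2) progression from loEdge = 62.5 Hz to hiEdge = 16 kHz (Fig 24). A sketch reproducing the interior edges; the function name is illustrative:

```python
def nase_band_edges(lo_edge=62.5, num_bands=16, r=0.5):
    """Edges of the MPEG-7 octave-based filter bank between loEdge and
    hiEdge: edge_k = lo_edge * 2**(k*r), giving half-octave bands for
    r = 1/2 (Table 23; the lowest and highest filters lie outside)."""
    return [lo_edge * 2 ** (k * r) for k in range(num_bands + 1)]
```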

214 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.

2141 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig 25 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1 Framing and MFCC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let MFCC_i[l], 0 le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \cdot W/2 + n}[l]\, e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le l < L    (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

M^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\ 0 \le l < L    (26)

where T is the total number of texture windows in the music track
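The texture-window analysis of Eqs. (25)-(26) can be sketched as follows. This is a minimal NumPy illustration under the stated settings (W = 512 frames, 50% overlap); the function name and the (frames × features) array layout are assumptions, and the track is assumed to contain at least W frames:

```python
import numpy as np

def modulation_spectrogram(features, W=512):
    """Averaged magnitude modulation spectrogram (Eqs. (25)-(26)).
    features: (I, L) array of per-frame feature values (e.g. MFCCs).
    Texture windows of W frames with 50% overlap; the FFT is taken
    along the time trajectory of each feature dimension."""
    I, L = features.shape
    hop = W // 2
    spectra = []
    for start in range(0, I - W + 1, hop):
        seg = features[start:start + W]                  # (W, L) texture window
        spectra.append(np.abs(np.fft.fft(seg, axis=0)))  # FFT per trajectory
    return np.mean(spectra, axis=0)                      # time average, (W, L)
```

A feature trajectory that oscillates k times per texture window shows up as a peak at modulation frequency index k.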

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{MFCC}(m, l)    (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{MFCC}(m, l)    (28)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 le j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)    (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
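Steps (27)-(29) reduce the averaged modulation spectrogram to per-subband peaks and valleys. A sketch, assuming W = 512 so that the subband edges of Table 24 map onto the FFT-bin indices listed below:

```python
import numpy as np

# Modulation subband edges as FFT-bin indices (Table 24, assuming W = 512).
SUBBAND_EDGES = [0, 2, 4, 8, 16, 32, 64, 128, 256]

def modulation_contrast(M):
    """Eqs. (27)-(29): per-subband peak (MSP), valley (MSV) and their
    difference (MSC) for every feature trajectory.
    M: (W, L) averaged modulation spectrogram; returns MSC and MSV, (J, L)."""
    L = M.shape[1]
    J = len(SUBBAND_EDGES) - 1
    msp = np.empty((J, L))
    msv = np.empty((J, L))
    for j in range(J):
        lo, hi = SUBBAND_EDGES[j], SUBBAND_EDGES[j + 1]
        msp[j] = M[lo:hi].max(axis=0)   # dominant rhythmic component
        msv[j] = M[lo:hi].min(axis=0)   # non-rhythmic floor
    return msp - msv, msv
```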

Fig 25 The flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig 26 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 le d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \cdot W/2 + n}[d]\, e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le d < D    (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

M^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D    (31)

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:


MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)    (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)    (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 le j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)    (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 The flowchart for extracting MOSC

29

2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig 27 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Step 2 Modulation Spectrum Analysis

Let be the d-th NASE of the i-th frame The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W

][dNASEi Dd ltle0

0 0 )()(1

0

2

)2( DdWmedNASEdmMW

n

mWnj

nWtt ltleltle= summinus

=

minus

+times

π (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

M^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D    (36)

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 24). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)    (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 le j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)    (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.


Fig 27 The flowchart for extracting MASE (music signal s_i[n] → framing → NASE extraction NASE_i[d] → DFT along each feature trajectory → windowing/average of the modulation spectra M_t^d[m] → contrast/valley determination)

Table 24 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0, 2)                              [0, 0.33)
1                [2, 4)                              [0.33, 0.66)
2                [4, 8)                              [0.66, 1.32)
3                [8, 16)                             [1.32, 2.64)
4                [16, 32)                            [2.64, 5.28)
5                [32, 64)                            [5.28, 10.56)
6                [64, 128)                           [10.56, 21.12)
7                [128, 256)                          [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu^{MSC}_{MFCC,row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma^{MSC}_{MFCC,row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{MFCC}(j, l) - \mu^{MSC}_{MFCC,row}(l) \big)^2 \right)^{1/2}    (41)

\mu^{MSV}_{MFCC,row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma^{MSV}_{MFCC,row}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{MFCC}(j, l) - \mu^{MSV}_{MFCC,row}(l) \big)^2 \right)^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f^{MFCC}_{row} = [\mu^{MSC}_{MFCC,row}(0), \sigma^{MSC}_{MFCC,row}(0), \mu^{MSV}_{MFCC,row}(0), \sigma^{MSV}_{MFCC,row}(0), \ldots, \mu^{MSC}_{MFCC,row}(L-1), \sigma^{MSC}_{MFCC,row}(L-1), \mu^{MSV}_{MFCC,row}(L-1), \sigma^{MSV}_{MFCC,row}(L-1)]^T    (44)

Similarly, the modulation spectral feature values derived from the j-th (0 le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu^{MSC}_{MFCC,col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma^{MSC}_{MFCC,col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSC^{MFCC}(j, l) - \mu^{MSC}_{MFCC,col}(j) \big)^2 \right)^{1/2}    (46)

\mu^{MSV}_{MFCC,col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma^{MSV}_{MFCC,col}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSV^{MFCC}(j, l) - \mu^{MSV}_{MFCC,col}(j) \big)^2 \right)^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{MFCC}_{col} = [\mu^{MSC}_{MFCC,col}(0), \sigma^{MSC}_{MFCC,col}(0), \mu^{MSV}_{MFCC,col}(0), \sigma^{MSV}_{MFCC,col}(0), \ldots, \mu^{MSC}_{MFCC,col}(J-1), \sigma^{MSC}_{MFCC,col}(J-1), \mu^{MSV}_{MFCC,col}(J-1), \sigma^{MSV}_{MFCC,col}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and column-based

modulation spectral feature vector are combined together a larger feature vector of

size (4D+4J) can be obtained

f MFCC= [( )MFCCrowf T ( )MFCC

colf T]T (50)

In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the

column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the

row-based modulation spectral feature vector and column-based modulation spectral

feature vector will result in a feature vector of length 4L+4J That is the overall

feature dimension of SMMFCC is 80+32 = 112
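A minimal sketch of how Eqs. (44), (49) and (50) assemble the final SMMFCC vector, assuming hypothetical MSC/MSV matrices of size L×J = 20×8:

```python
import numpy as np

rng = np.random.default_rng(1)
L, J = 20, 8
msc = rng.random((L, J))   # placeholder MSC matrix (Eq. 29)
msv = rng.random((L, J))   # placeholder MSV matrix (Eq. 28)

def interleaved_stats(a, b, axis):
    """[u_a(0), s_a(0), u_b(0), s_b(0), ..., u_a(n-1), ...] along `axis`,
    matching the ordering of Eqs. (44) and (49)."""
    cols = (a.mean(axis), a.std(axis), b.mean(axis), b.std(axis))
    return np.stack(cols, axis=1).ravel()

f_row = interleaved_stats(msc, msv, axis=1)   # Eq. (44), size 4L = 80
f_col = interleaved_stats(msc, msv, axis=0)   # Eq. (49), size 4J = 32
f_mfcc = np.concatenate([f_row, f_col])       # Eq. (50), size 4L + 4J = 112
assert f_mfcc.size == 112
```

The same aggregation, applied to the OSC and NASE modulation spectral matrices, yields SMOSC and SMASE below.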

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$\mu^{OSC}_{MSC,row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j,d)$$  (51)

$$\sigma^{OSC}_{MSC,row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j,d)-\mu^{OSC}_{MSC,row}(d)\right)^{2}\right)^{1/2}$$  (52)

$$\mu^{OSC}_{MSV,row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j,d)$$  (53)

$$\sigma^{OSC}_{MSV,row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j,d)-\mu^{OSC}_{MSV,row}(d)\right)^{2}\right)^{1/2}$$  (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}^{OSC}_{row} = \big[\mu^{OSC}_{MSC,row}(0),\ \sigma^{OSC}_{MSC,row}(0),\ \mu^{OSC}_{MSV,row}(0),\ \sigma^{OSC}_{MSV,row}(0),\ \ldots,\ \mu^{OSC}_{MSC,row}(D-1),\ \sigma^{OSC}_{MSC,row}(D-1),\ \mu^{OSC}_{MSV,row}(D-1),\ \sigma^{OSC}_{MSV,row}(D-1)\big]^{T}$$  (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu^{OSC}_{MSC,col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j,d)$$  (56)

$$\sigma^{OSC}_{MSC,col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j,d)-\mu^{OSC}_{MSC,col}(j)\right)^{2}\right)^{1/2}$$  (57)

$$\mu^{OSC}_{MSV,col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j,d)$$  (58)

$$\sigma^{OSC}_{MSV,col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j,d)-\mu^{OSC}_{MSV,col}(j)\right)^{2}\right)^{1/2}$$  (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}^{OSC}_{col} = \big[\mu^{OSC}_{MSC,col}(0),\ \sigma^{OSC}_{MSC,col}(0),\ \mu^{OSC}_{MSV,col}(0),\ \sigma^{OSC}_{MSV,col}(0),\ \ldots,\ \mu^{OSC}_{MSC,col}(J-1),\ \sigma^{OSC}_{MSC,col}(J-1),\ \mu^{OSC}_{MSV,col}(J-1),\ \sigma^{OSC}_{MSV,col}(J-1)\big]^{T}$$  (60)

If the row-based and the column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) can be obtained:

$$\mathbf{f}^{OSC} = \big[(\mathbf{f}^{OSC}_{row})^{T},\ (\mathbf{f}^{OSC}_{col})^{T}\big]^{T}$$  (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu^{NASE}_{MSC,row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j,d)$$  (62)

$$\sigma^{NASE}_{MSC,row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j,d)-\mu^{NASE}_{MSC,row}(d)\right)^{2}\right)^{1/2}$$  (63)

$$\mu^{NASE}_{MSV,row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j,d)$$  (64)

$$\sigma^{NASE}_{MSV,row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j,d)-\mu^{NASE}_{MSV,row}(d)\right)^{2}\right)^{1/2}$$  (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}^{NASE}_{row} = \big[\mu^{NASE}_{MSC,row}(0),\ \sigma^{NASE}_{MSC,row}(0),\ \mu^{NASE}_{MSV,row}(0),\ \sigma^{NASE}_{MSV,row}(0),\ \ldots,\ \mu^{NASE}_{MSC,row}(D-1),\ \sigma^{NASE}_{MSC,row}(D-1),\ \mu^{NASE}_{MSV,row}(D-1),\ \sigma^{NASE}_{MSV,row}(D-1)\big]^{T}$$  (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu^{NASE}_{MSC,col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j,d)$$  (67)

$$\sigma^{NASE}_{MSC,col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j,d)-\mu^{NASE}_{MSC,col}(j)\right)^{2}\right)^{1/2}$$  (68)

$$\mu^{NASE}_{MSV,col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j,d)$$  (69)

$$\sigma^{NASE}_{MSV,col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j,d)-\mu^{NASE}_{MSV,col}(j)\right)^{2}\right)^{1/2}$$  (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}^{NASE}_{col} = \big[\mu^{NASE}_{MSC,col}(0),\ \sigma^{NASE}_{MSC,col}(0),\ \mu^{NASE}_{MSV,col}(0),\ \sigma^{NASE}_{MSV,col}(0),\ \ldots,\ \mu^{NASE}_{MSC,col}(J-1),\ \sigma^{NASE}_{MSC,col}(J-1),\ \mu^{NASE}_{MSV,col}(J-1),\ \sigma^{NASE}_{MSV,col}(J-1)\big]^{T}$$  (71)

If the row-based and the column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) can be obtained:

$$\mathbf{f}^{NASE} = \big[(\mathbf{f}^{NASE}_{row})^{T},\ (\mathbf{f}^{NASE}_{col})^{T}\big]^{T}$$  (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMASE is 76 + 32 = 108.

Fig. 2.8 The row-based modulation spectral aggregation: within a texture window, the MSC(j, d) and MSV(j, d) values of each feature dimension are collected across the modulation-frequency subbands and summarized by a per-row mean (μ_row) and standard deviation (σ_row).

Fig. 2.9 The column-based modulation spectral aggregation: within a texture window, the MSC/MSV values of each modulation-frequency subband are collected across the feature dimensions and summarized by a per-column mean (μ_col) and standard deviation (σ_col).

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$$\bar{\mathbf{f}}_{c} = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n}$$  (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

$$\hat{f}_{c}(m) = \frac{\bar{f}_{c}(m)-f_{\min}(m)}{f_{\max}(m)-f_{\min}(m)},\qquad 1 \le c \le C$$  (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values over all training music signals:

$$f_{\max}(m) = \max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m),\qquad f_{\min}(m) = \min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m)$$  (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
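A small sketch of the normalization pipeline in Eqs. (73)-(75), using hypothetical random training data (`train` maps a genre id to an N_c × M array of feature vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical training data: 3 genres, 10 tracks each, 5 feature dimensions.
train = {c: rng.random((10, 5)) * (c + 1) for c in range(3)}

all_vecs = np.vstack(list(train.values()))
f_min = all_vecs.min(axis=0)                      # Eq. (75), per-dimension min
f_max = all_vecs.max(axis=0)                      # Eq. (75), per-dimension max

reps = {c: x.mean(axis=0) for c, x in train.items()}                     # Eq. (73)
reps_norm = {c: (f - f_min) / (f_max - f_min) for c, f in reps.items()}  # Eq. (74)
```

Because each representative vector is a mean of training vectors, every normalized component falls inside [0, 1] by construction.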

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification


accuracy at a lower dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix, respectively. The within-class scatter matrix is defined as

$$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c}(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_{c})(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_{c})^{T}$$  (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_B = \sum_{c=1}^{C} N_c(\bar{\mathbf{x}}_{c}-\bar{\mathbf{x}})(\bar{\mathbf{x}}_{c}-\bar{\mathbf{x}})^{T}$$  (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(\mathbf{A}) = \mathrm{tr}\left((\mathbf{A}^{T}\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^{T}\mathbf{S}_B\mathbf{A})\right)$$  (78)

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space. In this study a whitening procedure is integrated with the LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

$$\mathbf{x}_{w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{x}$$  (79)

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_{W,w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_W(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix $\mathbf{S}_{B,w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_B(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B,w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

$$\mathbf{A}_{WLDA} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi}$$  (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^{T}\mathbf{x}$$  (81)
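The whitened LDA procedure of Eqs. (76)-(80) can be sketched as follows. This is a minimal NumPy implementation under the assumption that S_W is nonsingular, not the exact code used in the thesis:

```python
import numpy as np

def whitened_lda(X, y, n_components):
    """Whiten S_W, then diagonalize the whitened S_B (Eqs. 76-80).

    X: (N, H) training matrix; y: (N,) integer class labels.
    Returns A of shape (H, n_components) so that features = A.T @ x (Eq. 81).
    """
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)             # Eq. (76)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                 # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                 # Sw = Phi diag(lam) Phi^T
    W = Phi @ np.diag(1.0 / np.sqrt(lam))         # Phi Lambda^{-1/2}, whitening
    Sb_w = W.T @ Sb @ W                           # whitened between-class scatter
    mu, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(mu)[::-1][:n_components]]  # top eigenvectors
    return W @ Psi                                # Eq. (80)

# Hypothetical usage on random 3-class, 4-dimensional data:
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 1.0, (30, 4)) for c in range(3)])
y = np.repeat(np.arange(3), 30)
A = whitened_lda(X, y, n_components=2)
```

In practice, keeping at most C−1 components (here 2 for 3 classes) retains all the discriminative directions of S_B.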

2.3 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA

transformed feature vector In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

$$\bar{\mathbf{y}}_{c} = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n}$$  (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

$$s = \arg\min_{1\le c\le C} d(\mathbf{y},\ \bar{\mathbf{y}}_{c})$$  (83)
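A sketch of the nearest centroid classifier of Eqs. (82)-(83), with hypothetical whitened-LDA-transformed training data (6 genres, well separated for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical transformed training vectors: class c clusters around c * 5.
Y_train = np.vstack([rng.normal(c * 5.0, 0.5, (20, 3)) for c in range(6)])
labels = np.repeat(np.arange(6), 20)

# Eq. (82): one representative centroid per genre.
centroids = np.vstack([Y_train[labels == c].mean(axis=0) for c in range(6)])

def classify(y_vec):
    # Eq. (83): pick the genre whose centroid is nearest in Euclidean distance.
    return int(np.argmin(np.linalg.norm(centroids - y_vec, axis=1)))
```

For example, a test vector near (10, 10, 10) is assigned to class 2, whose centroid lies around 2 × 5 = 10 in every dimension.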

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this

study each MP3 audio file is first converted into raw digital audio before

classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the


music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified tracks is evaluated as the prior-weighted sum of the per-class accuracies:

$$CA = \sum_{1\le c\le C} P_c \cdot CA_c$$  (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
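Eq. (84) amounts to a class-prior-weighted average. A sketch using the test-set class sizes above and purely illustrative (hypothetical) per-class accuracies:

```python
# Test-set class sizes from the ISMIR2004 database split described above.
n_test = {"Classical": 320, "Electronic": 114, "JazzBlue": 26,
          "MetalPunk": 45, "RockPop": 102, "World": 122}
# Hypothetical per-class accuracies (fractions), for illustration only.
ca = {"Classical": 0.94, "Electronic": 0.84, "JazzBlue": 0.77,
      "MetalPunk": 0.76, "RockPop": 0.78, "World": 0.76}

total = sum(n_test.values())                 # 729 test tracks
# Eq. (84): P_c = n_test[c] / total, overall CA = sum of P_c * CA_c.
overall = sum((n_test[c] / total) * ca[c] for c in n_test)
```

Weighting by P_c makes the large Classical class dominate the overall score, which is why per-class confusion matrices are also reported below.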

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1 and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each part, the upper matrix lists track counts (the Total row gives the number of test tracks per genre) and the lower matrix the corresponding percentages (%).

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        275        0        2       0         1       19
Electronic       0       91        0       1         7        6
Jazz             6        0       18       0         0        4
MetalPunk        2        3        0      36        20        4
PopRock          4       12        5       8        70       14
World           33        8        1       0         4       75
Total          320      114       26      45       102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.94      0.00     7.69     0.00      0.98   15.57
Electronic    0.00     79.82     0.00     2.22      6.86    4.92
Jazz          1.88      0.00    69.23     0.00      0.00    3.28
MetalPunk     0.63      2.63     0.00    80.00     19.61    3.28
PopRock       1.25     10.53    19.23    17.78     68.63   11.48
World        10.31      7.02     3.85     0.00      3.92   61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        292        1        1       0         2       10
Electronic       1       89        1       2        11       11
Jazz             4        0       19       1         1        6
MetalPunk        0        5        0      32        21        3
PopRock          0       13        3      10        61        8
World           23        6        2       0         6       84
Total          320      114       26      45       102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      91.25      0.88     3.85     0.00      1.96    8.20
Electronic    0.31     78.07     3.85     4.44     10.78    9.02
Jazz          1.25      0.00    73.08     2.22      0.98    4.92
MetalPunk     0.00      4.39     0.00    71.11     20.59    2.46
PopRock       0.00     11.40    11.54    22.22     59.80    6.56
World         7.19      5.26     7.69     0.00      5.88   68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        286        3        1       0         3       18
Electronic       0       87        1       1         9        5
Jazz             5        4       17       0         0        9
MetalPunk        0        4        1      36        18        4
PopRock          1       10        3       7        68       13
World           28        6        3       1         4       73
Total          320      114       26      45       102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      89.38      2.63     3.85     0.00      2.94   14.75
Electronic    0.00     76.32     3.85     2.22      8.82    4.10
Jazz          1.56      3.51    65.38     0.00      0.00    7.38
MetalPunk     0.00      3.51     3.85    80.00     17.65    3.28
PopRock       0.31      8.77    11.54    15.56     66.67   10.66
World         8.75      5.26    11.54     2.22      3.92   59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        1       0         0        9
Electronic       0       96        1       1         9        9
Jazz             2        1       21       0         0        1
MetalPunk        0        1        0      34         8        1
PopRock          1        9        2       9        80       16
World           17        7        1       1         5       86
Total          320      114       26      45       102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     3.85     0.00      0.00    7.38
Electronic    0.00     84.21     3.85     2.22      8.82    7.38
Jazz          0.63      0.88    80.77     0.00      0.00    0.82
MetalPunk     0.00      0.88     0.00    75.56      7.84    0.82
PopRock       0.31      7.89     7.69    20.00     78.43   13.11
World         5.31      6.14     3.85     2.22      4.90   70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2 and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 3.3 we can see that SMASE2 gives better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each part, the upper matrix lists track counts (the Total row gives the number of test tracks per genre) and the lower matrix the corresponding percentages (%).

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1        1       0         6       22
Electronic       0       84        0       2         8        4
Jazz            13        1       19       1         2       19
MetalPunk        2        7        0      39        30        4
PopRock          0       11        3       3        47       19
World           33       10        3       0         9       54
Total          320      114       26      45       102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.00      0.88     3.85     0.00      5.88   18.03
Electronic    0.00     73.68     0.00     4.44      7.84    3.28
Jazz          4.06      0.88    73.08     2.22      1.96   15.57
MetalPunk     0.63      6.14     0.00    86.67     29.41    3.28
PopRock       0.00      9.65    11.54     6.67     46.08   15.57
World        10.31      8.77    11.54     0.00      8.82   44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2        0       0         3       33
Electronic       0       83        0       1         9        6
Jazz            17        1       20       0         6       20
MetalPunk        1        5        0      33        21        2
PopRock          0       17        4      10        51       10
World           40        6        2       1        12       51
Total          320      114       26      45       102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      81.88      1.75     0.00     0.00      2.94   27.05
Electronic    0.00     72.81     0.00     2.22      8.82    4.92
Jazz          5.31      0.88    76.92     0.00      5.88   16.39
MetalPunk     0.31      4.39     0.00    73.33     20.59    1.64
PopRock       0.00     14.91    15.38    22.22     50.00    8.20
World        12.50      5.26     7.69     2.22     11.76   41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0        0       0         2       29
Electronic       0       83        0       1         5        2
Jazz             9        3       17       1         2       15
MetalPunk        1        5        1      35        24        7
PopRock          2       13        1       8        57       15
World           31       10        7       0        12       54
Total          320      114       26      45       102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      86.56      0.00     0.00     0.00      1.96   23.77
Electronic    0.00     72.81     0.00     2.22      4.90    1.64
Jazz          2.81      2.63    65.38     2.22      1.96   12.30
MetalPunk     0.31      4.39     3.85    77.78     23.53    5.74
PopRock       0.63     11.40     3.85    17.78     55.88   12.30
World         9.69      8.77    26.92     0.00     11.76   44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5        0       0         3       18
Electronic       0       89        0       2         4        4
Jazz             2        3       19       0         1       10
MetalPunk        2        2        0      38        21        2
PopRock          0       12        5       4        61       11
World           27        3        2       1        12       77
Total          320      114       26      45       102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      90.31      4.39     0.00     0.00      2.94   14.75
Electronic    0.00     78.07     0.00     4.44      3.92    3.28
Jazz          0.63      2.63    73.08     0.00      0.98    8.20
MetalPunk     0.63      1.75     0.00    84.44     20.59    1.64
PopRock       0.00     10.53    19.23     8.89     59.80    9.02
World         8.44      2.63     7.69     2.22     11.76   63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC, OSC and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector gets better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each part, the upper matrix lists track counts (the Total row gives the number of test tracks per genre) and the lower matrix the corresponding percentages (%).

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        1       0         3       19
Electronic       0       86        0       1         7        5
Jazz             2        0       18       0         0        3
MetalPunk        1        4        0      35        18        2
PopRock          1       16        4       8        67       13
World           16        6        3       1         7       80
Total          320      114       26      45       102      122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     3.85     0.00      2.94   15.57
Electronic    0.00     75.44     0.00     2.22      6.86    4.10
Jazz          0.63      0.00    69.23     0.00      0.00    2.46
MetalPunk     0.31      3.51     0.00    77.78     17.65    1.64
PopRock       0.31     14.04    15.38    17.78     65.69   10.66
World         5.00      5.26    11.54     2.22      6.86   65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        0       0         1       13
Electronic       0       90        1       2         9        6
Jazz             0        0       21       0         0        4
MetalPunk        0        2        0      31        21        2
PopRock          0       11        3      10        64       10
World           20       11        1       2         7       87
Total          320      114       26      45       102      122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     0.00     0.00      0.98   10.66
Electronic    0.00     78.95     3.85     4.44      8.82    4.92
Jazz          0.00      0.00    80.77     0.00      0.00    3.28
MetalPunk     0.00      1.75     0.00    68.89     20.59    1.64
PopRock       0.00      9.65    11.54    22.22     62.75    8.20
World         6.25      9.65     3.85     4.44      6.86   71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2        1       0         0       17
Electronic       1       91        0       1         4        3
Jazz             0        2       19       0         0        5
MetalPunk        0        2        1      34        20        8
PopRock          2       13        4       8        71        8
World           21        4        1       2         7       81
Total          320      114       26      45       102      122

(c) SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      92.50      1.75     3.85     0.00      0.00   13.93
Electronic    0.31     79.82     0.00     2.22      3.92    2.46
Jazz          0.00      1.75    73.08     0.00      0.00    4.10
MetalPunk     0.00      1.75     3.85    75.56     19.61    6.56
PopRock       0.63     11.40    15.38    17.78     69.61    6.56
World         6.56      3.51     3.85     4.44      6.86   66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        0       0         0        8
Electronic       2       95        0       2         7        9
Jazz             1        1       20       0         0        0
MetalPunk        0        0        0      35        10        1
PopRock          1       10        3       7        79       11
World           16        6        3       1         6       93
Total          320      114       26      45       102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     0.00     0.00      0.00    6.56
Electronic    0.63     83.33     0.00     4.44      6.86    7.38
Jazz          0.31      0.88    76.92     0.00      0.00    0.00
MetalPunk     0.00      0.00     0.00    77.78      9.80    0.82
PopRock       0.31      8.77    11.54    15.56     77.45    9.02
World         5.00      5.26    11.54     2.22      5.88   76.23

Conventional methods use the energy of each modulation subband as the feature value; in contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2 and SMMFCC3 denote respectively the row-based, column-based and combined feature vectors derived from modulation spectral analysis of MFCC (and likewise for OSC and NASE).

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSCs & MSVs and the modulation subband energy (MSE) for each feature value

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                             77.50      72.02
SMMFCC2                             70.64      69.82
SMMFCC3                             80.38      79.15
SMOSC1                              79.15      77.50
SMOSC2                              68.59      70.51
SMOSC3                              81.34      80.11
SMASE1                              77.78      76.41
SMASE2                              71.74      71.06
SMASE3                              81.21      79.15
SMMFCC1+SMOSC1+SMASE1               84.64      85.08
SMMFCC2+SMOSC2+SMASE2               78.60      79.01
SMMFCC3+SMOSC3+SMASE3               85.32      85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectral/cepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "'The way it sounds': timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio features," IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histograms in audio and symbolic music information retrieval," Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, March 2006.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.
[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.


Table 2.3 The frequency range of each normalized audio spectral envelope band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 62]
1                (62, 88]
2                (88, 125]
3                (125, 176]
4                (176, 250]
5                (250, 353]
6                (353, 500]
7                (500, 707]
8                (707, 1000]
9                (1000, 1414]
10               (1414, 2000]
11               (2000, 2828]
12               (2828, 4000]
13               (4000, 5656]
14               (5656, 8000]
15               (8000, 11313]
16               (11313, 16000]
17               (16000, 22050]
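The band edges in Table 2.3 follow (approximately) a half-octave progression: each upper edge is √2 times the previous one, from about 62 Hz up to the 22050 Hz Nyquist limit of 44.1 kHz audio. A short sketch that regenerates them (an observation about the table, not code from the thesis; small rounding differences from the printed values are expected):

```python
# Regenerate the approximate filter edges of Table 2.3: start near 62.5 Hz
# and multiply by sqrt(2) until the Nyquist frequency (22050 Hz) is reached.
edges = [0]
f = 62.5
while f < 22050:
    edges.append(round(f))
    f *= 2 ** 0.5
edges.append(22050)
# edges ≈ [0, 62, 88, 125, 177, 250, 354, 500, 707, 1000, ...]
```

This yields 18 band-pass intervals, matching filter numbers 0-17 in the table.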

2.1.4 Modulation Spectral Analysis

MFCC OSC and NASE capture only short-term frame-based characteristics of

audio signals. In order to capture the time-varying behavior of the music signals, we

employ modulation spectral analysis on MFCC OSC and NASE to observe the

variations of the sound

2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC modulation spectral analysis is

applied on the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC

and the detailed steps will be described below

Step 1 Framing and MFCC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the MFCC coefficients of each frame

Step 2 Modulation Spectrum Analysis


Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \left|\sum_{n=0}^{W-1} MFCC_{t\times(W/2)+n}[l]\, e^{-j2\pi mn/W}\right|,\qquad 0 \le m < W,\ 0 \le l < L$$  (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, l),\qquad 0 \le m < W,\ 0 \le l < L$$  (26)

where T is the total number of texture windows in the music track.
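The texture-window analysis of Eqs. (25)-(26) can be sketched as follows, using a hypothetical random matrix of per-frame MFCC trajectories in place of real features:

```python
import numpy as np

rng = np.random.default_rng(5)
W, L = 512, 20
feats = rng.random((2048, L))   # hypothetical per-frame MFCCs (frames x L)

hop = W // 2                    # 50% overlap between texture windows
starts = range(0, feats.shape[0] - W + 1, hop)
# Eq. (25): FFT magnitude along each feature trajectory in every window.
mags = [np.abs(np.fft.fft(feats[s:s + W], axis=0)) for s in starts]
# Eq. (26): time-average of the magnitude modulation spectrograms.
mod_spec = np.mean(mags, axis=0)
assert mod_spec.shape == (W, L)
```

Each column of `mod_spec` is the averaged modulation spectrum of one MFCC coefficient, which Step 3 below decomposes into modulation subbands.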

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{MFCC}(j,l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m,l) \qquad (27)$$

$$MSV^{MFCC}(j,l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m,l) \qquad (28)$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the $j$-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{MFCC}(j,l) = MSP^{MFCC}(j,l) - MSV^{MFCC}(j,l) \qquad (29)$$

As a result, all MSCs (or MSVs) form an $L \times J$ matrix containing the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.

Fig 25 the flowchart for extracting MMFCC

2142 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig 26 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let $OSC_i[d]$, $0 \le d < D$, be the $d$-th OSC feature value of the $i$-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length $W$:

$$M_t(m,d) = \left|\sum_{n=0}^{W-1} OSC_{t \times W/2 + n}[d]\; e^{-j2\pi nm/W}\right|, \quad 0 \le m < W,\; 0 \le d < D \qquad (30)$$

where $M_t(m,d)$ is the modulation spectrogram for the $t$-th texture window, $m$ is the modulation frequency index, and $d$ is the OSC coefficient index. In this study $W$ is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{OSC}(m,d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m,d), \quad 0 \le m < W,\; 0 \le d < D \qquad (31)$$

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:


$$MSP^{OSC}(j,d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m,d) \qquad (32)$$

$$MSV^{OSC}(j,d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m,d) \qquad (33)$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the $j$-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{OSC}(j,d) = MSP^{OSC}(j,d) - MSV^{OSC}(j,d) \qquad (34)$$

As a result, all MSCs (or MSVs) form a $D \times J$ matrix containing the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.

Fig 26 the flowchart for extracting MOSC


2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig 27 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let $NASE_i[d]$, $0 \le d < D$, be the $d$-th NASE feature value of the $i$-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length $W$:

$$M_t(m,d) = \left|\sum_{n=0}^{W-1} NASE_{t \times W/2 + n}[d]\; e^{-j2\pi nm/W}\right|, \quad 0 \le m < W,\; 0 \le d < D \qquad (35)$$

where $M_t(m,d)$ is the modulation spectrogram for the $t$-th texture window, $m$ is the modulation frequency index, and $d$ is the NASE coefficient index. In this study $W$ is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{NASE}(m,d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m,d), \quad 0 \le m < W,\; 0 \le d < D \qquad (36)$$

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 24). In this study the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{NASE}(j,d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m,d) \qquad (37)$$

$$MSV^{NASE}(j,d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m,d) \qquad (38)$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the $j$-th modulation subband, $0 \le j < J$. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{NASE}(j,d) = MSP^{NASE}(j,d) - MSV^{NASE}(j,d) \qquad (39)$$

As a result, all MSCs (or MSVs) form a $D \times J$ matrix containing the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.



Fig 27 the flowchart for extracting MASE

Table 24 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2)    | [0, 0.33)
1 | [2, 4)    | [0.33, 0.66)
2 | [4, 8)    | [0.66, 1.32)
3 | [8, 16)   | [1.32, 2.64)
4 | [16, 32)  | [2.64, 5.28)
5 | [32, 64)  | [5.28, 10.56)
6 | [64, 128) | [10.56, 21.12)
7 | [128, 256)| [21.12, 42.24]

215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the $l$-th ($0 \le l < L$) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j,l) \qquad (40)$$

$$\sigma_{MSC\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{MFCC}(j,l) - \mu_{MSC\text{-}row}^{MFCC}(l)\bigr)^2\right)^{1/2} \qquad (41)$$

$$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j,l) \qquad (42)$$

$$\sigma_{MSV\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{MFCC}(j,l) - \mu_{MSV\text{-}row}^{MFCC}(l)\bigr)^2\right)^{1/2} \qquad (43)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$f_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0),\, \sigma_{MSC\text{-}row}^{MFCC}(0),\, \mu_{MSV\text{-}row}^{MFCC}(0),\, \sigma_{MSV\text{-}row}^{MFCC}(0),\, \ldots,\, \mu_{MSC\text{-}row}^{MFCC}(L-1),\, \sigma_{MSC\text{-}row}^{MFCC}(L-1),\, \mu_{MSV\text{-}row}^{MFCC}(L-1),\, \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T \qquad (44)$$

Similarly, the modulation spectral feature values derived from the $j$-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j,l) \qquad (45)$$

$$\sigma_{MSC\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSC^{MFCC}(j,l) - \mu_{MSC\text{-}col}^{MFCC}(j)\bigr)^2\right)^{1/2} \qquad (46)$$

$$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j,l) \qquad (47)$$

$$\sigma_{MSV\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSV^{MFCC}(j,l) - \mu_{MSV\text{-}col}^{MFCC}(j)\bigr)^2\right)^{1/2} \qquad (48)$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0),\, \sigma_{MSC\text{-}col}^{MFCC}(0),\, \mu_{MSV\text{-}col}^{MFCC}(0),\, \sigma_{MSV\text{-}col}^{MFCC}(0),\, \ldots,\, \mu_{MSC\text{-}col}^{MFCC}(J-1),\, \sigma_{MSC\text{-}col}^{MFCC}(J-1),\, \mu_{MSV\text{-}col}^{MFCC}(J-1),\, \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T \qquad (49)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4L+4J) can be obtained:

$$f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T \qquad (50)$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
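The row/column statistics of eqs (40)-(49) reduce the L×J (or D×J) MSC and MSV matrices to a 4L+4J-dimensional vector. A minimal sketch follows; the entry ordering differs from eq (44) (grouped by statistic rather than interleaved per feature), but the content is identical, and `np.std` uses the same population form as eqs (41) and (46):

```python
import numpy as np

def aggregate(msc, msv):
    """Row/column statistics of the MSC and MSV matrices (eqs 40-50).

    msc, msv: (J, L) matrices (J modulation subbands, L feature dims).
    Returns the combined vector [f_row^T, f_col^T]^T of length 4L + 4J."""
    f_row = np.concatenate([                 # size 4L: stats along subbands
        msc.mean(axis=0), msc.std(axis=0),   # eqs (40), (41)
        msv.mean(axis=0), msv.std(axis=0)])  # eqs (42), (43)
    f_col = np.concatenate([                 # size 4J: stats along features
        msc.mean(axis=1), msc.std(axis=1),   # eqs (45), (46)
        msv.mean(axis=1), msv.std(axis=1)])  # eqs (47), (48)
    return np.concatenate([f_row, f_col])    # eq (50)
```

For the MFCC case (J = 8, L = 20) the result has the expected 4×20 + 4×8 = 112 entries.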

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the $d$-th ($0 \le d < D$) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j,d) \qquad (51)$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j,d) - \mu_{MSC\text{-}row}^{OSC}(d)\bigr)^2\right)^{1/2} \qquad (52)$$

$$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j,d) \qquad (53)$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j,d) - \mu_{MSV\text{-}row}^{OSC}(d)\bigr)^2\right)^{1/2} \qquad (54)$$


Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$f_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0),\, \sigma_{MSC\text{-}row}^{OSC}(0),\, \mu_{MSV\text{-}row}^{OSC}(0),\, \sigma_{MSV\text{-}row}^{OSC}(0),\, \ldots,\, \mu_{MSC\text{-}row}^{OSC}(D-1),\, \sigma_{MSC\text{-}row}^{OSC}(D-1),\, \mu_{MSV\text{-}row}^{OSC}(D-1),\, \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T \qquad (55)$$

Similarly, the modulation spectral feature values derived from the $j$-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j,d) \qquad (56)$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j,d) - \mu_{MSC\text{-}col}^{OSC}(j)\bigr)^2\right)^{1/2} \qquad (57)$$

$$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j,d) \qquad (58)$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j,d) - \mu_{MSV\text{-}col}^{OSC}(j)\bigr)^2\right)^{1/2} \qquad (59)$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0),\, \sigma_{MSC\text{-}col}^{OSC}(0),\, \mu_{MSV\text{-}col}^{OSC}(0),\, \sigma_{MSV\text{-}col}^{OSC}(0),\, \ldots,\, \mu_{MSC\text{-}col}^{OSC}(J-1),\, \sigma_{MSC\text{-}col}^{OSC}(J-1),\, \mu_{MSV\text{-}col}^{OSC}(J-1),\, \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T \qquad (60)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D+4J) can be obtained:

$$f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T \qquad (61)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the $d$-th ($0 \le d < D$) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j,d) \qquad (62)$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{NASE}(j,d) - \mu_{MSC\text{-}row}^{NASE}(d)\bigr)^2\right)^{1/2} \qquad (63)$$

$$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j,d) \qquad (64)$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{NASE}(j,d) - \mu_{MSV\text{-}row}^{NASE}(d)\bigr)^2\right)^{1/2} \qquad (65)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$f_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0),\, \sigma_{MSC\text{-}row}^{NASE}(0),\, \mu_{MSV\text{-}row}^{NASE}(0),\, \sigma_{MSV\text{-}row}^{NASE}(0),\, \ldots,\, \mu_{MSC\text{-}row}^{NASE}(D-1),\, \sigma_{MSC\text{-}row}^{NASE}(D-1),\, \mu_{MSV\text{-}row}^{NASE}(D-1),\, \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T \qquad (66)$$

Similarly, the modulation spectral feature values derived from the $j$-th ($0 \le j < J$) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j,d) \qquad (67)$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{NASE}(j,d) - \mu_{MSC\text{-}col}^{NASE}(j)\bigr)^2\right)^{1/2} \qquad (68)$$

$$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j,d) \qquad (69)$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{NASE}(j,d) - \mu_{MSV\text{-}col}^{NASE}(j)\bigr)^2\right)^{1/2} \qquad (70)$$


Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$f_{col}^{NASE} = [\mu_{MSC\text{-}col}^{NASE}(0),\, \sigma_{MSC\text{-}col}^{NASE}(0),\, \mu_{MSV\text{-}col}^{NASE}(0),\, \sigma_{MSV\text{-}col}^{NASE}(0),\, \ldots,\, \mu_{MSC\text{-}col}^{NASE}(J-1),\, \sigma_{MSC\text{-}col}^{NASE}(J-1),\, \mu_{MSV\text{-}col}^{NASE}(J-1),\, \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T \qquad (71)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D+4J) can be obtained:

$$f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T \qquad (72)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


Fig 28 the row-based modulation spectral feature values: per-row mean μ_row and standard deviation σ_row of the MSC/MSV matrix (figure axes: modulation frequency, feature dimension, texture window)

Fig 29 the column-based modulation spectral feature values: per-column mean μ_col and standard deviation σ_col of the MSC/MSV matrix (figure axes: modulation frequency, feature dimension, texture window)


216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of all training music signals of the same genre:

$$\bar{f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n} \qquad (73)$$

where $f_{c,n}$ denotes the feature vector of the $n$-th music signal belonging to the $c$-th music genre, $\bar{f}_c$ is the representative feature vector for the $c$-th music genre, and $N_c$ is the number of training music signals belonging to the $c$-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{f}_c$:

$$\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \qquad (74)$$

where $C$ is the number of classes, $f_c(m)$ denotes the $m$-th feature value of the $c$-th representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the $m$-th feature values over all training music signals:

$$f_{max}(m) = \max_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m) \qquad (75)$$

where $f_{c,j}(m)$ denotes the $m$-th feature value of the $j$-th training music piece belonging to the $c$-th music genre.

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among the music classes.

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

$$S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T \qquad (76)$$

where $x_{c,n}$ is the $n$-th feature vector labeled as class $c$, $\bar{x}_c$ is the mean vector of class $c$, $C$ is the total number of music classes, and $N_c$ is the number of training vectors labeled as class $c$. The between-class scatter matrix is given by

$$S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T \qquad (77)$$

where $\bar{x}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(A) = \mathrm{tr}\bigl((A^T S_W A)^{-1}(A^T S_B A)\bigr) \qquad (78)$$

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $S_W$ are calculated. Let $\Phi$ denote the matrix whose columns are the orthonormal eigenvectors of $S_W$, and $\Lambda$ the diagonal matrix formed by the corresponding eigenvalues; thus $S_W\Phi = \Phi\Lambda$. Each training vector $x$ is then whitening transformed by $\Phi\Lambda^{-1/2}$:

$$x_w = (\Phi\Lambda^{-1/2})^T x \qquad (79)$$

It can be shown that the whitened within-class scatter matrix $S_W^w = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix $I$. Thus the whitened between-class scatter matrix $S_B^w = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2})$ contains all the discriminative information. A transformation matrix $\Psi$ can be determined by finding the eigenvectors of $S_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the $(C-1)$ largest eigenvalues form the column vectors of the transformation matrix $\Psi$. Finally, the optimal whitened LDA transformation matrix $A_{WLDA}$ is defined as

$$A_{WLDA} = \Phi\Lambda^{-1/2}\Psi \qquad (80)$$

$A_{WLDA}$ is employed to transform each $H$-dimensional feature vector into a lower $h$-dimensional vector. Let $x$ denote the $H$-dimensional feature vector; the reduced $h$-dimensional feature vector $y$ is computed by

$$y = A_{WLDA}^T x \qquad (81)$$
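A compact sketch of the whitened LDA of eqs (76)-(81) follows, assuming numerically well-behaved scatter matrices; the small floor on the eigenvalues is our safeguard against a singular $\Lambda$, not part of the derivation:

```python
import numpy as np

def whitened_lda(X, y, C):
    """Whitened LDA transform A_WLDA = Phi Lambda^{-1/2} Psi (eqs 76-81).

    X: (N, H) training vectors, y: integer class labels in [0, C).
    Returns A of shape (H, C-1); project with  z = A.T @ x  (eq 81)."""
    mean_all = X.mean(axis=0)
    Sw = np.zeros((X.shape[1],) * 2)
    Sb = np.zeros_like(Sw)
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                    # eq (76)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                        # eq (77)
    lam, Phi = np.linalg.eigh(Sw)                        # Sw Phi = Phi Lambda
    W = Phi @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-12)))  # Phi Lam^-1/2
    Sb_w = W.T @ Sb @ W                                  # whitened Sb
    ev, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(ev)[::-1][: C - 1]]          # top C-1 eigenvectors
    return W @ Psi                                       # eq (80)
```

Because the whitened $S_W^w$ is the identity, the eigendecomposition of $S_B^w$ alone suffices, which is the design rationale for the whitening step.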

23 Music Genre Classification Phase

In the classification phase, the row-based and column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by the whitened LDA transformation matrix $A_{WLDA}$. Let $y$ denote the whitened LDA transformed feature vector. In this study the nearest centroid classifier is used for music genre classification. For the $c$-th ($1 \le c \le C$) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the $c$-th music genre is regarded as its representative feature vector:

$$\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n} \qquad (82)$$

where $y_{c,n}$ denotes the whitened LDA transformed feature vector of the $n$-th music track labeled as the $c$-th music genre, $\bar{y}_c$ is the representative feature vector of the $c$-th music genre, and $N_c$ is the number of training music tracks labeled as the $c$-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code $s$ that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to $y$:

$$s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) \qquad (83)$$

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study each MP3 audio file is first converted into raw digital audio before classification. The music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks are not equally distributed across classes, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \qquad (84)$$

where $P_c$ is the probability of appearance of the $c$-th music genre and $CA_c$ is the classification accuracy for the $c$-th music genre.
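Eq (84) is a prior-weighted average of the per-class accuracies. For example, with two classes of 10 and 30 test tracks and per-class accuracies 1.0 and 0.5, CA = 0.25 · 1.0 + 0.75 · 0.5 = 0.625. A minimal sketch:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Overall accuracy as a prior-weighted sum (eq 84).

    per_class_acc[c] is CA_c; class_counts[c] is the number of test
    tracks of genre c, so P_c = class_counts[c] / total."""
    total = sum(class_counts)
    return sum(n / total * ca for ca, n in zip(per_class_acc, class_counts))
```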

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA%) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64


Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1

(a) SMMFCC1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           275           0     2           0         1     19
Electronic          0          91     0           1         7      6
Jazz                6           0    18           0         0      4
Metal/Punk          2           3     0          36        20      4
Pop/Rock            4          12     5           8        70     14
World              33           8     1           0         4     75
Total             320         114    26          45       102    122

(a) SMMFCC1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         85.94        0.00  7.69        0.00      0.98  15.57
Electronic       0.00       79.82  0.00        2.22      6.86   4.92
Jazz             1.88        0.00 69.23        0.00      0.00   3.28
Metal/Punk       0.63        2.63  0.00       80.00     19.61   3.28
Pop/Rock         1.25       10.53 19.23       17.78     68.63  11.48
World           10.31        7.02  3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           292           1     1           0         2     10
Electronic          1          89     1           2        11     11
Jazz                4           0    19           1         1      6
Metal/Punk          0           5     0          32        21      3
Pop/Rock            0          13     3          10        61      8
World              23           6     2           0         6     84
Total             320         114    26          45       102    122

(b) SMOSC1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         91.25        0.88  3.85        0.00      1.96   8.20
Electronic       0.31       78.07  3.85        4.44     10.78   9.02
Jazz             1.25        0.00 73.08        2.22      0.98   4.92
Metal/Punk       0.00        4.39  0.00       71.11     20.59   2.46
Pop/Rock         0.00       11.40 11.54       22.22     59.80   6.56
World            7.19        5.26  7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           286           3     1           0         3     18
Electronic          0          87     1           1         9      5
Jazz                5           4    17           0         0      9
Metal/Punk          0           4     1          36        18      4
Pop/Rock            1          10     3           7        68     13
World              28           6     3           1         4     73
Total             320         114    26          45       102    122

(c) SMASE1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         89.38        2.63  3.85        0.00      2.94  14.75
Electronic       0.00       76.32  3.85        2.22      8.82   4.10
Jazz             1.56        3.51 65.38        0.00      0.00   7.38
Metal/Punk       0.00        3.51  3.85       80.00     17.65   3.28
Pop/Rock         0.31        8.77 11.54       15.56     66.67  10.66
World            8.75        5.26 11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           300           0     1           0         0      9
Electronic          0          96     1           1         9      9
Jazz                2           1    21           0         0      1
Metal/Punk          0           1     0          34         8      1
Pop/Rock            1           9     2           9        80     16
World              17           7     1           1         5     86
Total             320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         93.75        0.00  3.85        0.00      0.00   7.38
Electronic       0.00       84.21  3.85        2.22      8.82   7.38
Jazz             0.63        0.88 80.77        0.00      0.00   0.82
Metal/Punk       0.00        0.88  0.00       75.56      7.84   0.82
Pop/Rock         0.31        7.89  7.69       20.00     78.43  13.11
World            5.31        6.14  3.85        2.22      4.90  70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based results. As in the row-based case, the combined feature vector again achieves the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA%) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) SMMFCC2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           272           1     1           0         6     22
Electronic          0          84     0           2         8      4
Jazz               13           1    19           1         2     19
Metal/Punk          2           7     0          39        30      4
Pop/Rock            0          11     3           3        47     19
World              33          10     3           0         9     54
Total             320         114    26          45       102    122

(a) SMMFCC2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         85.00        0.88  3.85        0.00      5.88  18.03
Electronic       0.00       73.68  0.00        4.44      7.84   3.28
Jazz             4.06        0.88 73.08        2.22      1.96  15.57
Metal/Punk       0.63        6.14  0.00       86.67     29.41   3.28
Pop/Rock         0.00        9.65 11.54        6.67     46.08  15.57
World           10.31        8.77 11.54        0.00      8.82  44.26

(b) SMOSC2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           262           2     0           0         3     33
Electronic          0          83     0           1         9      6
Jazz               17           1    20           0         6     20
Metal/Punk          1           5     0          33        21      2
Pop/Rock            0          17     4          10        51     10
World              40           6     2           1        12     51
Total             320         114    26          45       102    122

(b) SMOSC2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         81.88        1.75  0.00        0.00      2.94  27.05
Electronic       0.00       72.81  0.00        2.22      8.82   4.92
Jazz             5.31        0.88 76.92        0.00      5.88  16.39
Metal/Punk       0.31        4.39  0.00       73.33     20.59   1.64
Pop/Rock         0.00       14.91 15.38       22.22     50.00   8.20
World           12.50        5.26  7.69        2.22     11.76  41.80

(c) SMASE2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           277           0     0           0         2     29
Electronic          0          83     0           1         5      2
Jazz                9           3    17           1         2     15
Metal/Punk          1           5     1          35        24      7
Pop/Rock            2          13     1           8        57     15
World              31          10     7           0        12     54
Total             320         114    26          45       102    122

(c) SMASE2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         86.56        0.00  0.00        0.00      1.96  23.77
Electronic       0.00       72.81  0.00        2.22      4.90   1.64
Jazz             2.81        2.63 65.38        2.22      1.96  12.30
Metal/Punk       0.31        4.39  3.85       77.78     23.53   5.74
Pop/Rock         0.63       11.40  3.85       17.78     55.88  12.30
World            9.69        8.77 26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           289           5     0           0         3     18
Electronic          0          89     0           2         4      4
Jazz                2           3    19           0         1     10
Metal/Punk          2           2     0          38        21      2
Pop/Rock            0          12     5           4        61     11
World              27           3     2           1        12     77
Total             320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         90.31        4.39  0.00        0.00      2.94  14.75
Electronic       0.00       78.07  0.00        4.44      3.92   3.28
Jazz             0.63        2.63 73.08        0.00      0.98   8.20
Metal/Punk       0.63        1.75  0.00       84.44     20.59   1.64
Pop/Rock         0.00       10.53 19.23        8.89     59.80   9.02
World            8.44        2.63  7.69        2.22     11.76  63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 31 and 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA%) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           300           2     1           0         3     19
Electronic          0          86     0           1         7      5
Jazz                2           0    18           0         0      3
Metal/Punk          1           4     0          35        18      2
Pop/Rock            1          16     4           8        67     13
World              16           6     3           1         7     80
Total             320         114    26          45       102    122

(a) SMMFCC3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         93.75        1.75  3.85        0.00      2.94  15.57
Electronic       0.00       75.44  0.00        2.22      6.86   4.10
Jazz             0.63        0.00 69.23        0.00      0.00   2.46
Metal/Punk       0.31        3.51  0.00       77.78     17.65   1.64
Pop/Rock         0.31       14.04 15.38       17.78     65.69  10.66
World            5.00        5.26 11.54        2.22      6.86  65.57

(b) SMOSC3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           300           0     0           0         1     13
Electronic          0          90     1           2         9      6
Jazz                0           0    21           0         0      4
Metal/Punk          0           2     0          31        21      2
Pop/Rock            0          11     3          10        64     10
World              20          11     1           2         7     87
Total             320         114    26          45       102    122

(b) SMOSC3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         93.75        0.00  0.00        0.00      0.98  10.66
Electronic       0.00       78.95  3.85        4.44      8.82   4.92
Jazz             0.00        0.00 80.77        0.00      0.00   3.28
Metal/Punk       0.00        1.75  0.00       68.89     20.59   1.64
Pop/Rock         0.00        9.65 11.54       22.22     62.75   8.20
World            6.25        9.65  3.85        4.44      6.86  71.31

(c) SMASE3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           296           2     1           0         0     17
Electronic          1          91     0           1         4      3
Jazz                0           2    19           0         0      5
Metal/Punk          0           2     1          34        20      8
Pop/Rock            2          13     4           8        71      8
World              21           4     1           2         7     81
Total             320         114    26          45       102    122

(c) SMASE3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         92.50        1.75  3.85        0.00      0.00  13.93
Electronic       0.31       79.82  0.00        2.22      3.92   2.46
Jazz             0.00        1.75 73.08        0.00      0.00   4.10
Metal/Punk       0.00        1.75  3.85       75.56     19.61   6.56
Pop/Rock         0.63       11.40 15.38       17.78     69.61   6.56
World            6.56        3.51  3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic           300           2     0           0         0      8
Electronic          2          95     0           2         7      9
Jazz                1           1    20           0         0      0
Metal/Punk          0           0     0          35        10      1
Pop/Rock            1          10     3           7        79     11
World              16           6     3           1         6     93
Total             320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         93.75        1.75  0.00        0.00      0.00   6.56
Electronic       0.63       83.33  0.00        4.44      6.86   7.38
Jazz             0.31        0.88 76.92        0.00      0.00   0.00
Metal/Punk       0.00        0.00  0.00       77.78      9.80   0.82
Pop/Rock         0.31        8.77 11.54       15.56     77.45   9.02
World            5.00        5.26 11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proc. of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proc. of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proc. of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proc. of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proc. of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proc. of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," Proc. of the 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," Proc. of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proc. of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.



Let $MFCC_i[l]$, $0 \le l < L$, denote the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times (W/2) + n}[l] \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le l < L \qquad (25)$$

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, l) \right|, \quad 0 \le m < W, \; 0 \le l < L \qquad (26)$$

where T is the total number of texture windows in the music track.
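The two-step procedure above (FFT along each feature-value trajectory over 50%-overlapped texture windows, then averaging the magnitude spectra over all windows) can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation; the function name and the (frames × feature dimensions) array layout are assumptions.

```python
import numpy as np

def modulation_spectrogram(feat, W=512):
    """Averaged magnitude modulation spectrogram, as in Eqs. (25)-(26).

    feat : (num_frames, L) array of per-frame feature values (e.g. MFCCs).
    W    : texture-window length in frames; successive windows overlap by 50%.
    Returns a (W, L) array: the magnitude FFT along each feature trajectory,
    averaged over all texture windows.
    """
    num_frames, _ = feat.shape
    hop = W // 2                      # 50% overlap between texture windows
    windows = []
    for start in range(0, num_frames - W + 1, hop):
        seg = feat[start:start + W]                       # one texture window
        windows.append(np.abs(np.fft.fft(seg, axis=0)))   # FFT along time axis
    return np.mean(windows, axis=0)   # time average over all texture windows
```

With a 2048-frame track and W = 512, seven half-overlapped texture windows are averaged.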

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \left( \bar{M}^{MFCC}(m, l) \right) \qquad (27)$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \left( \bar{M}^{MFCC}(m, l) \right) \qquad (28)$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \qquad (29)$$

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
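Step 3 can be sketched directly from Eqs. (27)-(29): within each modulation subband, the peak is the maximum over the modulation-frequency bins, the valley is the minimum, and the contrast is their difference. A minimal sketch under assumed conventions (the averaged modulation spectrogram is taken as a modulation-frequency × feature-dimension array, and subbands are given as (low, high) index pairs; names are illustrative):

```python
import numpy as np

def msc_msv(M_avg, band_edges):
    """Per-subband modulation spectral contrast and valley (Eqs. 27-29).

    M_avg      : (num_mod_bins, L) averaged modulation spectrogram.
    band_edges : list of (lo, hi) modulation-frequency index ranges.
    Returns (MSC, MSV), each of shape (J, L).
    """
    J, L = len(band_edges), M_avg.shape[1]
    MSC, MSV = np.zeros((J, L)), np.zeros((J, L))
    for j, (lo, hi) in enumerate(band_edges):
        band = M_avg[lo:hi]
        msp = band.max(axis=0)        # MSP: dominant rhythmic component
        msv = band.min(axis=0)        # MSV: non-rhythmic floor
        MSC[j] = msp - msv            # contrast between peak and valley
        MSV[j] = msv
    return MSC, MSV
```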

Fig. 2.5: The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.


Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $OSC_i[d]$, $0 \le d < D$, denote the d-th OSC coefficient of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times (W/2) + n}[d] \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D \qquad (30)$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, d) \right|, \quad 0 \le m < W, \; 0 \le d < D \qquad (31)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \left( \bar{M}^{OSC}(m, d) \right) \qquad (32)$$

$$MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \left( \bar{M}^{OSC}(m, d) \right) \qquad (33)$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \qquad (34)$$

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6: The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let $NASE_i[d]$, $0 \le d < D$, denote the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times (W/2) + n}[d] \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D \qquad (35)$$

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$$\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, d) \right|, \quad 0 \le m < W, \; 0 \le d < D \qquad (36)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \left( \bar{M}^{NASE}(m, d) \right) \qquad (37)$$

$$MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \left( \bar{M}^{NASE}(m, d) \right) \qquad (38)$$

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \qquad (39)$$

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


[Fig. 2.7: The flowchart for extracting MASE — framing of the music signal, NASE extraction per frame, DFT of each feature-value trajectory, windowing/averaging of the modulation spectrograms, and contrast/valley determination]

Table 2.4: Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
      0                  [0, 2)                          [0, 0.33)
      1                  [2, 4)                          [0.33, 0.66)
      2                  [4, 8)                          [0.66, 1.32)
      3                  [8, 16)                         [1.32, 2.64)
      4                  [16, 32)                        [2.64, 5.28)
      5                  [32, 64)                        [5.28, 10.56)
      6                  [64, 128)                       [10.56, 21.12)
      7                  [128, 256)                      [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$\mu_{rowMSC}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \qquad (40)$$

$$\sigma_{rowMSC}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu_{rowMSC}^{MFCC}(l) \right)^2 \right)^{1/2} \qquad (41)$$

$$\mu_{rowMSV}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \qquad (42)$$

$$\sigma_{rowMSV}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu_{rowMSV}^{MFCC}(l) \right)^2 \right)^{1/2} \qquad (43)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$\mathbf{f}_{row}^{MFCC} = [\mu_{rowMSC}^{MFCC}(0), \sigma_{rowMSC}^{MFCC}(0), \mu_{rowMSV}^{MFCC}(0), \sigma_{rowMSV}^{MFCC}(0), \ldots, \mu_{rowMSC}^{MFCC}(L-1), \sigma_{rowMSC}^{MFCC}(L-1), \mu_{rowMSV}^{MFCC}(L-1), \sigma_{rowMSV}^{MFCC}(L-1)]^T \qquad (44)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{colMSC}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \qquad (45)$$

$$\sigma_{colMSC}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu_{colMSC}^{MFCC}(j) \right)^2 \right)^{1/2} \qquad (46)$$

$$\mu_{colMSV}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \qquad (47)$$

$$\sigma_{colMSV}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu_{colMSV}^{MFCC}(j) \right)^2 \right)^{1/2} \qquad (48)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{MFCC} = [\mu_{colMSC}^{MFCC}(0), \sigma_{colMSC}^{MFCC}(0), \mu_{colMSV}^{MFCC}(0), \sigma_{colMSV}^{MFCC}(0), \ldots, \mu_{colMSC}^{MFCC}(J-1), \sigma_{colMSC}^{MFCC}(J-1), \mu_{colMSV}^{MFCC}(J-1), \sigma_{colMSV}^{MFCC}(J-1)]^T \qquad (49)$$

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

$$\mathbf{f}^{MFCC} = [(\mathbf{f}_{row}^{MFCC})^T, (\mathbf{f}_{col}^{MFCC})^T]^T \qquad (50)$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
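The row- and column-wise means and standard deviations of Eqs. (40)-(49) amount to aggregating an L×J matrix along each axis. A minimal sketch; note that it groups all means and deviations blockwise, whereas Eqs. (44) and (49) interleave them per dimension — the same 4L + 4J values in a different order:

```python
import numpy as np

def aggregate(MSC, MSV):
    """Row/column statistics of the (L, J) MSC and MSV matrices.

    Rows index spectral/cepstral feature values, columns index modulation
    subbands. Returns a vector of length 4L + 4J.
    """
    row = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                          MSV.mean(axis=1), MSV.std(axis=1)])  # 4L values
    col = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                          MSV.mean(axis=0), MSV.std(axis=0)])  # 4J values
    return np.concatenate([row, col])
```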

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$\mu_{rowMSC}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d) \qquad (51)$$

$$\sigma_{rowMSC}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu_{rowMSC}^{OSC}(d) \right)^2 \right)^{1/2} \qquad (52)$$

$$\mu_{rowMSV}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d) \qquad (53)$$

$$\sigma_{rowMSV}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu_{rowMSV}^{OSC}(d) \right)^2 \right)^{1/2} \qquad (54)$$


Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{OSC} = [\mu_{rowMSC}^{OSC}(0), \sigma_{rowMSC}^{OSC}(0), \mu_{rowMSV}^{OSC}(0), \sigma_{rowMSV}^{OSC}(0), \ldots, \mu_{rowMSC}^{OSC}(D-1), \sigma_{rowMSC}^{OSC}(D-1), \mu_{rowMSV}^{OSC}(D-1), \sigma_{rowMSV}^{OSC}(D-1)]^T \qquad (55)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{colMSC}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d) \qquad (56)$$

$$\sigma_{colMSC}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu_{colMSC}^{OSC}(j) \right)^2 \right)^{1/2} \qquad (57)$$

$$\mu_{colMSV}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d) \qquad (58)$$

$$\sigma_{colMSV}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu_{colMSV}^{OSC}(j) \right)^2 \right)^{1/2} \qquad (59)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = [\mu_{colMSC}^{OSC}(0), \sigma_{colMSC}^{OSC}(0), \mu_{colMSV}^{OSC}(0), \sigma_{colMSV}^{OSC}(0), \ldots, \mu_{colMSC}^{OSC}(J-1), \sigma_{colMSC}^{OSC}(J-1), \mu_{colMSV}^{OSC}(J-1), \sigma_{colMSV}^{OSC}(J-1)]^T \qquad (60)$$

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{OSC} = [(\mathbf{f}_{row}^{OSC})^T, (\mathbf{f}_{col}^{OSC})^T]^T \qquad (61)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu_{rowMSC}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d) \qquad (62)$$

$$\sigma_{rowMSC}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu_{rowMSC}^{NASE}(d) \right)^2 \right)^{1/2} \qquad (63)$$

$$\mu_{rowMSV}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d) \qquad (64)$$

$$\sigma_{rowMSV}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu_{rowMSV}^{NASE}(d) \right)^2 \right)^{1/2} \qquad (65)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{NASE} = [\mu_{rowMSC}^{NASE}(0), \sigma_{rowMSC}^{NASE}(0), \mu_{rowMSV}^{NASE}(0), \sigma_{rowMSV}^{NASE}(0), \ldots, \mu_{rowMSC}^{NASE}(D-1), \sigma_{rowMSC}^{NASE}(D-1), \mu_{rowMSV}^{NASE}(D-1), \sigma_{rowMSV}^{NASE}(D-1)]^T \qquad (66)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{colMSC}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d) \qquad (67)$$

$$\sigma_{colMSC}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu_{colMSC}^{NASE}(j) \right)^2 \right)^{1/2} \qquad (68)$$

$$\mu_{colMSV}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d) \qquad (69)$$

$$\sigma_{colMSV}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu_{colMSV}^{NASE}(j) \right)^2 \right)^{1/2} \qquad (70)$$


Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = [\mu_{colMSC}^{NASE}(0), \sigma_{colMSC}^{NASE}(0), \mu_{colMSV}^{NASE}(0), \sigma_{colMSV}^{NASE}(0), \ldots, \mu_{colMSC}^{NASE}(J-1), \sigma_{colMSC}^{NASE}(J-1), \mu_{colMSV}^{NASE}(J-1), \sigma_{colMSV}^{NASE}(J-1)]^T \qquad (71)$$

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{NASE} = [(\mathbf{f}_{row}^{NASE})^T, (\mathbf{f}_{col}^{NASE})^T]^T \qquad (72)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76, and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


[Fig. 2.8: The row-based modulation spectral feature values — for each feature dimension, the mean and standard deviation of the MSC/MSV entries are taken along the modulation-frequency axis]

[Fig. 2.9: The column-based modulation spectral feature values — for each modulation subband, the mean and standard deviation of the MSC/MSV entries are taken along the feature-dimension axis]


2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$$\bar{\mathbf{f}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{f}_{c,n} \qquad (73)$$

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{\mathbf{f}}_c$:

$$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \qquad (74)$$

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{max}(m) = \max_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m) \qquad (75)$$

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
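Eqs. (74)-(75) are a standard per-dimension min-max normalization, with the extrema taken over all training signals. A minimal sketch (function names are illustrative):

```python
import numpy as np

def fit_minmax(train_vectors):
    """Per-dimension minimum and maximum over all training vectors (Eq. 75)."""
    X = np.asarray(train_vectors, dtype=float)
    return X.min(axis=0), X.max(axis=0)

def normalize(f, f_min, f_max):
    """Linear normalization of a feature vector (Eq. 74)."""
    return (f - f_min) / (f_max - f_min)
```

The same minimum and maximum fitted on the training set are reused at test time, so every feature dimension is scaled consistently.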

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let $\mathbf{S}_W$ and $\mathbf{S}_B$ denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$$\mathbf{S}_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T \qquad (76)$$

where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T \qquad (77)$$

where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(\mathbf{A}) = \mathrm{tr}\left( (\mathbf{A}^T \mathbf{S}_W \mathbf{A})^{-1} (\mathbf{A}^T \mathbf{S}_B \mathbf{A}) \right) \qquad (78)$$

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $\mathbf{S}_W$ are calculated. Let $\mathbf{\Phi}$ denote the matrix whose columns are the orthonormal eigenvectors of $\mathbf{S}_W$, and $\mathbf{\Lambda}$ the diagonal matrix formed by the corresponding eigenvalues; thus $\mathbf{S}_W \mathbf{\Phi} = \mathbf{\Phi} \mathbf{\Lambda}$. Each training vector $\mathbf{x}$ is then whitening transformed by $\mathbf{\Phi} \mathbf{\Lambda}^{-1/2}$:

$$\mathbf{w} = (\mathbf{\Phi} \mathbf{\Lambda}^{-1/2})^T \mathbf{x} \qquad (79)$$

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_W^w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T \mathbf{S}_W (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix $\mathbf{I}$. Thus, the whitened between-class scatter matrix $\mathbf{S}_B^w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T \mathbf{S}_B (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix $\mathbf{\Psi}$ can be determined by finding the eigenvectors of $\mathbf{S}_B^w$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix $\mathbf{\Psi}$. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$$\mathbf{A}_{WLDA} = \mathbf{\Phi} \mathbf{\Lambda}^{-1/2} \mathbf{\Psi} \qquad (80)$$

$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let $\mathbf{x}$ denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^T \mathbf{x} \qquad (81)$$
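The whitening-plus-LDA procedure of Eqs. (76)-(80) can be sketched as below. This is an illustrative reimplementation, not the thesis code; the small floor on the eigenvalues of $\mathbf{S}_W$ is an added numerical safeguard not mentioned in the text.

```python
import numpy as np

def whitened_lda(X, y, h=None):
    """Whitened LDA transformation matrix A_WLDA (Eqs. 76-80).

    X : (N, H) training vectors; y : (N,) class labels.
    Returns the (H, h) matrix A = Phi Lambda^{-1/2} Psi, with h = C-1
    by default.
    """
    classes = np.unique(y)
    C, H = len(classes), X.shape[1]
    mean_all = X.mean(axis=0)
    Sw, Sb = np.zeros((H, H)), np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        Sw += d.T @ d                          # within-class scatter, Eq. (76)
        dm = (mc - mean_all)[:, None]
        Sb += len(Xc) * (dm @ dm.T)            # between-class scatter, Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)              # Sw Phi = Phi Lambda
    lam = np.maximum(lam, 1e-10)               # numerical safeguard (assumption)
    W = Phi @ np.diag(lam ** -0.5)             # whitening matrix Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                        # whitened between-class scatter
    ev, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(ev)[::-1]]         # eigenvectors, eigenvalues descending
    h = C - 1 if h is None else h
    return W @ Psi[:, :h]                      # A_WLDA, Eq. (80)
```

Each feature vector is then reduced by `y = A.T @ x`, as in Eq. (81).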

2.3 Music Genre Classification Phase

In the classification phase, the row-based and column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let $\mathbf{y}$ denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$$\bar{\mathbf{y}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{y}_{c,n} \qquad (82)$$

where $\mathbf{y}_{c,n}$ denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to $\mathbf{y}$:

$$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \qquad (83)$$
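The nearest centroid classifier of Eqs. (82)-(83) reduces to computing one centroid per genre and picking the genre whose centroid is closest in Euclidean distance. A minimal sketch (function names are illustrative):

```python
import numpy as np

def nearest_centroid_fit(Y, labels):
    """Per-genre centroids of the transformed training vectors (Eq. 82)."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    centroids = np.array([Y[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(y, classes, centroids):
    """Identified genre: minimum Euclidean distance to a centroid (Eq. 83)."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return classes[int(np.argmin(dists))]
```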

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \qquad (84)$$

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
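Eq. (84) is a prior-weighted average of the per-class accuracies, with $P_c$ estimated from the class counts of the test set. A minimal sketch:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Overall accuracy CA = sum_c P_c * CA_c (Eq. 84).

    per_class_acc : per-genre classification accuracies CA_c (fractions).
    class_counts  : number of test tracks per genre, used to estimate P_c.
    """
    total = sum(class_counts)
    return sum((n / total) * ca
               for ca, n in zip(per_class_acc, class_counts))
```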

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1: Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC1                      77.50
SMOSC1                       79.15
SMASE1                       77.78
SMMFCC1+SMOSC1+SMASE1        84.64


Table 3.2: Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each part, the upper matrix gives track counts and the lower matrix gives percentages of each column total.

(a) SMMFCC1 — counts
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         275        0          2       0         1       19
Electronic        0       91          0       1         7        6
Jazz              6        0         18       0         0        4
MetalPunk         2        3          0      36        20        4
PopRock           4       12          5       8        70       14
World            33        8          1       0         4       75
Total           320      114         26      45       102      122

(a) SMMFCC1 — accuracy (%)
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        85.94      0.00      7.69     0.00      0.98   15.57
Electronic      0.00     79.82      0.00     2.22      6.86    4.92
Jazz            1.88      0.00     69.23     0.00      0.00    3.28
MetalPunk       0.63      2.63      0.00    80.00     19.61    3.28
PopRock         1.25     10.53     19.23    17.78     68.63   11.48
World          10.31      7.02      3.85     0.00      3.92   61.48

(b) SMOSC1 — counts
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         292        1          1       0         2       10
Electronic        1       89          1       2        11       11
Jazz              4        0         19       1         1        6
MetalPunk         0        5          0      32        21        3
PopRock           0       13          3      10        61        8
World            23        6          2       0         6       84
Total           320      114         26      45       102      122

(b) SMOSC1 — accuracy (%)
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        91.25      0.88      3.85     0.00      1.96    8.20
Electronic      0.31     78.07      3.85     4.44     10.78    9.02
Jazz            1.25      0.00     73.08     2.22      0.98    4.92
MetalPunk       0.00      4.39      0.00    71.11     20.59    2.46
PopRock         0.00     11.40     11.54    22.22     59.80    6.56
World           7.19      5.26      7.69     0.00      5.88   68.85

(c) SMASE1 — counts
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         286        3          1       0         3       18
Electronic        0       87          1       1         9        5
Jazz              5        4         17       0         0        9
MetalPunk         0        4          1      36        18        4
PopRock           1       10          3       7        68       13
World            28        6          3       1         4       73
Total           320      114         26      45       102      122

(c) SMASE1 — accuracy (%)
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        89.38      2.63      3.85     0.00      2.94   14.75
Electronic      0.00     76.32      3.85     2.22      8.82    4.10
Jazz            1.56      3.51     65.38     0.00      0.00    7.38
MetalPunk       0.00      3.51      3.85    80.00     17.65    3.28
PopRock         0.31      8.77     11.54    15.56     66.67   10.66
World           8.75      5.26     11.54     2.22      3.92   59.84

(d) SMMFCC1+SMOSC1+SMASE1 — counts
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300        0          1       0         0        9
Electronic        0       96          1       1         9        9
Jazz              2        1         21       0         0        1
MetalPunk         0        1          0      34         8        1
PopRock           1        9          2       9        80       16
World            17        7          1       1         5       86
Total           320      114         26      45       102      122

(d) SMMFCC1+SMOSC1+SMASE1 — accuracy (%)
              Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75      0.00      3.85     0.00      0.00    7.38
Electronic      0.00     84.21      3.85     2.22      8.82    7.38
Jazz            0.63      0.88     80.77     0.00      0.00    0.82
MetalPunk       0.00      0.88      0.00    75.56      7.84    0.82
PopRock         0.31      7.89      7.69    20.00     78.43   13.11
World           5.31      6.14      3.85     2.22      4.90   70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. Consistent with the row-based results, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA%) for each column-based modulation spectral feature vector

Feature Set                  CA (%)
SMMFCC2                       70.64
SMOSC2                        68.59
SMASE2                        71.74
SMMFCC2+SMOSC2+SMASE2         78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         272           1      1          0        6     22
Electronic        0          84      0          2        8      4
Jazz             13           1     19          1        2     19
MetalPunk         2           7      0         39       30      4
PopRock           0          11      3          3       47     19
World            33          10      3          0        9     54
Total           320         114     26         45      102    122


(a) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         262           2      0          0        3     33
Electronic        0          83      0          1        9      6
Jazz             17           1     20          0        6     20
MetalPunk         1           5      0         33       21      2
PopRock           0          17      4         10       51     10
World            40           6      2          1       12     51
Total           320         114     26         45      102    122

(b) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         277           0      0          0        2     29
Electronic        0          83      0          1        5      2
Jazz              9           3     17          1        2     15
MetalPunk         1           5      1         35       24      7
PopRock           2          13      1          8       57     15
World            31          10      7          0       12     54
Total           320         114     26         45      102    122


(c) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         289           5      0          0        3     18
Electronic        0          89      0          2        4      4
Jazz              2           3     19          0        1     10
MetalPunk         2           2      0         38       21      2
PopRock           0          12      5          4       61     11
World            27           3      2          1       12     77
Total           320         114     26         45      102    122

(d) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA%) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                  CA (%)
SMMFCC3                       80.38
SMOSC3                        81.34
SMASE3                        81.21
SMMFCC3+SMOSC3+SMASE3         85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      1          0        3     19
Electronic        0          86      0          1        7      5
Jazz              2           0     18          0        0      3
MetalPunk         1           4      0         35       18      2
PopRock           1          16      4          8       67     13
World            16           6      3          1        7     80
Total           320         114     26         45      102    122

(a) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57


(b)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           0      0          0        1     13
Electronic        0          90      1          2        9      6
Jazz              0           0     21          0        0      4
MetalPunk         0           2      0         31       21      2
PopRock           0          11      3         10       64     10
World            20          11      1          2        7     87
Total           320         114     26         45      102    122

(b) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         296           2      1          0        0     17
Electronic        1          91      0          1        4      3
Jazz              0           2     19          0        0      5
MetalPunk         0           2      1         34       20      8
PopRock           2          13      4          8       71      8
World            21           4      1          2        7     81
Total           320         114     26         45      102    122

(c) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39


(d)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      0          0        0      8
Electronic        2          95      0          2        7      9
Jazz              1           1     20          0        0      0
MetalPunk         0           0      0         35       10      1
PopRock           1          10      3          7       79     11
World            16           6      3          1        6     93
Total           320         114     26         45      102    122

(d) (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus modulation spectral energy (MSE) as the feature values

Feature Set                MSCs & MSVs    MSE
SMMFCC1                          77.50  72.02
SMMFCC2                          70.64  69.82
SMMFCC3                          80.38  79.15
SMOSC1                           79.15  77.50
SMOSC2                           68.59  70.51
SMOSC3                           81.34  80.11
SMASE1                           77.78  76.41
SMASE2                           71.74  71.06
SMASE3                           81.21  79.15
SMMFCC1+SMOSC1+SMASE1            84.64  85.08
SMMFCC2+SMOSC2+SMASE2            78.60  79.01
SMMFCC3+SMOSC3+SMASE3            85.32  85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, Vol. 10, No. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, Vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, Vol. 7, No. 2, pp. 308-315, 2005.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, Vol. 14, No. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, Vol. 15, No. 5, pp. 1654-1664, 2007.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, Vol. 7, No. 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, Vol. 13, No. 2, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, Vol. 23, No. 2, pp. 133-141, March 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, No. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, Vol. 65, No. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 119-139, 1997.


high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)    (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.

Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.


Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{(tW/2)+n}[d] \, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \ 0 \le d < D    (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

M^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \ 0 \le d < D    (31)

where T is the total number of texture windows in the music track.
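The texture-window procedure of the modulation spectrum analysis step can be sketched in a few lines of numpy. This is a minimal illustration, not the thesis's implementation: the function name and the random toy input are hypothetical, and W = 512 with a 50% hop follows the text above.

```python
import numpy as np

def modulation_spectrogram(features, W=512):
    """Average magnitude modulation spectrogram over texture windows.

    features: (num_frames, D) array of per-frame feature values
    (e.g. OSC coefficients). Texture windows of length W overlap
    by 50% (hop = W // 2). Returns a (W, D) array M[m, d] averaged
    over all texture windows.
    """
    num_frames, D = features.shape
    hop = W // 2
    T = (num_frames - W) // hop + 1          # number of texture windows
    M = np.zeros((W, D))
    for t in range(T):
        segment = features[t * hop : t * hop + W, :]     # W frames
        # FFT along the time trajectory of each feature dimension
        M += np.abs(np.fft.fft(segment, axis=0))
    return M / T

# toy usage: 2048 frames of 20 feature values
M = modulation_spectrogram(np.random.randn(2048, 20))
print(M.shape)   # (512, 20)
```

Note that the FFT is taken along the frame axis (time), not the frequency axis, which is what makes this a modulation spectrum rather than an ordinary spectrum.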

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)    (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)    (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)    (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.
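The contrast/valley determination reduces to a per-subband max and min over the modulation frequency index ranges of Table 2.4. A minimal numpy sketch (the function name and the random toy input are illustrative):

```python
import numpy as np

# Modulation-subband index boundaries from Table 2.4 (J = 8 subbands)
SUBBAND_EDGES = [0, 2, 4, 8, 16, 32, 64, 128, 256]

def msc_msv(M):
    """Per-subband peak (MSP), valley (MSV) and contrast (MSC = MSP - MSV).

    M: (W, D) averaged modulation spectrogram with W >= 256.
    Returns MSC and MSV as (J, D) matrices.
    """
    D = M.shape[1]
    J = len(SUBBAND_EDGES) - 1
    MSP = np.zeros((J, D))
    MSV = np.zeros((J, D))
    for j in range(J):
        lo, hi = SUBBAND_EDGES[j], SUBBAND_EDGES[j + 1]
        MSP[j] = M[lo:hi].max(axis=0)    # dominant rhythmic component
        MSV[j] = M[lo:hi].min(axis=0)    # non-rhythmic floor
    return MSP - MSV, MSV

MSC, MSV = msc_msv(np.abs(np.random.randn(512, 20)))
print(MSC.shape)   # (8, 20)
```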

Fig. 2.6 The flowchart for extracting MOSC


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{(tW/2)+n}[d] \, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \ 0 \le d < D    (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

M^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \ 0 \le d < D    (36)

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)    (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)    (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

Fig. 2.7 The flowchart for extracting MASE

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at various modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband of different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC-row}^{MFCC}(l) \right)^2 \right)^{1/2}    (41)

\mu_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV-row}^{MFCC}(l) \right)^2 \right)^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), \mu_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), \ldots, \mu_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), \mu_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T    (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

\mu_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), \mu_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), \ldots, \mu_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), \mu_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L + 4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L + 4J. That is, the overall feature dimension of SMMFCC is 80 + 32 = 112.
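The aggregation above is simply row-wise and column-wise means and standard deviations of the MSC and MSV matrices. The sketch below assumes a (J, D) matrix layout (subbands × feature dimensions) and concatenates the statistics block-wise rather than in the exact interleaved order of the formulas; the function name is illustrative:

```python
import numpy as np

def aggregate(MSC, MSV):
    """Row/column means and standard deviations of the MSC and MSV
    matrices. MSC, MSV: (J, D). 'Row-based' statistics are taken per
    feature dimension (averaging over subbands, axis=0); 'column-based'
    statistics per subband (averaging over feature dimensions, axis=1).
    Returns the combined feature vector of length 4*D + 4*J.
    """
    f_row = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                            MSV.mean(axis=0), MSV.std(axis=0)])   # 4D values
    f_col = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                            MSV.mean(axis=1), MSV.std(axis=1)])   # 4J values
    return np.concatenate([f_row, f_col])

f = aggregate(np.random.rand(8, 20), np.random.rand(8, 20))
print(f.size)   # 4*20 + 4*8 = 112
```

`np.std` with its default `ddof=0` matches the population-style standard deviation used in the formulas (division by J or L, not J-1).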

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu_{MSC-row}^{OSC}(d) \right)^2 \right)^{1/2}    (52)

\mu_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu_{MSV-row}^{OSC}(d) \right)^2 \right)^{1/2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), \mu_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), \ldots, \mu_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), \mu_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu_{MSC-col}^{OSC}(j) \right)^2 \right)^{1/2}    (57)

\mu_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu_{MSV-col}^{OSC}(j) \right)^2 \right)^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), \mu_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots, \mu_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), \mu_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D + 4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d) \right)^2 \right)^{1/2}    (63)

\mu_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d) \right)^2 \right)^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), \mu_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, \mu_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), \mu_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j) \right)^2 \right)^{1/2}    (68)

\mu_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j) \right)^2 \right)^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), \mu_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, \mu_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), \mu_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D + 4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMASE is 76 + 32 = 108.

Fig. 2.8 The row-based modulation spectral feature values

Fig. 2.9 The column-based modulation spectral feature values

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th normalized representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m),
f_{min}(m) = \min_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
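This is a standard min-max normalization; the constants f_min and f_max must be taken from the training set only and then reused for test vectors. A minimal sketch (the function name and the guard against constant feature dimensions are added assumptions, not from the thesis):

```python
import numpy as np

def minmax_normalize(train_vectors):
    """Linear (min-max) normalization over training data.

    train_vectors: (N, H) feature vectors of all training tracks.
    Returns the normalized vectors plus (f_min, f_max) so that unseen
    test vectors can be scaled with the same constants.
    """
    f_min = train_vectors.min(axis=0)
    f_max = train_vectors.max(axis=0)
    # guard against division by zero for constant feature dimensions
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (train_vectors - f_min) / span, f_min, f_max

X = np.array([[1.0, 10.0], [3.0, 20.0], [2.0, 15.0]])
Xn, f_min, f_max = minmax_normalize(X)
print(Xn.min(axis=0), Xn.max(axis=0))   # [0. 0.] [1. 1.]
```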

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification

39

accuracy at a lower dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximize the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

)()(1

T

1sumsum

= =

minusminus=C

ccnc

N

ncnc

c

xxxxSW (76)

where xcn is the n-th feature vector labeled as class c cx is the mean vector of class

c C is the total number of music classes and Nc is the number of training vectors

labeled as class c The between-class scatter matrix is given by

))((1

Tsum=

minusminus=C

ccccN xxxxSB (77)

where x is the mean vector of all training vectors The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion JF defined as the ratio of between-class scatter to within-class scatter

(78) ))()(()( T1T ASAASAA BWminus= trJ F

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space In this study a whitening procedure is intergrated with LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the

40

orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the

corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then

whitening transformed by ΦΛ-12

(79) )( T21 xΦΛx minus=w

It can be shown that the whitened within-class scatter matrix

S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2})

derived from all the whitened training vectors will become an identity matrix I. Thus, the whitened between-class scatter matrix

S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2})

contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi    (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x    (81)
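The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows in Python with NumPy. This is a minimal illustration, not the thesis implementation; the function name, the `eps` regularizer, and the use of `numpy.linalg.eigh` are our own choices.

```python
import numpy as np

def whitened_lda(X, y, eps=1e-10):
    """Return the whitened LDA matrix A_WLDA of shape (H, C-1) (Eqs. 76-80)."""
    classes = np.unique(y)
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))                        # within-class scatter (76)
    Sb = np.zeros((H, H))                        # between-class scatter (77)
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
    lam, Phi = np.linalg.eigh(Sw)                # Sw = Phi diag(lam) Phi^T
    W = Phi @ np.diag(1.0 / np.sqrt(lam + eps))  # whitening map Phi Lam^{-1/2}
    Sb_w = W.T @ Sb @ W                          # whitened between-class scatter
    mu, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(mu)[::-1][:len(classes) - 1]  # top C-1 eigenvectors
    return W @ Psi[:, order]                     # A_WLDA = Phi Lam^{-1/2} Psi (80)
```

Each feature vector is then reduced with `A.T @ x`, as in Eq. (81).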

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA


transformed feature vector In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)

where ycn denotes the whitened LDA transformed feature vector of the n-th music

track labeled as the c-th music genre cy is the representative feature vector of the

c-th music genre and Nc is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
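The nearest centroid rule of Eqs. (82)-(83) amounts to a few lines of NumPy; this sketch uses illustrative names (`train_centroids`, `classify`) not taken from the thesis.

```python
import numpy as np

def train_centroids(Y, labels):
    """Representative vector per genre: mean of its transformed vectors (Eq. 82)."""
    classes = np.unique(labels)
    centroids = np.array([Y[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify(y, classes, centroids):
    """Return the genre whose centroid is nearest in Euclidean distance (Eq. 83)."""
    distances = np.linalg.norm(centroids - y, axis=1)
    return classes[np.argmin(distances)]
```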

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
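Because P_c is the class proportion in the test set, Eq. (84) reduces to the total number of correctly classified tracks divided by the total number of tracks. A minimal sketch (function name ours):

```python
def overall_accuracy(per_class_correct, per_class_total):
    """Overall CA of Eq. (84): sum over classes of P_c * CA_c."""
    N = sum(per_class_total)
    return sum((n_c / N) * (correct / n_c)   # P_c * CA_c
               for correct, n_c in zip(per_class_correct, per_class_total))
```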

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA%) for the row-based modulation spectral feature vectors

Feature Set                CA (%)
SMMFCC1                    77.50
SMOSC1                     79.15
SMASE1                     77.78
SMMFCC1+SMOSC1+SMASE1      84.64


Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Columns correspond to the actual genre and rows to the predicted genre; for each part the first matrix gives track counts and the second the corresponding percentages of each column.

(a) SMMFCC1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         275           0      2          0        1     19
Electronic        0          91      0          1        7      6
Jazz              6           0     18          0        0      4
MetalPunk         2           3      0         36       20      4
PopRock           4          12      5          8       70     14
World            33           8      1          0        4     75
Total           320         114     26         45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         292           1      1          0        2     10
Electronic        1          89      1          2       11     11
Jazz              4           0     19          1        1      6
MetalPunk         0           5      0         32       21      3
PopRock           0          13      3         10       61      8
World            23           6      2          0        6     84
Total           320         114     26         45      102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         286           3      1          0        3     18
Electronic        0          87      1          1        9      5
Jazz              5           4     17          0        0      9
MetalPunk         0           4      1         36       18      4
PopRock           1          10      3          7       68     13
World            28           6      3          1        4     73
Total           320         114     26         45      102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           0      1          0        0      9
Electronic        0          96      1          1        9      9
Jazz              2           1     21          0        0      1
MetalPunk         0           1      0         34        8      1
PopRock           1           9      2          9       80     16
World            17           7      1          1        5     86
Total           320         114     26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. Consistent with the row-based result, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA%) for the column-based modulation spectral feature vectors

Feature Set                CA (%)
SMMFCC2                    70.64
SMOSC2                     68.59
SMASE2                     71.74
SMMFCC2+SMOSC2+SMASE2      78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns correspond to the actual genre and rows to the predicted genre; for each part the first matrix gives track counts and the second the corresponding percentages of each column.

(a) SMMFCC2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         272           1      1          0        6     22
Electronic        0          84      0          2        8      4
Jazz             13           1     19          1        2     19
MetalPunk         2           7      0         39       30      4
PopRock           0          11      3          3       47     19
World            33          10      3          0        9     54
Total           320         114     26         45      102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         262           2      0          0        3     33
Electronic        0          83      0          1        9      6
Jazz             17           1     20          0        6     20
MetalPunk         1           5      0         33       21      2
PopRock           0          17      4         10       51     10
World            40           6      2          1       12     51
Total           320         114     26         45      102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         277           0      0          0        2     29
Electronic        0          83      0          1        5      2
Jazz              9           3     17          1        2     15
MetalPunk         1           5      1         35       24      7
PopRock           2          13      1          8       57     15
World            31          10      7          0       12     54
Total           320         114     26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         289           5      0          0        3     18
Electronic        0          89      0          2        4      4
Jazz              2           3     19          0        1     10
MetalPunk         2           2      0         38       21      2
PopRock           0          12      5          4       61     11
World            27           3      2          1       12     77
Total           320         114     26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA%) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                CA (%)
SMMFCC3                    80.38
SMOSC3                     81.34
SMASE3                     81.21
SMMFCC3+SMOSC3+SMASE3      85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns correspond to the actual genre and rows to the predicted genre; for each part the first matrix gives track counts and the second the corresponding percentages of each column.

(a) SMMFCC3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      1          0        3     19
Electronic        0          86      0          1        7      5
Jazz              2           0     18          0        0      3
MetalPunk         1           4      0         35       18      2
PopRock           1          16      4          8       67     13
World            16           6      3          1        7     80
Total           320         114     26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           0      0          0        1     13
Electronic        0          90      1          2        9      6
Jazz              0           0     21          0        0      4
MetalPunk         0           2      0         31       21      2
PopRock           0          11      3         10       64     10
World            20          11      1          2        7     87
Total           320         114     26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         296           2      1          0        0     17
Electronic        1          91      0          1        4      3
Jazz              0           2     19          0        0      5
MetalPunk         0           2      1         34       20      8
PopRock           2          13      4          8       71      8
World            21           4      1          2        7     81
Total           320         114     26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      0          0        0      8
Electronic        2          95      0          2        7      9
Jazz              1           1     20          0        0      0
MetalPunk         0           0      0         35       10      1
PopRock           1          10      3          7       79     11
World            16           6      3          1        6     93
Total           320         114     26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.


Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) features

Feature Set                MSCs & MSVs    MSE
SMMFCC1                    77.50          72.02
SMMFCC2                    70.64          69.82
SMMFCC3                    80.38          79.15
SMOSC1                     79.15          77.50
SMOSC2                     68.59          70.51
SMOSC3                     81.34          80.11
SMASE1                     77.78          76.41
SMASE2                     71.74          71.06
SMASE3                     81.21          79.15
SMMFCC1+SMOSC1+SMASE1      84.64          85.08
SMMFCC2+SMOSC2+SMASE2      78.60          79.01
SMMFCC3+SMOSC3+SMASE3      85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.

[8] U. Bağcı and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, issue 6, pp. 1028-1035, Dec. 2005.

[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio features," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, issue 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, issue 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.


Step 1 Framing and OSC Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the OSC coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let OSC_i[d], 0 ≤ d < D, be the d-th OSC of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left| \sum_{n=0}^{W-1} OSC_{(t \times W/2)+n}[d] \, e^{-j 2\pi m n / W} \right|, \quad 0 \le m < W,\ 0 \le d < D    (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W,\ 0 \le d < D    (31)

where T is the total number of texture windows in the music track
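The two steps above (Eqs. 30-31) can be sketched as follows; `modulation_spectrogram` is an illustrative name, and the sketch assumes the feature matrix covers at least one full texture window.

```python
import numpy as np

def modulation_spectrogram(F, W=512):
    """Averaged magnitude modulation spectrogram (Eqs. 30-31).

    F: (I, D) array holding one D-dimensional feature vector (e.g. OSC)
    per frame. Texture windows of length W overlap by 50%; the FFT runs
    along the time trajectory of each feature dimension."""
    I, D = F.shape
    hop = W // 2                               # 50% overlap
    T = (I - W) // hop + 1                     # number of texture windows
    M = np.zeros((W, D))
    for t in range(T):
        segment = F[t * hop : t * hop + W]     # frames tW/2 ... tW/2 + W - 1
        M += np.abs(np.fft.fft(segment, axis=0))
    return M / T
```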

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:


MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)    (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 2.6 The flowchart for extracting MOSC
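The peak/valley/contrast computation of Eqs. (32)-(34) can be sketched as below; the subband edges follow Table 2.4, while the function name is ours.

```python
import numpy as np

# Modulation subband edges as FFT-bin indices (Table 2.4)
SUBBAND_EDGES = [0, 2, 4, 8, 16, 32, 64, 128, 256]

def msc_msv(M_avg):
    """Per-subband contrast and valley (Eqs. 32-34).

    M_avg: (W, D) averaged modulation spectrogram; returns (J, D) MSC, MSV."""
    J = len(SUBBAND_EDGES) - 1
    D = M_avg.shape[1]
    MSP = np.zeros((J, D))
    MSV = np.zeros((J, D))
    for j in range(J):
        lo, hi = SUBBAND_EDGES[j], SUBBAND_EDGES[j + 1]
        MSP[j] = M_avg[lo:hi].max(axis=0)      # dominant rhythmic component
        MSV[j] = M_avg[lo:hi].min(axis=0)      # non-rhythmic floor
    return MSP - MSV, MSV                      # MSC = MSP - MSV (34)
```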


2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \left| \sum_{n=0}^{W-1} NASE_{(t \times W/2)+n}[d] \, e^{-j 2\pi m n / W} \right|, \quad 0 \le m < W,\ 0 \le d < D    (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W,\ 0 \le d < D    (36)

where T is the total number of texture windows in the music track

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)    (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 2.7 The flowchart for extracting MASE

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
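The bin-index ranges of Table 2.4 map to Hz through the feature frame rate. The rate itself is not stated at this point in the text; 84.48 frames per second is inferred from the table (bin 256 corresponding to 42.24 Hz with W = 512), so treat it as an assumption.

```python
FRAME_RATE = 84.48   # frames per second, inferred from Table 2.4 (assumption)
W = 512              # texture-window length in frames

def bin_to_hz(m):
    """Modulation frequency in Hz of FFT bin m."""
    return m * FRAME_RATE / W

edges = [0, 2, 4, 8, 16, 32, 64, 128, 256]
intervals = [(round(bin_to_hz(lo), 2), round(bin_to_hz(hi), 2))
             for lo, hi in zip(edges[:-1], edges[1:])]
```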

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at various modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along


each row (and each column) of the MSC and MSV matrices will be computed as the

feature values

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 le l lt L) row of

the MSC and MSV matrices of MMFCC can be computed as follows

\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2}    (41)

\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2}    (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), \mu_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), \ldots, \mu_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), \mu_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T    (44)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), \mu_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), \ldots, \mu_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), \mu_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
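The row/column aggregation of Eqs. (40)-(50) is plain mean/std bookkeeping. The sketch below groups the statistics (all row means, then all row deviations, and so on) instead of interleaving them per index as Eq. (44) does; the content of the resulting vector is the same.

```python
import numpy as np

def aggregate(MSC, MSV):
    """Concatenate row and column statistics of the (J, L) MSC/MSV matrices.

    Returns a vector of length 4L + 4J (e.g. 4*20 + 4*8 = 112 for SMMFCC).
    np.std uses the population form, matching Eqs. (41), (43), (46), (48)."""
    f_row = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                            MSV.mean(axis=0), MSV.std(axis=0)])   # size 4L
    f_col = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                            MSV.mean(axis=1), MSV.std(axis=1)])   # size 4J
    return np.concatenate([f_row, f_col])
```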

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 le d lt D) row of

the MSC and MSV matrices of MOSC can be computed as follows

\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2}    (52)

\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2}    (54)


Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), \mu_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), \ldots, \mu_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), \mu_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2}    (57)

\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2}    (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), \mu_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), \ldots, \mu_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), \mu_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d) \right)^2 \right)^{1/2}    (63)

\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d) \right)^2 \right)^{1/2}    (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), \mu_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), \ldots, \mu_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), \mu_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j) \right)^2 \right)^{1/2}    (68)

\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j) \right)^2 \right)^{1/2}    (70)

36

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as:

f_{col}^{NASE} = [u_{MSC-col}^{NASE}(0), σ_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), σ_{MSV-col}^{NASE}(0), …, u_{MSC-col}^{NASE}(J−1), σ_{MSC-col}^{NASE}(J−1), u_{MSV-col}^{NASE}(J−1), σ_{MSV-col}^{NASE}(J−1)]^T (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

Fig. 28 The row-based modulation spectral feature values: for each row of the MSC and MSV matrices of a texture window (axes: modulation frequency × feature dimension), the mean μ_row and standard deviation σ_row are computed along the modulation frequency axis

Fig. 29 The column-based modulation spectral feature values: for each column of the MSC and MSV matrices, the mean μ_col and standard deviation σ_col are computed along the feature dimension axis

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n} (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)), 1 ≤ c ≤ C (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th normalized representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m) (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
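As a minimal sketch, the normalization of Eqs. (74)-(75) can be written in a few lines of NumPy; the function and variable names here (`normalize_features`, `train`, `reps`) are illustrative, not from the thesis:

```python
import numpy as np

def normalize_features(train_feats, rep_vectors):
    """Min-max normalize representative vectors using the per-dimension
    extrema of ALL training feature vectors, as in Eqs. (74)-(75)."""
    f_min = train_feats.min(axis=0)   # f_min(m) over all training signals
    f_max = train_feats.max(axis=0)   # f_max(m)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # guard against /0
    return (rep_vectors - f_min) / span

# toy example: 6 training vectors (rows), 3 feature dimensions
train = np.array([[0.0, 10.0, 5.0],
                  [2.0, 20.0, 5.0],
                  [4.0, 30.0, 5.0],
                  [1.0, 15.0, 5.0],
                  [3.0, 25.0, 5.0],
                  [2.0, 18.0, 5.0]])
reps = train[:3].copy()               # pretend these are 3 genre centroids
normed = normalize_features(train, reps)
# every normalized value now lies in [0, 1]
```

Note that the extrema are taken over the individual training vectors, not over the genre centroids, so the normalized centroids are guaranteed to stay within [0, 1].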

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as:

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by:

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T (77)

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) ) (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{−1/2}:

x_w = (ΦΛ^{−1/2})^T x (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as:

A_WLDA = ΦΛ^{−1/2} Ψ (80)

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by:

y = A_WLDA^T x (81)
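The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows; this is a minimal NumPy illustration on synthetic data, not the thesis implementation, and it assumes S_W is non-singular:

```python
import numpy as np

def whitened_lda(X, labels, n_classes):
    """Whitened LDA transform following Eqs. (76)-(81): whiten with the
    eigendecomposition of S_W, then keep the leading eigenvectors of the
    whitened between-class scatter matrix."""
    dim = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for c in range(n_classes):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        diff = Xc - mc
        Sw += diff.T @ diff                      # Eq. (76)
        db = (mc - mean_all)[:, None]
        Sb += len(Xc) * (db @ db.T)              # Eq. (77)
    evals, Phi = np.linalg.eigh(Sw)              # S_W Φ = ΦΛ
    W = Phi @ np.diag(1.0 / np.sqrt(evals))      # ΦΛ^{-1/2}
    Sb_w = W.T @ Sb @ W                          # whitened S_B
    evals_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(evals_b)[::-1][:n_classes - 1]]
    return W @ Psi                               # A_WLDA, Eq. (80)

# synthetic data: 3 well-separated classes, H = 4 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c * 3.0, 1.0, size=(30, 4)) for c in range(3)])
y = np.repeat(np.arange(3), 30)
A = whitened_lda(X, y, 3)   # maps H = 4 down to h = C - 1 = 2
Y = X @ A                   # Eq. (81) applied row-wise
```

In practice S_W can be near-singular when the feature dimension approaches the number of training vectors, which is one reason the thesis applies this transform after feature aggregation rather than on raw frame features.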

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n} (82)

where y_{c,n} denotes the whitened LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c) (83)
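The decision rule of Eq. (83) amounts to one distance computation per genre centroid; a minimal sketch with made-up 2-D centroids:

```python
import numpy as np

def nearest_centroid(y, centroids):
    """Return the index s of the genre centroid with minimum Euclidean
    distance to the transformed feature vector y, as in Eq. (83)."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(dists))

# toy centroids for C = 3 genres in a 2-D transformed space
cents = np.array([[0.0, 0.0],
                  [5.0, 5.0],
                  [-4.0, 2.0]])
s = nearest_centroid(np.array([4.2, 5.1]), cents)  # closest to centroid 1
```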

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{1≤c≤C} P_c · CA_c (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
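Since P_c is simply the c-th genre's share of the test set, Eq. (84) is a class-size-weighted average of the per-class accuracies; a minimal sketch (the perfect-classifier accuracies are illustrative):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Weighted overall accuracy of Eq. (84): each class accuracy CA_c is
    weighted by P_c, the class's share of the test set."""
    total = sum(class_counts)
    return sum(n / total * ca for ca, n in zip(per_class_acc, class_counts))

# per-class test-set sizes of the ISMIR2004 split used here
counts = [320, 114, 26, 45, 102, 122]
accs = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # a hypothetical perfect classifier
ca = overall_accuracy(accs, counts)      # -> 1.0
```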

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector

Feature Set                CA (%)
SMMFCC1                    77.50
SMOSC1                     79.15
SMASE1                     77.78
SMMFCC1+SMOSC1+SMASE1      84.64


Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. In each pair, columns are the actual genres (column sums equal the per-genre test-set sizes), rows are the classified genres; the first matrix gives track counts and the second the per-class rates in %.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         275           0     2          0        1     19
Electronic        0          91     0          1        7      6
Jazz              6           0    18          0        0      4
MetalPunk         2           3     0         36       20      4
PopRock           4          12     5          8       70     14
World            33           8     1          0        4     75
Total           320         114    26         45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         292           1     1          0        2     10
Electronic        1          89     1          2       11     11
Jazz              4           0    19          1        1      6
MetalPunk         0           5     0         32       21      3
PopRock           0          13     3         10       61      8
World            23           6     2          0        6     84
Total           320         114    26         45      102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         286           3     1          0        3     18
Electronic        0          87     1          1        9      5
Jazz              5           4    17          0        0      9
MetalPunk         0           4     1         36       18      4
PopRock           1          10     3          7       68     13
World            28           6     3          1        4     73
Total           320         114    26         45      102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     1          0        0      9
Electronic        0          96     1          1        9      9
Jazz              2           1    21          0        0      1
MetalPunk         0           1     0         34        8      1
PopRock           1           9     2          9       80     16
World            17           7     1          1        5     86
Total           320         114    26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for each column-based modulation spectral feature vector

Feature Set                CA (%)
SMMFCC2                    70.64
SMOSC2                     68.59
SMASE2                     71.74
SMMFCC2+SMOSC2+SMASE2      78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. In each pair, columns are the actual genres, rows are the classified genres; the first matrix gives track counts and the second the per-class rates in %.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         272           1     1          0        6     22
Electronic        0          84     0          2        8      4
Jazz             13           1    19          1        2     19
MetalPunk         2           7     0         39       30      4
PopRock           0          11     3          3       47     19
World            33          10     3          0        9     54
Total           320         114    26         45      102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         262           2     0          0        3     33
Electronic        0          83     0          1        9      6
Jazz             17           1    20          0        6     20
MetalPunk         1           5     0         33       21      2
PopRock           0          17     4         10       51     10
World            40           6     2          1       12     51
Total           320         114    26         45      102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         277           0     0          0        2     29
Electronic        0          83     0          1        5      2
Jazz              9           3    17          1        2     15
MetalPunk         1           5     1         35       24      7
PopRock           2          13     1          8       57     15
World            31          10     7          0       12     54
Total           320         114    26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         289           5     0          0        3     18
Electronic        0          89     0          2        4      4
Jazz              2           3    19          0        1     10
MetalPunk         2           2     0         38       21      2
PopRock           0          12     5          4       61     11
World            27           3     2          1       12     77
Total           320         114    26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vector gets better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                CA (%)
SMMFCC3                    80.38
SMOSC3                     81.34
SMASE3                     81.21
SMMFCC3+SMOSC3+SMASE3      85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. In each pair, columns are the actual genres, rows are the classified genres; the first matrix gives track counts and the second the per-class rates in %.

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation spectral energy (MSE) features

Feature Set                MSCs & MSVs    MSE
SMMFCC1                    77.50          72.02
SMMFCC2                    70.64          69.82
SMMFCC3                    80.38          79.15
SMOSC1                     79.15          77.50
SMOSC2                     68.59          70.51
SMOSC3                     81.34          80.11
SMASE1                     77.78          76.41
SMASE2                     71.74          71.06
SMASE3                     81.21          79.15
SMMFCC1+SMOSC1+SMASE1      84.64          85.08
SMMFCC2+SMOSC2+SMASE2      78.60          79.01
SMMFCC3+SMOSC3+SMASE3      85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features has been proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox "Features and classifiers for the automatic classification of musical audio signals" Proceedings of International Conference on Music Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664


[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical genre classification" in Proc of the 6th Int Conf on Digital Audio Effects 8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200 March 2005

[16] J J Aucouturier and F Pachet "Representing musical genre: a state of the art" Journal of New Music Research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in Proc Int Conf on Acoustics Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch histogram in audio and symbolic music information retrieval" in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp 708-716 November 2000

[22] R Meddis and L O'Mard "A unitary model of pitch perception" Journal of the Acoustical Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using the modulation spectrogram" Speech Communication Vol 25 No 1 pp 117-132 1998

[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for content identification" IEEE Transactions on Signal Processing Vol 52 No 10 pp 3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond: audio content indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao "Automatic music classification and summarization" IEEE Transactions on Speech and Audio Processing Vol 13 No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar "Content based audio classification and retrieval using joint time-frequency analysis" in 2004 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 pp V-665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and classification using local discriminant bases" IEEE Transactions on Audio Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Y Freund and R E Schapire "A decision-theoretic generalization of on-line learning and an application to boosting" Journal of Computer and System Sciences 55 (1) (1997) 119-139


MSP^{OSC}(d, j) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M^{OSC}(m, d) (32)

MSV^{OSC}(d, j) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M^{OSC}(m, d) (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(d, j) = MSP^{OSC}(d, j) − MSV^{OSC}(d, j) (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.

Fig. 26 The flowchart for extracting MOSC

2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 27 shows the flowchart for extracting MASE, and the detailed steps will be described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 ≤ d < D, be the d-th NASE coefficient of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} NASE_{t·(W/2)+n}[d] · e^{−j2πnm/W}, 0 ≤ m < W, 0 ≤ d < D (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^{NASE}(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|, 0 ≤ m < W, 0 ≤ d < D (36)

where T is the total number of texture windows in the music track.
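The two steps of Eqs. (35)-(36) can be sketched as follows; this is a minimal illustration assuming a frame-level feature matrix `feats` of shape (num_frames, D), W = 512, and a 50% hop, with illustrative (not thesis) function names:

```python
import numpy as np

def modulation_spectrogram(feats, W=512):
    """Averaged magnitude modulation spectrogram (Eqs. 35-36).
    feats: (num_frames, D) frame-level feature trajectories.
    An FFT of length W is taken along time for each of the D feature
    dimensions, with 50% overlap between successive texture windows."""
    hop = W // 2
    num_frames, D = feats.shape
    windows = []
    for start in range(0, num_frames - W + 1, hop):
        seg = feats[start:start + W]                     # one texture window
        windows.append(np.abs(np.fft.fft(seg, axis=0)))  # |M_t(m, d)|
    return np.mean(windows, axis=0)                      # (W, D) average

# toy trajectory: D = 2 features, one oscillating at 4 cycles per window
n = 2048
t = np.arange(n)
feats = np.stack([np.sin(2 * np.pi * 4 * t / 512), np.ones(n)], axis=1)
M = modulation_spectrogram(feats, W=512)
# the sinusoidal feature peaks at modulation bin m = 4
```

A feature that oscillates with a steady beat shows up as a peak at the corresponding modulation bin, which is exactly what the subband peak/valley statistics of Step 3 then measure.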

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(d, j) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M^{NASE}(m, d) (37)

MSV^{NASE}(d, j) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M^{NASE}(m, d) (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(d, j) = MSP^{NASE}(d, j) − MSV^{NASE}(d, j) (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.


Fig. 27 The flowchart for extracting MASE (framing, NASE extraction, DFT along each feature trajectory, windowed averaging of the modulation spectrum, and contrast/valley determination)

Table 2.4 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
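Using the bin ranges of Table 2.4, the peak/valley/contrast computation of Eqs. (37)-(39) can be sketched as follows; the names `SUBBANDS` and `msc_msv` are illustrative, and the random input stands in for an averaged modulation spectrogram:

```python
import numpy as np

# modulation subband index ranges [Φ_jl, Φ_jh) from Table 2.4
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16),
            (16, 32), (32, 64), (64, 128), (128, 256)]

def msc_msv(M):
    """Per-subband modulation spectral peak, valley, and contrast.
    M: (W, D) averaged modulation spectrogram; only the bins below W/2
    are used. Returns MSC and MSV as (D, J) matrices."""
    D = M.shape[1]
    J = len(SUBBANDS)
    msc = np.zeros((D, J))
    msv = np.zeros((D, J))
    for j, (lo, hi) in enumerate(SUBBANDS):
        band = M[lo:hi, :]                 # bins Φ_jl <= m < Φ_jh
        msp = band.max(axis=0)             # MSP(d, j), Eq. (37)
        msv[:, j] = band.min(axis=0)       # MSV(d, j), Eq. (38)
        msc[:, j] = msp - msv[:, j]        # MSC(d, j), Eq. (39)
    return msc, msv

# stand-in spectrogram: W = 512 modulation bins, D = 20 feature values
M = np.abs(np.random.default_rng(1).normal(size=(512, 20)))
msc, msv = msc_msv(M)   # two D x J = 20 x 8 matrices
```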

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig. 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u_{MSC-row}^{MFCC}(l) = (1/J) Σ_{j=0}^{J−1} MSC^{MFCC}(l, j) (40)

σ_{MSC-row}^{MFCC}(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^{MFCC}(l, j) − u_{MSC-row}^{MFCC}(l) )^2 ]^{1/2} (41)

u_{MSV-row}^{MFCC}(l) = (1/J) Σ_{j=0}^{J−1} MSV^{MFCC}(l, j) (42)

σ_{MSV-row}^{MFCC}(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^{MFCC}(l, j) − u_{MSV-row}^{MFCC}(l) )^2 ]^{1/2} (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as:

f_{row}^{MFCC} = [u_{MSC-row}^{MFCC}(0), σ_{MSC-row}^{MFCC}(0), u_{MSV-row}^{MFCC}(0), σ_{MSV-row}^{MFCC}(0), …, u_{MSC-row}^{MFCC}(L−1), σ_{MSC-row}^{MFCC}(L−1), u_{MSV-row}^{MFCC}(L−1), σ_{MSV-row}^{MFCC}(L−1)]^T (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{MFCC}(j) = (1/L) Σ_{l=0}^{L−1} MSC^{MFCC}(l, j) (45)

σ_{MSC-col}^{MFCC}(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSC^{MFCC}(l, j) − u_{MSC-col}^{MFCC}(j) )^2 ]^{1/2} (46)

u_{MSV-col}^{MFCC}(j) = (1/L) Σ_{l=0}^{L−1} MSV^{MFCC}(l, j) (47)

σ_{MSV-col}^{MFCC}(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSV^{MFCC}(l, j) − u_{MSV-col}^{MFCC}(j) )^2 ]^{1/2} (48)

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

)]1( )1( )1( )1(

)0( )0( )0( )0([Tminusminusminusminus

=

minusminusminusminus

minusminusminusminus

JJuJJu

uuMFCC

colMSVMFCC

colMSVMFCC

colMSCMFCC

colMSC

MFCCcolMSV

MFCCcolMSV

MFCCcolMSC

MFCCcolMSC

MFCCcol

σσ

σσ Lf (49)

If the row-based modulation spectral feature vector and column-based

modulation spectral feature vector are combined together a larger feature vector of

size (4D+4J) can be obtained

f MFCC= [( )MFCCrowf T ( )MFCC

colf T]T (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
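The dimension bookkeeping above can be sanity-checked with a short numpy sketch. This is illustrative only: `MSC`/`MSV` are filled with random toy values, and the concatenation below groups all row means, then row deviations, and so on, rather than interleaving them per coefficient as in Eq. (44); the resulting length is the same.

```python
import numpy as np

def aggregate(msc, msv):
    """Mean/std aggregation along rows and columns of MSC/MSV matrices."""
    feats = []
    for mat in (msc, msv):
        # row-based: mean and std across the J modulation subbands, per row
        feats.append(mat.mean(axis=1))
        feats.append(mat.std(axis=1))
    for mat in (msc, msv):
        # column-based: mean and std across the L feature dimensions, per column
        feats.append(mat.mean(axis=0))
        feats.append(mat.std(axis=0))
    return np.concatenate(feats)

L, J = 20, 8                      # 20 MFCC coefficients, 8 modulation subbands
rng = np.random.default_rng(0)
msc = rng.random((L, J))          # toy stand-ins for the real MSC/MSV matrices
msv = rng.random((L, J))
f = aggregate(msc, msv)
print(f.shape)                    # (112,) = 4L + 4J = 80 + 32
```

`np.std` uses the 1/J (population) normalization by default, matching Eqs. (41) and (43).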

2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows

u^{OSC}_{MSC,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(d, j) (51)

\sigma^{OSC}_{MSC,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(d, j) - u^{OSC}_{MSC,row}(d) \right)^2 \right)^{1/2} (52)

u^{OSC}_{MSV,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(d, j) (53)

\sigma^{OSC}_{MSV,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(d, j) - u^{OSC}_{MSV,row}(d) \right)^2 \right)^{1/2} (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{OSC}_{row} = [u^{OSC}_{MSC,row}(0), \sigma^{OSC}_{MSC,row}(0), u^{OSC}_{MSV,row}(0), \sigma^{OSC}_{MSV,row}(0), \ldots, u^{OSC}_{MSC,row}(D-1), \sigma^{OSC}_{MSC,row}(D-1), u^{OSC}_{MSV,row}(D-1), \sigma^{OSC}_{MSV,row}(D-1)]^T (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows

u^{OSC}_{MSC,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(d, j) (56)

\sigma^{OSC}_{MSC,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(d, j) - u^{OSC}_{MSC,col}(j) \right)^2 \right)^{1/2} (57)

u^{OSC}_{MSV,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(d, j) (58)

\sigma^{OSC}_{MSV,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(d, j) - u^{OSC}_{MSV,col}(j) \right)^2 \right)^{1/2} (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{OSC}_{col} = [u^{OSC}_{MSC,col}(0), \sigma^{OSC}_{MSC,col}(0), u^{OSC}_{MSV,col}(0), \sigma^{OSC}_{MSV,col}(0), \ldots, u^{OSC}_{MSC,col}(J-1), \sigma^{OSC}_{MSC,col}(J-1), u^{OSC}_{MSV,col}(J-1), \sigma^{OSC}_{MSV,col}(J-1)]^T (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows

u^{NASE}_{MSC,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(d, j) (62)

\sigma^{NASE}_{MSC,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(d, j) - u^{NASE}_{MSC,row}(d) \right)^2 \right)^{1/2} (63)

u^{NASE}_{MSV,row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(d, j) (64)

\sigma^{NASE}_{MSV,row}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(d, j) - u^{NASE}_{MSV,row}(d) \right)^2 \right)^{1/2} (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{NASE}_{row} = [u^{NASE}_{MSC,row}(0), \sigma^{NASE}_{MSC,row}(0), u^{NASE}_{MSV,row}(0), \sigma^{NASE}_{MSV,row}(0), \ldots, u^{NASE}_{MSC,row}(D-1), \sigma^{NASE}_{MSC,row}(D-1), u^{NASE}_{MSV,row}(D-1), \sigma^{NASE}_{MSV,row}(D-1)]^T (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows

u^{NASE}_{MSC,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(d, j) (67)

\sigma^{NASE}_{MSC,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(d, j) - u^{NASE}_{MSC,col}(j) \right)^2 \right)^{1/2} (68)

u^{NASE}_{MSV,col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(d, j) (69)

\sigma^{NASE}_{MSV,col}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(d, j) - u^{NASE}_{MSV,col}(j) \right)^2 \right)^{1/2} (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{NASE}_{col} = [u^{NASE}_{MSC,col}(0), \sigma^{NASE}_{MSC,col}(0), u^{NASE}_{MSV,col}(0), \sigma^{NASE}_{MSV,col}(0), \ldots, u^{NASE}_{MSC,col}(J-1), \sigma^{NASE}_{MSC,col}(J-1), u^{NASE}_{MSV,col}(J-1), \sigma^{NASE}_{MSV,col}(J-1)]^T (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained

f^{NASE} = [(f^{NASE}_{row})^T, (f^{NASE}_{col})^T]^T (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

[Fig 28 the row-based modulation spectral feature values: within a texture window, the MSC and MSV entries along each row (a fixed feature dimension, across modulation frequency) are aggregated into a mean μ_row and a standard deviation σ_row]

[Fig 29 the column-based modulation spectral feature values: the MSC and MSV entries along each column (a fixed modulation subband, across the feature dimension) are aggregated into a mean μ_col and a standard deviation σ_col]

216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n} (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m) (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
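The per-genre averaging and min-max scaling steps can be sketched as follows; the toy per-genre feature matrices are hypothetical stand-ins for the real training vectors.

```python
import numpy as np

# Toy training set: feats[c] is an (Nc, M) array of feature vectors for genre c.
rng = np.random.default_rng(1)
feats = [rng.normal(size=(5, 3)) * 10, rng.normal(size=(4, 3)) * 10]

all_train = np.vstack(feats)
f_min = all_train.min(axis=0)   # per-dimension minimum over all training signals
f_max = all_train.max(axis=0)   # per-dimension maximum over all training signals

# Representative (mean) vector per genre, then the linear min-max scaling
reps = np.array([f.mean(axis=0) for f in feats])
reps_hat = (reps - f_min) / (f_max - f_min)
print(reps_hat)                 # each entry lies in [0, 1]
```

Because each class mean lies between the global per-dimension extremes, every normalized feature value falls in [0, 1].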

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h \le H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right) (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the corresponding eigenvalues; thus S_W \Phi = \Phi \Lambda. Each training vector x is then whitening transformed by \Phi \Lambda^{-1/2}

x_w = (\Phi \Lambda^{-1/2})^T x (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x (81)
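The whitening-plus-LDA pipeline can be sketched in numpy as below. This is a simplified illustration on toy Gaussian data, assuming S_W has full rank (otherwise the \Lambda^{-1/2} step would need regularization); the names are illustrative.

```python
import numpy as np

def whitened_lda(X, labels):
    """Whitened LDA transform: scatter matrices, whitening, then projection."""
    classes = np.unique(labels)
    C, H = len(classes), X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)  # between-class
    lam, Phi = np.linalg.eigh(Sw)                     # Sw Phi = Phi Lam
    W = Phi @ np.diag(lam ** -0.5)                    # whitening matrix Phi Lam^{-1/2}
    Sb_w = W.T @ Sb @ W                               # whitened between-class scatter
    _, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, ::-1][:, :C - 1]                     # eigenvectors of the C-1 largest
    return W @ Psi                                    # A_WLDA

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c * 5, 1, size=(30, 4)) for c in range(3)])
y = np.repeat(np.arange(3), 30)
A = whitened_lda(X, y)
Y = X @ A        # project every H-dimensional vector to h = C-1 dimensions
print(A.shape, Y.shape)
```

With `np.linalg.eigh` the eigenvalues come back in ascending order, hence the column reversal before keeping the top C-1 directions.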

23 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 \le c \le C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n} (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) (83)
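The centroid averaging and minimum-distance rule can be sketched as follows, on hypothetical two-genre toy data in the LDA-transformed space.

```python
import numpy as np

def nearest_centroid(y, centroids):
    """Return the index c minimizing the Euclidean distance d(y, centroid_c)."""
    d = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(d))

# Centroids are the per-genre means of the transformed training vectors
train = {0: np.array([[0.0, 0.0], [0.2, -0.2]]),
         1: np.array([[4.0, 4.0], [3.8, 4.2]])}
centroids = np.array([v.mean(axis=0) for v in train.values()])

print(nearest_centroid(np.array([0.1, 0.1]), centroids))   # 0
print(nearest_centroid(np.array([3.5, 4.5]), centroids))   # 1
```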

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows

CA = \sum_{1 \le c \le C} P_c \cdot CA_c (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
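As a sanity check on Eq. (84), the test-set class sizes above together with the per-genre accuracies on the diagonal of Table 36(d) reproduce the reported overall accuracy:

```python
# Appearance-probability-weighted per-genre accuracy (Eq. 84).
n_c  = [320, 114, 26, 45, 102, 122]   # test tracks per genre (total 729)
ca_c = [0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623]  # Table 36(d) diagonal
total = sum(n_c)
ca = sum((n / total) * a for n, a in zip(n_c, ca_c))
print(round(ca * 100, 2))             # 85.32
```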

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64


Table 32 Confusion matrices of row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each sub-table, the first matrix gives track counts and the second the corresponding percentages; columns are the actual genres.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         275           0     2          0        1     19
Electronic        0          91     0          1        7      6
Jazz              6           0    18          0        0      4
MetalPunk         2           3     0         36       20      4
PopRock           4          12     5          8       70     14
World            33           8     1          0        4     75
Total           320         114    26         45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         292           1     1          0        2     10
Electronic        1          89     1          2       11     11
Jazz              4           0    19          1        1      6
MetalPunk         0           5     0         32       21      3
PopRock           0          13     3         10       61      8
World            23           6     2          0        6     84
Total           320         114    26         45      102    122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         286           3     1          0        3     18
Electronic        0          87     1          1        9      5
Jazz              5           4    17          0        0      9
MetalPunk         0           4     1         36       18      4
PopRock           1          10     3          7       68     13
World            28           6     3          1        4     73
Total           320         114    26         45      102    122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     1          0        0      9
Electronic        0          96     1          1        9      9
Jazz              2           1    21          0        0      1
MetalPunk         0           1     0         34        8      1
PopRock           1           9     2          9       80     16
World            17           7     1          1        5     86
Total           320         114    26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49


32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which is different from the row-based case. As in the row-based case, the combined feature vector again obtains the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 34 Confusion matrices of column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each sub-table, the first matrix gives track counts and the second the corresponding percentages; columns are the actual genres.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         272           1     1          0        6     22
Electronic        0          84     0          2        8      4
Jazz             13           1    19          1        2     19
MetalPunk         2           7     0         39       30      4
PopRock           0          11     3          3       47     19
World            33          10     3          0        9     54
Total           320         114    26         45      102    122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         262           2     0          0        3     33
Electronic        0          83     0          1        9      6
Jazz             17           1    20          0        6     20
MetalPunk         1           5     0         33       21      2
PopRock           0          17     4         10       51     10
World            40           6     2          1       12     51
Total           320         114    26         45      102    122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         277           0     0          0        2     29
Electronic        0          83     0          1        5      2
Jazz              9           3    17          1        2     15
MetalPunk         1           5     1         35       24      7
PopRock           2          13     1          8       57     15
World            31          10     7          0       12     54
Total           320         114    26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         289           5     0          0        3     18
Electronic        0          89     0          2        4      4
Jazz              2           3    19          0        1     10
MetalPunk         2           2     0         38       21      2
PopRock           0          12     5          4       61     11
World            27           3     2          1       12     77
Total           320         114    26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 31 and 33, we can see that the combined feature vector obtains a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each sub-table, the first matrix gives track counts and the second the corresponding percentages; columns are the actual genres.

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 37 Comparison of the averaged classification accuracy (%) of the MSCs & MSVs and the energy (MSE) for each feature value

Feature Set                     MSCs & MSVs    MSE
SMMFCC1                         77.50          72.02
SMMFCC2                         70.64          69.82
SMMFCC3                         80.38          79.15
SMOSC1                          79.15          77.50
SMOSC2                          68.59          70.51
SMOSC3                          81.34          80.11
SMASE1                          77.78          76.41
SMASE2                          71.74          71.06
SMASE3                          81.21          79.15
SMMFCC1+SMOSC1+SMASE1           84.64          85.08
SMMFCC2+SMOSC2+SMASE2           78.60          79.01
SMMFCC3+SMOSC3+SMASE3           85.32          85.19
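The difference between the two feature definitions can be illustrated on a single synthetic feature-value trajectory. This is a sketch on assumed toy data: the subband edges follow the logarithmic spacing of Table 24, and the peak/valley/contrast computation is applied to one window's modulation spectrum rather than the time-averaged modulation spectrogram used in the actual system.

```python
import numpy as np

W = 512                                   # texture window length
t = np.arange(W)
# One feature-value trajectory: an 8-cycle "beat" component plus noise
traj = np.sin(2 * np.pi * 8 * t / W) + 0.1 * np.random.default_rng(3).normal(size=W)
M = np.abs(np.fft.rfft(traj))             # modulation spectrum magnitude

# Logarithmically spaced subband edges (modulation-frequency bin indices)
edges = [0, 2, 4, 8, 16, 32, 64, 128, 256]
bands = list(zip(edges[:-1], edges[1:]))
msp = [M[lo:hi].max() for lo, hi in bands]          # modulation spectral peaks
msv = [M[lo:hi].min() for lo, hi in bands]          # modulation spectral valleys
msc = [p - v for p, v in zip(msp, msv)]             # contrast = peak - valley
mse = [np.sum(M[lo:hi] ** 2) for lo, hi in bands]   # conventional subband energy
print(int(np.argmax(msc)))                # 3: subband [8, 16) holds the 8-cycle component
```

The contrast features separate the dominant rhythmic peak from the non-rhythmic floor within each subband, whereas the energy baseline mixes both.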

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "'The way it sounds': timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.

[13] J. Jose Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo, A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.

Page 35: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

29

2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE the same modulation spectrum

analysis is applied to the NASE feature values Fig 27 shows the flowchart for

extracting MASE and the detailed steps will be described below

Step 1 Framing and NASE Extraction

Given an input music signal divide the whole music signal into successive

overlapped frames and extract the NASE coefficients of each frame

Step 2 Modulation Spectrum Analysis

Let be the d-th NASE of the i-th frame The

modulation spectrogram is obtained by applying FFT independently on each

feature value along the time trajectory within a texture window of length W

][dNASEi Dd ltle0

0 0 )()(1

0

2

)2( DdWmedNASEdmMW

n

mWnj

nWtt ltleltle= summinus

=

minus

+times

π (35)

where Mt(m d) is the modulation spectrogram for the t-th texture window m

is the modulation frequency index and d is the NASE coefficient index In

the study W is 512 which is about 6 seconds with 50 overlap between

two successive texture windows The representative modulation spectrogram

of a music track is derived by time averaging the magnitude modulation

spectrograms of all texture windows

0 0 )(1)(1

DdWmdmMT

dmMT

tt

NASE ltleltle= sum=

(36)

where T is the total number of texture windows in the music track

Step 3 ContrastValley Determination

The averaged modulation spectrum of each feature value will be

decomposed into J logarithmically spaced modulation subbands(See Table2

30

In the study the number of modulation subbands is 8 (J = 8) The frequency

interval of each modulation subband is shown in Table 24 For each feature

value the modulation spectral peak (MSP) and modulation spectral valley

(MSV) within each modulation subband are then evaluated

( ))(max)(

dmMdjMSP NASE

ΦmΦ

NASE

hjlj ltle= (37)

( ))(min)(

dmMdjMSV NASE

ΦmΦ

NASE

hjlj ltle= (38)

where Φjl and Φjh are respectively the low modulation frequency index and

high modulation frequency index of the j-th modulation subband 0 le j lt J

The MSPs correspond to the dominant rhythmic components and MSVs the

non-rhythmic components in the modulation subbands Therefore the

difference between MSP and MSV will reflect the modulation spectral

contrast distribution

(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=

As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the

modulation spectral contrast information Therefore the feature dimension of

MMFCC is 2times19times8 = 304

31

WindowingAverage

Modulation Spectrum

ContrastValleyDetermination

DFT

NASE extraction

Framing

M1d[m]

M2d[m]

MTd[m]

M3d[m]

MT-1d[m]

MD[m]

NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]

sI[n]sI-1[n]s1[n] s3[n]s2[n]

Music signal

NASE

M1[m]

M2[m]

M3[m]

MD-1[m]

Fig 27 the flowchart for extracting MASE

Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 29).

To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u^{MSC-row}_{MFCC}(l) = (1/J) · Σ_{j=0}^{J−1} MSC_{MFCC}(j, l)        (40)

σ^{MSC-row}_{MFCC}(l) = ( (1/J) · Σ_{j=0}^{J−1} (MSC_{MFCC}(j, l) − u^{MSC-row}_{MFCC}(l))² )^{1/2}        (41)

u^{MSV-row}_{MFCC}(l) = (1/J) · Σ_{j=0}^{J−1} MSV_{MFCC}(j, l)        (42)

σ^{MSV-row}_{MFCC}(l) = ( (1/J) · Σ_{j=0}^{J−1} (MSV_{MFCC}(j, l) − u^{MSV-row}_{MFCC}(l))² )^{1/2}        (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as:

f^{MFCC}_{row} = [u^{MSC-row}_{MFCC}(0), σ^{MSC-row}_{MFCC}(0), u^{MSV-row}_{MFCC}(0), σ^{MSV-row}_{MFCC}(0), …,
                  u^{MSC-row}_{MFCC}(L−1), σ^{MSC-row}_{MFCC}(L−1), u^{MSV-row}_{MFCC}(L−1), σ^{MSV-row}_{MFCC}(L−1)]^T        (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{MSC-col}_{MFCC}(j) = (1/L) · Σ_{l=0}^{L−1} MSC_{MFCC}(j, l)        (45)

σ^{MSC-col}_{MFCC}(j) = ( (1/L) · Σ_{l=0}^{L−1} (MSC_{MFCC}(j, l) − u^{MSC-col}_{MFCC}(j))² )^{1/2}        (46)

u^{MSV-col}_{MFCC}(j) = (1/L) · Σ_{l=0}^{L−1} MSV_{MFCC}(j, l)        (47)

σ^{MSV-col}_{MFCC}(j) = ( (1/L) · Σ_{l=0}^{L−1} (MSV_{MFCC}(j, l) − u^{MSV-col}_{MFCC}(j))² )^{1/2}        (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as:

f^{MFCC}_{col} = [u^{MSC-col}_{MFCC}(0), σ^{MSC-col}_{MFCC}(0), u^{MSV-col}_{MFCC}(0), σ^{MSV-col}_{MFCC}(0), …,
                  u^{MSC-col}_{MFCC}(J−1), σ^{MSC-col}_{MFCC}(J−1), u^{MSV-col}_{MFCC}(J−1), σ^{MSV-col}_{MFCC}(J−1)]^T        (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T]^T        (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
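The row- and column-wise statistics of Eqs. (40)-(50) reduce each pair of MSC/MSV matrices to a 4L+4J vector. A sketch of this aggregation in NumPy follows (a sketch only: the exact interleaving of the entries in Eqs. (44) and (49) is not preserved here, which is harmless as long as one ordering is used consistently; `np.std` with its default `ddof=0` matches the 1/J and 1/L factors inside the square roots):

```python
import numpy as np

def aggregate(msc, msv):
    """Mean/std along rows and columns of MSC and MSV, Eqs. (40)-(50).

    msc, msv : (J, L) matrices (J modulation subbands, L feature values).
    Returns a feature vector of length 4L + 4J
    (e.g. 4*20 + 4*8 = 112 for SMMFCC).
    """
    f_row = np.concatenate([msc.mean(axis=0), msc.std(axis=0),   # Eqs. (40)-(41)
                            msv.mean(axis=0), msv.std(axis=0)])  # Eqs. (42)-(43)
    f_col = np.concatenate([msc.mean(axis=1), msc.std(axis=1),   # Eqs. (45)-(46)
                            msv.mean(axis=1), msv.std(axis=1)])  # Eqs. (47)-(48)
    return np.concatenate([f_row, f_col])                        # Eq. (50)
```

The same function serves SMOSC and SMASE below; only the number of columns L (20 for MFCC/OSC, 19 for NASE) changes.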

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u^{MSC-row}_{OSC}(d) = (1/J) · Σ_{j=0}^{J−1} MSC_{OSC}(j, d)        (51)

σ^{MSC-row}_{OSC}(d) = ( (1/J) · Σ_{j=0}^{J−1} (MSC_{OSC}(j, d) − u^{MSC-row}_{OSC}(d))² )^{1/2}        (52)

u^{MSV-row}_{OSC}(d) = (1/J) · Σ_{j=0}^{J−1} MSV_{OSC}(j, d)        (53)

σ^{MSV-row}_{OSC}(d) = ( (1/J) · Σ_{j=0}^{J−1} (MSV_{OSC}(j, d) − u^{MSV-row}_{OSC}(d))² )^{1/2}        (54)


Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as:

f^{OSC}_{row} = [u^{MSC-row}_{OSC}(0), σ^{MSC-row}_{OSC}(0), u^{MSV-row}_{OSC}(0), σ^{MSV-row}_{OSC}(0), …,
                 u^{MSC-row}_{OSC}(D−1), σ^{MSC-row}_{OSC}(D−1), u^{MSV-row}_{OSC}(D−1), σ^{MSV-row}_{OSC}(D−1)]^T        (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{MSC-col}_{OSC}(j) = (1/D) · Σ_{d=0}^{D−1} MSC_{OSC}(j, d)        (56)

σ^{MSC-col}_{OSC}(j) = ( (1/D) · Σ_{d=0}^{D−1} (MSC_{OSC}(j, d) − u^{MSC-col}_{OSC}(j))² )^{1/2}        (57)

u^{MSV-col}_{OSC}(j) = (1/D) · Σ_{d=0}^{D−1} MSV_{OSC}(j, d)        (58)

σ^{MSV-col}_{OSC}(j) = ( (1/D) · Σ_{d=0}^{D−1} (MSV_{OSC}(j, d) − u^{MSV-col}_{OSC}(j))² )^{1/2}        (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as:

f^{OSC}_{col} = [u^{MSC-col}_{OSC}(0), σ^{MSC-col}_{OSC}(0), u^{MSV-col}_{OSC}(0), σ^{MSV-col}_{OSC}(0), …,
                 u^{MSC-col}_{OSC}(J−1), σ^{MSC-col}_{OSC}(J−1), u^{MSV-col}_{OSC}(J−1), σ^{MSV-col}_{OSC}(J−1)]^T        (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T        (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u^{MSC-row}_{NASE}(d) = (1/J) · Σ_{j=0}^{J−1} MSC_{NASE}(j, d)        (62)

σ^{MSC-row}_{NASE}(d) = ( (1/J) · Σ_{j=0}^{J−1} (MSC_{NASE}(j, d) − u^{MSC-row}_{NASE}(d))² )^{1/2}        (63)

u^{MSV-row}_{NASE}(d) = (1/J) · Σ_{j=0}^{J−1} MSV_{NASE}(j, d)        (64)

σ^{MSV-row}_{NASE}(d) = ( (1/J) · Σ_{j=0}^{J−1} (MSV_{NASE}(j, d) − u^{MSV-row}_{NASE}(d))² )^{1/2}        (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as:

f^{NASE}_{row} = [u^{MSC-row}_{NASE}(0), σ^{MSC-row}_{NASE}(0), u^{MSV-row}_{NASE}(0), σ^{MSV-row}_{NASE}(0), …,
                  u^{MSC-row}_{NASE}(D−1), σ^{MSC-row}_{NASE}(D−1), u^{MSV-row}_{NASE}(D−1), σ^{MSV-row}_{NASE}(D−1)]^T        (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{MSC-col}_{NASE}(j) = (1/D) · Σ_{d=0}^{D−1} MSC_{NASE}(j, d)        (67)

σ^{MSC-col}_{NASE}(j) = ( (1/D) · Σ_{d=0}^{D−1} (MSC_{NASE}(j, d) − u^{MSC-col}_{NASE}(j))² )^{1/2}        (68)

u^{MSV-col}_{NASE}(j) = (1/D) · Σ_{d=0}^{D−1} MSV_{NASE}(j, d)        (69)

σ^{MSV-col}_{NASE}(j) = ( (1/D) · Σ_{d=0}^{D−1} (MSV_{NASE}(j, d) − u^{MSV-col}_{NASE}(j))² )^{1/2}        (70)


Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as:

f^{NASE}_{col} = [u^{MSC-col}_{NASE}(0), σ^{MSC-col}_{NASE}(0), u^{MSV-col}_{NASE}(0), σ^{MSV-col}_{NASE}(0), …,
                  u^{MSC-col}_{NASE}(J−1), σ^{MSC-col}_{NASE}(J−1), u^{MSV-col}_{NASE}(J−1), σ^{MSV-col}_{NASE}(J−1)]^T        (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f^{NASE}_{row})^T, (f^{NASE}_{col})^T]^T        (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.


[Figure: the MSC(j, d)/MSV(j, d) matrix of one texture-window feature set, with axes labeled Modulation Frequency and Feature Dimension; the mean μ_row,d and standard deviation σ_row,d are computed along each row, d = 1, …, D]

Fig. 28 The row-based modulation spectral feature values

[Figure: the same MSC(j, d)/MSV(j, d) matrix; the mean μ_col,j and standard deviation σ_col,j are computed along each column, j = 1, …, J]

Fig. 29 The column-based modulation spectral feature values


2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) · Σ_{n=1}^{N_c} f_{c,n}        (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)),   1 ≤ c ≤ C        (74)

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1 ≤ c ≤ C, 1 ≤ j ≤ N_c} f_{c,j}(m)

f_min(m) = min_{1 ≤ c ≤ C, 1 ≤ j ≤ N_c} f_{c,j}(m)        (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
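A minimal sketch of the linear normalization of Eqs. (74)-(75), with the minimum and maximum taken per feature dimension over all training vectors (the function names are my own):

```python
import numpy as np

def minmax_bounds(train):
    """f_min(m) and f_max(m) over all training vectors, Eq. (75).

    train : (N, M) matrix, one row per training feature vector.
    """
    return train.min(axis=0), train.max(axis=0)

def normalize(f, f_min, f_max):
    """Linear normalization of Eq. (74); maps each training feature
    into [0, 1] (test vectors may fall slightly outside that range)."""
    return (f - f_min) / (f_max - f_min)
```

Note that the bounds must be estimated on the training set only and then reused unchanged for the test tracks, otherwise the evaluation would leak test statistics into the features.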

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as:

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T        (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by:

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T        (77)

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr((A^T S_W A)^{−1} (A^T S_B A))        (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues. Thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{−1/2}:

w = (ΦΛ^{−1/2})^T x        (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ        (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x        (81)
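The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched with plain NumPy eigendecompositions. This is a sketch under the assumption that S_W is nonsingular (no regularization is shown), and the function name is my own:

```python
import numpy as np

def whitened_lda(X, labels, h):
    """Whitened LDA transformation matrix A_WLDA, Eqs. (76)-(80).

    X : (N, H) training vectors; labels : (N,) integer class indices.
    h : output dimension, at most C - 1.
    Project a vector x with y = A.T @ x, Eq. (81).
    """
    H = X.shape[1]
    x_bar = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in np.unique(labels):
        Xc = X[labels == c]
        xc_bar = Xc.mean(axis=0)
        Sw += (Xc - xc_bar).T @ (Xc - xc_bar)      # Eq. (76)
        d = (xc_bar - x_bar)[:, None]
        Sb += len(Xc) * (d @ d.T)                  # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                  # Sw Phi = Phi diag(lam)
    white = Phi @ np.diag(lam ** -0.5)             # Phi Lam^{-1/2}, the whitener
    Sb_w = white.T @ Sb @ white                    # whitened between-class scatter
    mu, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(mu)[::-1][:h]]         # top-h eigenvectors
    return white @ Psi                             # Eq. (80)
```

Because the whitener maps S_W to the identity, the second eigendecomposition of S_B^w alone carries all the discriminative directions, which is exactly why the two-step form is equivalent to maximizing the Fisher criterion of Eq. (78).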

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

ȳ_c = (1/N_c) · Σ_{n=1}^{N_c} y_{c,n}        (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = argmin_{1 ≤ c ≤ C} d(y, ȳ_c)        (83)
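The nearest-centroid decision of Eqs. (82)-(83) takes only a few lines (a sketch; function names are mine):

```python
import numpy as np

def genre_centroids(Y, labels):
    """Representative vector per genre, Eq. (82).

    Y : (N, h) whitened-LDA-transformed training vectors;
    labels : (N,) genre indices.  Returns {genre: centroid}.
    """
    return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(y, reps):
    """Eq. (83): return the genre whose centroid is nearest to y
    in Euclidean distance."""
    return min(reps, key=lambda c: np.linalg.norm(y - reps[c]))
```

The Euclidean metric is meaningful here precisely because the whitening step has already equalized the within-class spread along every direction.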

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{1 ≤ c ≤ C} P_c · CA_c        (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
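Since P_c = N_c/N and CA_c is the per-genre hit rate, Eq. (84) reduces to the trace of the confusion matrix divided by the total track count. A sketch using a counts matrix laid out as in the tables below (true genres in columns; the function name is mine):

```python
import numpy as np

def overall_accuracy(conf):
    """Weighted overall accuracy CA of Eq. (84).

    conf[i, j] : number of tracks of true genre j classified as genre i.
    """
    per_genre = conf.diagonal() / conf.sum(axis=0)   # CA_c, per-genre hit rate
    weights = conf.sum(axis=0) / conf.sum()          # P_c, genre priors
    return float((weights * per_genre).sum())        # equals trace(conf) / total
```

Writing it as the weighted sum makes explicit that genres with more test tracks (e.g. Classical with 320) dominate the overall figure.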

3.1 Comparison of row-based modulation spectral feature vectors

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                       CA (%)
SMMFCC1                           77.50
SMOSC1                            79.15
SMASE1                            77.78
SMMFCC1+SMOSC1+SMASE1             84.64


Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Columns are the true genres; entries are track counts with the percentage of the column total in parentheses.

(a) SMMFCC1    Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        275 (85.94)    0 (0.00)     2 (7.69)     0 (0.00)     1 (0.98)    19 (15.57)
Electronic       0 (0.00)    91 (79.82)    0 (0.00)     1 (2.22)     7 (6.86)     6 (4.92)
Jazz             6 (1.88)     0 (0.00)    18 (69.23)    0 (0.00)     0 (0.00)     4 (3.28)
Metal/Punk       2 (0.63)     3 (2.63)     0 (0.00)    36 (80.00)   20 (19.61)    4 (3.28)
Pop/Rock         4 (1.25)    12 (10.53)    5 (19.23)    8 (17.78)   70 (68.63)   14 (11.48)
World           33 (10.31)    8 (7.02)     1 (3.85)     0 (0.00)     4 (3.92)    75 (61.48)
Total          320           114           26           45          102          122

(b) SMOSC1     Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        292 (91.25)    1 (0.88)     1 (3.85)     0 (0.00)     2 (1.96)    10 (8.20)
Electronic       1 (0.31)    89 (78.07)    1 (3.85)     2 (4.44)    11 (10.78)   11 (9.02)
Jazz             4 (1.25)     0 (0.00)    19 (73.08)    1 (2.22)     1 (0.98)     6 (4.92)
Metal/Punk       0 (0.00)     5 (4.39)     0 (0.00)    32 (71.11)   21 (20.59)    3 (2.46)
Pop/Rock         0 (0.00)    13 (11.40)    3 (11.54)   10 (22.22)   61 (59.80)    8 (6.56)
World           23 (7.19)     6 (5.26)     2 (7.69)     0 (0.00)     6 (5.88)    84 (68.85)
Total          320           114           26           45          102          122

(c) SMASE1     Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        286 (89.38)    3 (2.63)     1 (3.85)     0 (0.00)     3 (2.94)    18 (14.75)
Electronic       0 (0.00)    87 (76.32)    1 (3.85)     1 (2.22)     9 (8.82)     5 (4.10)
Jazz             5 (1.56)     4 (3.51)    17 (65.38)    0 (0.00)     0 (0.00)     9 (7.38)
Metal/Punk       0 (0.00)     4 (3.51)     1 (3.85)    36 (80.00)   18 (17.65)    4 (3.28)
Pop/Rock         1 (0.31)    10 (8.77)     3 (11.54)    7 (15.56)   68 (66.67)   13 (10.66)
World           28 (8.75)     6 (5.26)     3 (11.54)    1 (2.22)     4 (3.92)    73 (59.84)
Total          320           114           26           45          102          122

(d) SMMFCC1+SMOSC1+SMASE1
               Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        300 (93.75)    0 (0.00)     1 (3.85)     0 (0.00)     0 (0.00)     9 (7.38)
Electronic       0 (0.00)    96 (84.21)    1 (3.85)     1 (2.22)     9 (8.82)     9 (7.38)
Jazz             2 (0.63)     1 (0.88)    21 (80.77)    0 (0.00)     0 (0.00)     1 (0.82)
Metal/Punk       0 (0.00)     1 (0.88)     0 (0.00)    34 (75.56)    8 (7.84)     1 (0.82)
Pop/Rock         1 (0.31)     9 (7.89)     2 (7.69)     9 (20.00)   80 (78.43)   16 (13.11)
World           17 (5.31)     7 (6.14)     1 (3.85)     1 (2.22)     5 (4.90)    86 (70.49)
Total          320           114           26           45          102          122


3.2 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again gets the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                       CA (%)
SMMFCC2                           70.64
SMOSC2                            68.59
SMASE2                            71.74
SMMFCC2+SMOSC2+SMASE2             78.60

Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns are the true genres; entries are track counts with the percentage of the column total in parentheses.

(a) SMMFCC2    Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        272 (85.00)    1 (0.88)     1 (3.85)     0 (0.00)     6 (5.88)    22 (18.03)
Electronic       0 (0.00)    84 (73.68)    0 (0.00)     2 (4.44)     8 (7.84)     4 (3.28)
Jazz            13 (4.06)     1 (0.88)    19 (73.08)    1 (2.22)     2 (1.96)    19 (15.57)
Metal/Punk       2 (0.63)     7 (6.14)     0 (0.00)    39 (86.67)   30 (29.41)    4 (3.28)
Pop/Rock         0 (0.00)    11 (9.65)     3 (11.54)    3 (6.67)    47 (46.08)   19 (15.57)
World           33 (10.31)   10 (8.77)     3 (11.54)    0 (0.00)     9 (8.82)    54 (44.26)
Total          320           114           26           45          102          122

(b) SMOSC2     Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        262 (81.88)    2 (1.75)     0 (0.00)     0 (0.00)     3 (2.94)    33 (27.05)
Electronic       0 (0.00)    83 (72.81)    0 (0.00)     1 (2.22)     9 (8.82)     6 (4.92)
Jazz            17 (5.31)     1 (0.88)    20 (76.92)    0 (0.00)     6 (5.88)    20 (16.39)
Metal/Punk       1 (0.31)     5 (4.39)     0 (0.00)    33 (73.33)   21 (20.59)    2 (1.64)
Pop/Rock         0 (0.00)    17 (14.91)    4 (15.38)   10 (22.22)   51 (50.00)   10 (8.20)
World           40 (12.50)    6 (5.26)     2 (7.69)     1 (2.22)    12 (11.76)   51 (41.80)
Total          320           114           26           45          102          122

(c) SMASE2     Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        277 (86.56)    0 (0.00)     0 (0.00)     0 (0.00)     2 (1.96)    29 (23.77)
Electronic       0 (0.00)    83 (72.81)    0 (0.00)     1 (2.22)     5 (4.90)     2 (1.64)
Jazz             9 (2.81)     3 (2.63)    17 (65.38)    1 (2.22)     2 (1.96)    15 (12.30)
Metal/Punk       1 (0.31)     5 (4.39)     1 (3.85)    35 (77.78)   24 (23.53)    7 (5.74)
Pop/Rock         2 (0.63)    13 (11.40)    1 (3.85)     8 (17.78)   57 (55.88)   15 (12.30)
World           31 (9.69)    10 (8.77)     7 (26.92)    0 (0.00)    12 (11.76)   54 (44.26)
Total          320           114           26           45          102          122

(d) SMMFCC2+SMOSC2+SMASE2
               Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        289 (90.31)    5 (4.39)     0 (0.00)     0 (0.00)     3 (2.94)    18 (14.75)
Electronic       0 (0.00)    89 (78.07)    0 (0.00)     2 (4.44)     4 (3.92)     4 (3.28)
Jazz             2 (0.63)     3 (2.63)    19 (73.08)    0 (0.00)     1 (0.98)    10 (8.20)
Metal/Punk       2 (0.63)     2 (1.75)     0 (0.00)    38 (84.44)   21 (20.59)    2 (1.64)
Pop/Rock         0 (0.00)    12 (10.53)    5 (19.23)    4 (8.89)    61 (59.80)   11 (9.02)
World           27 (8.44)     3 (2.63)     2 (7.69)     1 (2.22)    12 (11.76)   77 (63.11)
Total          320           114           26           45          102          122

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vectors get better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                       CA (%)
SMMFCC3                           80.38
SMOSC3                            81.34
SMASE3                            81.21
SMMFCC3+SMOSC3+SMASE3             85.32

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns are the true genres; entries are track counts with the percentage of the column total in parentheses.

(a) SMMFCC3    Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        300 (93.75)    2 (1.75)     1 (3.85)     0 (0.00)     3 (2.94)    19 (15.57)
Electronic       0 (0.00)    86 (75.44)    0 (0.00)     1 (2.22)     7 (6.86)     5 (4.10)
Jazz             2 (0.63)     0 (0.00)    18 (69.23)    0 (0.00)     0 (0.00)     3 (2.46)
Metal/Punk       1 (0.31)     4 (3.51)     0 (0.00)    35 (77.78)   18 (17.65)    2 (1.64)
Pop/Rock         1 (0.31)    16 (14.04)    4 (15.38)    8 (17.78)   67 (65.69)   13 (10.66)
World           16 (5.00)     6 (5.26)     3 (11.54)    1 (2.22)     7 (6.86)    80 (65.57)
Total          320           114           26           45          102          122

(b) SMOSC3     Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        300 (93.75)    0 (0.00)     0 (0.00)     0 (0.00)     1 (0.98)    13 (10.66)
Electronic       0 (0.00)    90 (78.95)    1 (3.85)     2 (4.44)     9 (8.82)     6 (4.92)
Jazz             0 (0.00)     0 (0.00)    21 (80.77)    0 (0.00)     0 (0.00)     4 (3.28)
Metal/Punk       0 (0.00)     2 (1.75)     0 (0.00)    31 (68.89)   21 (20.59)    2 (1.64)
Pop/Rock         0 (0.00)    11 (9.65)     3 (11.54)   10 (22.22)   64 (62.75)   10 (8.20)
World           20 (6.25)    11 (9.65)     1 (3.85)     2 (4.44)     7 (6.86)    87 (71.31)
Total          320           114           26           45          102          122

(c) SMASE3     Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        296 (92.50)    2 (1.75)     1 (3.85)     0 (0.00)     0 (0.00)    17 (13.93)
Electronic       1 (0.31)    91 (79.82)    0 (0.00)     1 (2.22)     4 (3.92)     3 (2.46)
Jazz             0 (0.00)     2 (1.75)    19 (73.08)    0 (0.00)     0 (0.00)     5 (4.10)
Metal/Punk       0 (0.00)     2 (1.75)     1 (3.85)    34 (75.56)   20 (19.61)    8 (6.56)
Pop/Rock         2 (0.63)    13 (11.40)    4 (15.38)    8 (17.78)   71 (69.61)    8 (6.56)
World           21 (6.56)     4 (3.51)     1 (3.85)     2 (4.44)     7 (6.86)    81 (66.39)
Total          320           114           26           45          102          122

(d) SMMFCC3+SMOSC3+SMASE3
               Classic       Electronic   Jazz         Metal/Punk   Pop/Rock     World
Classic        300 (93.75)    2 (1.75)     0 (0.00)     0 (0.00)     0 (0.00)     8 (6.56)
Electronic       2 (0.63)    95 (83.33)    0 (0.00)     2 (4.44)     7 (6.86)     9 (7.38)
Jazz             1 (0.31)     1 (0.88)    20 (76.92)    0 (0.00)     0 (0.00)     0 (0.00)
Metal/Punk       0 (0.00)     0 (0.00)     0 (0.00)    35 (77.78)   10 (9.80)     1 (0.82)
Pop/Rock         1 (0.31)    10 (8.77)     3 (11.54)    7 (15.56)   79 (77.45)   11 (9.02)
World           16 (5.00)     6 (5.26)     3 (11.54)    1 (2.22)     6 (5.88)    93 (76.23)
Total          320           114           26           45          102          122

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature value

Feature Set                       MSCs & MSVs   MSE
SMMFCC1                           77.50         72.02
SMMFCC2                           70.64         69.82
SMMFCC3                           80.38         79.15
SMOSC1                            79.15         77.50
SMOSC2                            68.59         70.51
SMOSC3                            81.34         80.11
SMASE1                            77.78         76.41
SMASE2                            71.74         71.06
SMASE3                            81.21         79.15
SMMFCC1+SMOSC1+SMASE1             84.64         85.08
SMMFCC2+SMOSC2+SMASE2             78.60         79.01
SMMFCC3+SMOSC3+SMASE3             85.32         85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of the ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.
[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histograms in audio and symbolic music information retrieval," Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, March 2006.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.

Page 36: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

30

In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$MSP^{NASE}(j,d) = \max_{\Phi_{j,l}\le m<\Phi_{j,h}} M_d^{NASE}(m)$  (37)

$MSV^{NASE}(j,d) = \min_{\Phi_{j,l}\le m<\Phi_{j,h}} M_d^{NASE}(m)$  (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{NASE}(j,d) = MSP^{NASE}(j,d) - MSV^{NASE}(j,d)$  (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.

[Figure: the music signal is framed (s_1[n] … s_I[n]), NASE features NASE_1[d] … NASE_I[d] are extracted per frame, windowed and averaged, a DFT yields the modulation spectra M_1[m] … M_D[m], and contrast/valley determination follows.]

Fig. 2.7 The flowchart for extracting MASE

Table 2.4 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2) | [0, 0.33)
1 | [2, 4) | [0.33, 0.66)
2 | [4, 8) | [0.66, 1.32)
3 | [8, 16) | [1.32, 2.64)
4 | [16, 32) | [2.64, 5.28)
5 | [32, 64) | [5.28, 10.56)
6 | [64, 128) | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]
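As a concrete illustration, the subband peak/valley/contrast computation of Eqs. (37)-(39) over the dyadic subbands of Table 2.4 can be sketched in Python. This is a minimal sketch assuming a precomputed D×256 modulation spectrogram stored as a NumPy array; the function names are illustrative, not from the thesis:

```python
import numpy as np

def dyadic_subband_edges(n_subbands=8):
    """Dyadic modulation-frequency index ranges of Table 2.4:
    [0, 2), [2, 4), [4, 8), ..., [128, 256)."""
    edges, lo = [(0, 2)], 2
    for _ in range(n_subbands - 1):
        edges.append((lo, 2 * lo))
        lo *= 2
    return edges

def msc_msv(mod_spec, n_subbands=8):
    """mod_spec: D x 256 modulation spectrogram (one row per feature
    dimension).  Returns MSP, MSV and MSC = MSP - MSV, each of shape
    D x J (Eqs. 37-39)."""
    edges = dyadic_subband_edges(n_subbands)
    msp = np.stack([mod_spec[:, lo:hi].max(axis=1) for lo, hi in edges], axis=1)
    msv = np.stack([mod_spec[:, lo:hi].min(axis=1) for lo, hi in edges], axis=1)
    return msp, msv, msp - msv
```

Since each subband contrast is a max minus a min over the same index range, every MSC entry is non-negative by construction.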

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$u^{MFCC}_{MSC\text{-}row}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j,l)$  (40)

$\sigma^{MFCC}_{MSC\text{-}row}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{MFCC}(j,l) - u^{MFCC}_{MSC\text{-}row}(l)\bigr)^2\right)^{1/2}$  (41)

$u^{MFCC}_{MSV\text{-}row}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j,l)$  (42)

$\sigma^{MFCC}_{MSV\text{-}row}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{MFCC}(j,l) - u^{MFCC}_{MSV\text{-}row}(l)\bigr)^2\right)^{1/2}$  (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$f^{MFCC}_{row} = [u^{MFCC}_{MSC\text{-}row}(0), \sigma^{MFCC}_{MSC\text{-}row}(0), u^{MFCC}_{MSV\text{-}row}(0), \sigma^{MFCC}_{MSV\text{-}row}(0), \ldots, u^{MFCC}_{MSC\text{-}row}(L-1), \sigma^{MFCC}_{MSC\text{-}row}(L-1), u^{MFCC}_{MSV\text{-}row}(L-1), \sigma^{MFCC}_{MSV\text{-}row}(L-1)]^{T}$  (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$u^{MFCC}_{MSC\text{-}col}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j,l)$  (45)

$\sigma^{MFCC}_{MSC\text{-}col}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSC^{MFCC}(j,l) - u^{MFCC}_{MSC\text{-}col}(j)\bigr)^2\right)^{1/2}$  (46)

$u^{MFCC}_{MSV\text{-}col}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j,l)$  (47)

$\sigma^{MFCC}_{MSV\text{-}col}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSV^{MFCC}(j,l) - u^{MFCC}_{MSV\text{-}col}(j)\bigr)^2\right)^{1/2}$  (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f^{MFCC}_{col} = [u^{MFCC}_{MSC\text{-}col}(0), \sigma^{MFCC}_{MSC\text{-}col}(0), u^{MFCC}_{MSV\text{-}col}(0), \sigma^{MFCC}_{MSV\text{-}col}(0), \ldots, u^{MFCC}_{MSC\text{-}col}(J-1), \sigma^{MFCC}_{MSC\text{-}col}(J-1), u^{MFCC}_{MSV\text{-}col}(J-1), \sigma^{MFCC}_{MSV\text{-}col}(J-1)]^{T}$  (49)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) can be obtained:

$f^{MFCC} = [(f^{MFCC}_{row})^{T}, (f^{MFCC}_{col})^{T}]^{T}$  (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
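The aggregation of Eqs. (40)-(50) amounts to taking means and standard deviations along the two axes of the MSC and MSV matrices. A minimal NumPy sketch follows; the exact ordering of entries inside the final vector is a convention of this sketch, not necessarily the thesis's layout:

```python
import numpy as np

def aggregate_msc_msv(msc, msv):
    """msc, msv: J x L matrices (modulation subband x feature dimension).
    Returns the size-(4L + 4J) statistically aggregated feature vector:
    the row-based part holds mean/std over the J subbands for each of
    the L feature dimensions (4L values); the column-based part holds
    mean/std over the L feature dimensions for each subband (4J values)."""
    row_part = np.concatenate([msc.mean(0), msc.std(0), msv.mean(0), msv.std(0)])
    col_part = np.concatenate([msc.mean(1), msc.std(1), msv.mean(1), msv.std(1)])
    return np.concatenate([row_part, col_part])
```

With J = 8 and L = 20 (MFCC or OSC) this yields the 112-dimensional SMMFCC/SMOSC vector; with D = 19 (NASE) it yields the 108-dimensional SMASE vector.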

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$u^{OSC}_{MSC\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j,d)$  (51)

$\sigma^{OSC}_{MSC\text{-}row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j,d) - u^{OSC}_{MSC\text{-}row}(d)\bigr)^2\right)^{1/2}$  (52)

$u^{OSC}_{MSV\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j,d)$  (53)

$\sigma^{OSC}_{MSV\text{-}row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j,d) - u^{OSC}_{MSV\text{-}row}(d)\bigr)^2\right)^{1/2}$  (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$f^{OSC}_{row} = [u^{OSC}_{MSC\text{-}row}(0), \sigma^{OSC}_{MSC\text{-}row}(0), u^{OSC}_{MSV\text{-}row}(0), \sigma^{OSC}_{MSV\text{-}row}(0), \ldots, u^{OSC}_{MSC\text{-}row}(D-1), \sigma^{OSC}_{MSC\text{-}row}(D-1), u^{OSC}_{MSV\text{-}row}(D-1), \sigma^{OSC}_{MSV\text{-}row}(D-1)]^{T}$  (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$u^{OSC}_{MSC\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j,d)$  (56)

$\sigma^{OSC}_{MSC\text{-}col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j,d) - u^{OSC}_{MSC\text{-}col}(j)\bigr)^2\right)^{1/2}$  (57)

$u^{OSC}_{MSV\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j,d)$  (58)

$\sigma^{OSC}_{MSV\text{-}col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j,d) - u^{OSC}_{MSV\text{-}col}(j)\bigr)^2\right)^{1/2}$  (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f^{OSC}_{col} = [u^{OSC}_{MSC\text{-}col}(0), \sigma^{OSC}_{MSC\text{-}col}(0), u^{OSC}_{MSV\text{-}col}(0), \sigma^{OSC}_{MSV\text{-}col}(0), \ldots, u^{OSC}_{MSC\text{-}col}(J-1), \sigma^{OSC}_{MSC\text{-}col}(J-1), u^{OSC}_{MSV\text{-}col}(J-1), \sigma^{OSC}_{MSV\text{-}col}(J-1)]^{T}$  (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$f^{OSC} = [(f^{OSC}_{row})^{T}, (f^{OSC}_{col})^{T}]^{T}$  (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$u^{NASE}_{MSC\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j,d)$  (62)

$\sigma^{NASE}_{MSC\text{-}row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{NASE}(j,d) - u^{NASE}_{MSC\text{-}row}(d)\bigr)^2\right)^{1/2}$  (63)

$u^{NASE}_{MSV\text{-}row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j,d)$  (64)

$\sigma^{NASE}_{MSV\text{-}row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{NASE}(j,d) - u^{NASE}_{MSV\text{-}row}(d)\bigr)^2\right)^{1/2}$  (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$f^{NASE}_{row} = [u^{NASE}_{MSC\text{-}row}(0), \sigma^{NASE}_{MSC\text{-}row}(0), u^{NASE}_{MSV\text{-}row}(0), \sigma^{NASE}_{MSV\text{-}row}(0), \ldots, u^{NASE}_{MSC\text{-}row}(D-1), \sigma^{NASE}_{MSC\text{-}row}(D-1), u^{NASE}_{MSV\text{-}row}(D-1), \sigma^{NASE}_{MSV\text{-}row}(D-1)]^{T}$  (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$u^{NASE}_{MSC\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j,d)$  (67)

$\sigma^{NASE}_{MSC\text{-}col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{NASE}(j,d) - u^{NASE}_{MSC\text{-}col}(j)\bigr)^2\right)^{1/2}$  (68)

$u^{NASE}_{MSV\text{-}col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j,d)$  (69)

$\sigma^{NASE}_{MSV\text{-}col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{NASE}(j,d) - u^{NASE}_{MSV\text{-}col}(j)\bigr)^2\right)^{1/2}$  (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$f^{NASE}_{col} = [u^{NASE}_{MSC\text{-}col}(0), \sigma^{NASE}_{MSC\text{-}col}(0), u^{NASE}_{MSV\text{-}col}(0), \sigma^{NASE}_{MSV\text{-}col}(0), \ldots, u^{NASE}_{MSC\text{-}col}(J-1), \sigma^{NASE}_{MSC\text{-}col}(J-1), u^{NASE}_{MSV\text{-}col}(J-1), \sigma^{NASE}_{MSV\text{-}col}(J-1)]^{T}$  (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$f^{NASE} = [(f^{NASE}_{row})^{T}, (f^{NASE}_{col})^{T}]^{T}$  (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.

[Figure: an MSC/MSV matrix with entries MSC(j, d)/MSV(j, d), modulation frequency (subband index j = 1…J) on one axis and feature dimension (d = 1…D) on the other over a texture window; the row-based statistics μ_row,d and σ_row,d are taken along each row.]

Fig. 2.8 The row-based modulation spectral feature values

[Figure: the same MSC/MSV matrix aggregated along each column, yielding μ_col,j and σ_col,j for each modulation subband.]

Fig. 2.9 The column-based modulation spectral feature values

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$\bar{f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n}$  (73)

where $f_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{f}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{f}_c$:

$\hat{f}_c(m) = \dfrac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C$  (74)

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th normalized representative feature vector, and $f_{max}(m)$ and $f_{min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$f_{max}(m) = \max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m)$  (75)

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
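The linear normalization of Eqs. (74)-(75) is a standard min-max scaling whose range is fitted on the training set. A brief sketch; the epsilon guard against constant features is an addition of this sketch, not part of the thesis:

```python
import numpy as np

def fit_minmax(train_features):
    """train_features: N x M matrix of all training feature vectors.
    Returns the per-dimension minima and maxima of Eq. (75)."""
    return train_features.min(axis=0), train_features.max(axis=0)

def minmax_normalize(f, f_min, f_max, eps=1e-12):
    """Eq. (74): map each feature value into [0, 1] using the
    training-set range; eps avoids division by zero for features
    that are constant over the training set."""
    return (f - f_min) / np.maximum(f_max - f_min, eps)
```

At test time the same `f_min`/`f_max` fitted on the training data are reused, so test vectors may fall slightly outside [0, 1].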

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let $S_W$ and $S_B$ denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^{T}$  (76)

where $x_{c,n}$ is the n-th feature vector labeled as class c, $\bar{x}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^{T}$  (77)

where $\bar{x}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$J_F(A) = tr\bigl((A^{T} S_W A)^{-1}(A^{T} S_B A)\bigr)$  (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $S_W$ are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of $S_W$ and Λ the diagonal matrix formed by the corresponding eigenvalues; thus $S_W\Phi = \Phi\Lambda$. Each training vector x is then whitening transformed by $\Phi\Lambda^{-1/2}$:

$w = (\Phi\Lambda^{-1/2})^{T} x$  (79)

It can be shown that the whitened within-class scatter matrix $S_{W_w} = (\Phi\Lambda^{-1/2})^{T} S_W (\Phi\Lambda^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix $S_{B_w} = (\Phi\Lambda^{-1/2})^{T} S_B (\Phi\Lambda^{-1/2})$ contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of $S_{B_w}$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix $A_{WLDA}$ is defined as

$A_{WLDA} = \Phi\Lambda^{-1/2}\Psi$  (80)

$A_{WLDA}$ is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$y = A_{WLDA}^{T} x$  (81)
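The whitening-plus-LDA procedure of Eqs. (76)-(80) can be sketched with NumPy's symmetric eigendecomposition. This is a minimal sketch, not the thesis's exact implementation; the small floor on the eigenvalues is an assumption added to keep the whitening numerically stable:

```python
import numpy as np

def whitened_lda(X, y, n_components):
    """X: N x H training matrix, y: length-N class labels.
    Returns the H x n_components matrix A_WLDA of Eq. (80)."""
    classes = np.unique(y)
    x_bar = X.mean(axis=0)                        # overall mean
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)             # Eq. (76)
        d = (mc - x_bar)[:, None]
        Sb += len(Xc) * (d @ d.T)                 # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                 # Sw Phi = Phi Lam
    W = Phi / np.sqrt(np.maximum(lam, 1e-12))     # Phi Lam^{-1/2}: whitening
    Sb_w = W.T @ Sb @ W                           # whitened between-class scatter
    ev, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(ev)[::-1][:n_components]   # keep the largest eigenvalues
    return W @ Psi[:, order]                      # Eq. (80): A = Phi Lam^{-1/2} Psi
```

For C classes, at most C−1 components carry discriminative information, so `n_components` would be C−1 = 5 for the six-genre task.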

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix $A_{WLDA}$. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n}$  (82)

where $y_{c,n}$ denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{y}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

$s = \arg\min_{1\le c\le C} d(y, \bar{y}_c)$  (83)
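The nearest-centroid decision of Eqs. (82)-(83) can be sketched as follows (illustrative function names, not thesis code):

```python
import numpy as np

def fit_centroids(Y, labels):
    """Eq. (82): per-genre mean of the whitened-LDA-transformed
    training vectors.  Y: N x h matrix, labels: length-N array."""
    classes = np.unique(labels)
    centroids = np.stack([Y[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify_nearest_centroid(y, classes, centroids):
    """Eq. (83): return the genre whose centroid has the minimum
    Euclidean distance to the test vector y."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return classes[int(np.argmin(dists))]
```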

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$CA = \sum_{1\le c\le C} P_c \cdot CA_c$  (84)

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
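Eq. (84) is simply a prior-weighted average of the per-class accuracies, where each prior is the fraction of test tracks in that class. A one-function sketch (the per-class accuracies in the test are hypothetical values, not results from the thesis):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): CA = sum_c P_c * CA_c, where P_c = n_c / N is the
    fraction of test tracks belonging to class c."""
    total = sum(class_counts)
    return sum(ca * n / total for ca, n in zip(per_class_acc, class_counts))
```

With the ISMIR2004 test split, `class_counts` would be [320, 114, 26, 45, 102, 122], so the Classical accuracy dominates the overall score.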

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64

Table 3.2 Confusion matrices of row-based modulation spectral feature vectors (columns: actual genre; rows: predicted genre): (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 275 | 0 | 2 | 0 | 1 | 19
Electronic | 0 | 91 | 0 | 1 | 7 | 6
Jazz | 6 | 0 | 18 | 0 | 0 | 4
Metal/Punk | 2 | 3 | 0 | 36 | 20 | 4
Pop/Rock | 4 | 12 | 5 | 8 | 70 | 14
World | 33 | 8 | 1 | 0 | 4 | 75
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) SMMFCC1, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 85.94 | 0.00 | 7.69 | 0.00 | 0.98 | 15.57
Electronic | 0.00 | 79.82 | 0.00 | 2.22 | 6.86 | 4.92
Jazz | 1.88 | 0.00 | 69.23 | 0.00 | 0.00 | 3.28
Metal/Punk | 0.63 | 2.63 | 0.00 | 80.00 | 19.61 | 3.28
Pop/Rock | 1.25 | 10.53 | 19.23 | 17.78 | 68.63 | 11.48
World | 10.31 | 7.02 | 3.85 | 0.00 | 3.92 | 61.48

(b) SMOSC1, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 292 | 1 | 1 | 0 | 2 | 10
Electronic | 1 | 89 | 1 | 2 | 11 | 11
Jazz | 4 | 0 | 19 | 1 | 1 | 6
Metal/Punk | 0 | 5 | 0 | 32 | 21 | 3
Pop/Rock | 0 | 13 | 3 | 10 | 61 | 8
World | 23 | 6 | 2 | 0 | 6 | 84
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) SMOSC1, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 91.25 | 0.88 | 3.85 | 0.00 | 1.96 | 8.20
Electronic | 0.31 | 78.07 | 3.85 | 4.44 | 10.78 | 9.02
Jazz | 1.25 | 0.00 | 73.08 | 2.22 | 0.98 | 4.92
Metal/Punk | 0.00 | 4.39 | 0.00 | 71.11 | 20.59 | 2.46
Pop/Rock | 0.00 | 11.40 | 11.54 | 22.22 | 59.80 | 6.56
World | 7.19 | 5.26 | 7.69 | 0.00 | 5.88 | 68.85

(c) SMASE1, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 286 | 3 | 1 | 0 | 3 | 18
Electronic | 0 | 87 | 1 | 1 | 9 | 5
Jazz | 5 | 4 | 17 | 0 | 0 | 9
Metal/Punk | 0 | 4 | 1 | 36 | 18 | 4
Pop/Rock | 1 | 10 | 3 | 7 | 68 | 13
World | 28 | 6 | 3 | 1 | 4 | 73
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) SMASE1, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 89.38 | 2.63 | 3.85 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 76.32 | 3.85 | 2.22 | 8.82 | 4.10
Jazz | 1.56 | 3.51 | 65.38 | 0.00 | 0.00 | 7.38
Metal/Punk | 0.00 | 3.51 | 3.85 | 80.00 | 17.65 | 3.28
Pop/Rock | 0.31 | 8.77 | 11.54 | 15.56 | 66.67 | 10.66
World | 8.75 | 5.26 | 11.54 | 2.22 | 3.92 | 59.84

(d) SMMFCC1+SMOSC1+SMASE1, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 0 | 1 | 0 | 0 | 9
Electronic | 0 | 96 | 1 | 1 | 9 | 9
Jazz | 2 | 1 | 21 | 0 | 0 | 1
Metal/Punk | 0 | 1 | 0 | 34 | 8 | 1
Pop/Rock | 1 | 9 | 2 | 9 | 80 | 16
World | 17 | 7 | 1 | 1 | 5 | 86
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) SMMFCC1+SMOSC1+SMASE1, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 0.00 | 3.85 | 0.00 | 0.00 | 7.38
Electronic | 0.00 | 84.21 | 3.85 | 2.22 | 8.82 | 7.38
Jazz | 0.63 | 0.88 | 80.77 | 0.00 | 0.00 | 0.82
Metal/Punk | 0.00 | 0.88 | 0.00 | 75.56 | 7.84 | 0.82
Pop/Rock | 0.31 | 7.89 | 7.69 | 20.00 | 78.43 | 13.11
World | 5.31 | 6.14 | 3.85 | 2.22 | 4.90 | 70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector gets the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60

Table 3.4 Confusion matrices of column-based modulation spectral feature vectors (columns: actual genre; rows: predicted genre): (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 272 | 1 | 1 | 0 | 6 | 22
Electronic | 0 | 84 | 0 | 2 | 8 | 4
Jazz | 13 | 1 | 19 | 1 | 2 | 19
Metal/Punk | 2 | 7 | 0 | 39 | 30 | 4
Pop/Rock | 0 | 11 | 3 | 3 | 47 | 19
World | 33 | 10 | 3 | 0 | 9 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) SMMFCC2, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 85.00 | 0.88 | 3.85 | 0.00 | 5.88 | 18.03
Electronic | 0.00 | 73.68 | 0.00 | 4.44 | 7.84 | 3.28
Jazz | 4.06 | 0.88 | 73.08 | 2.22 | 1.96 | 15.57
Metal/Punk | 0.63 | 6.14 | 0.00 | 86.67 | 29.41 | 3.28
Pop/Rock | 0.00 | 9.65 | 11.54 | 6.67 | 46.08 | 15.57
World | 10.31 | 8.77 | 11.54 | 0.00 | 8.82 | 44.26

(b) SMOSC2, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 262 | 2 | 0 | 0 | 3 | 33
Electronic | 0 | 83 | 0 | 1 | 9 | 6
Jazz | 17 | 1 | 20 | 0 | 6 | 20
Metal/Punk | 1 | 5 | 0 | 33 | 21 | 2
Pop/Rock | 0 | 17 | 4 | 10 | 51 | 10
World | 40 | 6 | 2 | 1 | 12 | 51
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) SMOSC2, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 81.88 | 1.75 | 0.00 | 0.00 | 2.94 | 27.05
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 8.82 | 4.92
Jazz | 5.31 | 0.88 | 76.92 | 0.00 | 5.88 | 16.39
Metal/Punk | 0.31 | 4.39 | 0.00 | 73.33 | 20.59 | 1.64
Pop/Rock | 0.00 | 14.91 | 15.38 | 22.22 | 50.00 | 8.20
World | 12.50 | 5.26 | 7.69 | 2.22 | 11.76 | 41.80

(c) SMASE2, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 277 | 0 | 0 | 0 | 2 | 29
Electronic | 0 | 83 | 0 | 1 | 5 | 2
Jazz | 9 | 3 | 17 | 1 | 2 | 15
Metal/Punk | 1 | 5 | 1 | 35 | 24 | 7
Pop/Rock | 2 | 13 | 1 | 8 | 57 | 15
World | 31 | 10 | 7 | 0 | 12 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) SMASE2, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 86.56 | 0.00 | 0.00 | 0.00 | 1.96 | 23.77
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 4.90 | 1.64
Jazz | 2.81 | 2.63 | 65.38 | 2.22 | 1.96 | 12.30
Metal/Punk | 0.31 | 4.39 | 3.85 | 77.78 | 23.53 | 5.74
Pop/Rock | 0.63 | 11.40 | 3.85 | 17.78 | 55.88 | 12.30
World | 9.69 | 8.77 | 26.92 | 0.00 | 11.76 | 44.26

(d) SMMFCC2+SMOSC2+SMASE2, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 289 | 5 | 0 | 0 | 3 | 18
Electronic | 0 | 89 | 0 | 2 | 4 | 4
Jazz | 2 | 3 | 19 | 0 | 1 | 10
Metal/Punk | 2 | 2 | 0 | 38 | 21 | 2
Pop/Rock | 0 | 12 | 5 | 4 | 61 | 11
World | 27 | 3 | 2 | 1 | 12 | 77
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) SMMFCC2+SMOSC2+SMASE2, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 90.31 | 4.39 | 0.00 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 78.07 | 0.00 | 4.44 | 3.92 | 3.28
Jazz | 0.63 | 2.63 | 73.08 | 0.00 | 0.98 | 8.20
Metal/Punk | 0.63 | 1.75 | 0.00 | 84.44 | 20.59 | 1.64
Pop/Rock | 0.00 | 10.53 | 19.23 | 8.89 | 59.80 | 9.02
World | 8.44 | 2.63 | 7.69 | 2.22 | 11.76 | 63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (columns: actual genre; rows: predicted genre): (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 2 | 1 | 0 | 3 | 19
Electronic | 0 | 86 | 0 | 1 | 7 | 5
Jazz | 2 | 0 | 18 | 0 | 0 | 3
Metal/Punk | 1 | 4 | 0 | 35 | 18 | 2
Pop/Rock | 1 | 16 | 4 | 8 | 67 | 13
World | 16 | 6 | 3 | 1 | 7 | 80
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) SMMFCC3, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 1.75 | 3.85 | 0.00 | 2.94 | 15.57
Electronic | 0.00 | 75.44 | 0.00 | 2.22 | 6.86 | 4.10
Jazz | 0.63 | 0.00 | 69.23 | 0.00 | 0.00 | 2.46
Metal/Punk | 0.31 | 3.51 | 0.00 | 77.78 | 17.65 | 1.64
Pop/Rock | 0.31 | 14.04 | 15.38 | 17.78 | 65.69 | 10.66
World | 5.00 | 5.26 | 11.54 | 2.22 | 6.86 | 65.57

(b) SMOSC3, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 0 | 0 | 0 | 1 | 13
Electronic | 0 | 90 | 1 | 2 | 9 | 6
Jazz | 0 | 0 | 21 | 0 | 0 | 4
Metal/Punk | 0 | 2 | 0 | 31 | 21 | 2
Pop/Rock | 0 | 11 | 3 | 10 | 64 | 10
World | 20 | 11 | 1 | 2 | 7 | 87
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) SMOSC3, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 0.00 | 0.00 | 0.00 | 0.98 | 10.66
Electronic | 0.00 | 78.95 | 3.85 | 4.44 | 8.82 | 4.92
Jazz | 0.00 | 0.00 | 80.77 | 0.00 | 0.00 | 3.28
Metal/Punk | 0.00 | 1.75 | 0.00 | 68.89 | 20.59 | 1.64
Pop/Rock | 0.00 | 9.65 | 11.54 | 22.22 | 62.75 | 8.20
World | 6.25 | 9.65 | 3.85 | 4.44 | 6.86 | 71.31

(c) SMASE3, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 296 | 2 | 1 | 0 | 0 | 17
Electronic | 1 | 91 | 0 | 1 | 4 | 3
Jazz | 0 | 2 | 19 | 0 | 0 | 5
Metal/Punk | 0 | 2 | 1 | 34 | 20 | 8
Pop/Rock | 2 | 13 | 4 | 8 | 71 | 8
World | 21 | 4 | 1 | 2 | 7 | 81
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) SMASE3, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 92.50 | 1.75 | 3.85 | 0.00 | 0.00 | 13.93
Electronic | 0.31 | 79.82 | 0.00 | 2.22 | 3.92 | 2.46
Jazz | 0.00 | 1.75 | 73.08 | 0.00 | 0.00 | 4.10
Metal/Punk | 0.00 | 1.75 | 3.85 | 75.56 | 19.61 | 6.56
Pop/Rock | 0.63 | 11.40 | 15.38 | 17.78 | 69.61 | 6.56
World | 6.56 | 3.51 | 3.85 | 4.44 | 6.86 | 66.39

(d) SMMFCC3+SMOSC3+SMASE3, counts:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 300 | 2 | 0 | 0 | 0 | 8
Electronic | 2 | 95 | 0 | 2 | 7 | 9
Jazz | 1 | 1 | 20 | 0 | 0 | 0
Metal/Punk | 0 | 0 | 0 | 35 | 10 | 1
Pop/Rock | 1 | 10 | 3 | 7 | 79 | 11
World | 16 | 6 | 3 | 1 | 6 | 93
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) SMMFCC3+SMOSC3+SMASE3, percentages:
 | Classic | Electronic | Jazz | Metal/Punk | Pop/Rock | World
Classic | 93.75 | 1.75 | 0.00 | 0.00 | 0.00 | 6.56
Electronic | 0.63 | 83.33 | 0.00 | 4.44 | 6.86 | 7.38
Jazz | 0.31 | 0.88 | 76.92 | 0.00 | 0.00 | 0.00
Metal/Punk | 0.00 | 0.00 | 0.00 | 77.78 | 9.80 | 0.82
Pop/Rock | 0.31 | 8.77 | 11.54 | 15.56 | 77.45 | 9.02
World | 5.00 | 5.26 | 11.54 | 2.22 | 5.88 | 76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set | MSCs & MSVs | MSE
SMMFCC1 | 77.50 | 72.02
SMMFCC2 | 70.64 | 69.82
SMMFCC3 | 80.38 | 79.15
SMOSC1 | 79.15 | 77.50
SMOSC2 | 68.59 | 70.51
SMOSC3 | 81.34 | 80.11
SMASE1 | 77.78 | 76.41
SMASE2 | 71.74 | 71.06
SMASE3 | 81.21 | 79.15
SMMFCC1+SMOSC1+SMASE1 | 84.64 | 85.08
SMMFCC2+SMOSC2+SMASE2 | 78.60 | 79.01
SMMFCC3+SMOSC3+SMASE3 | 85.32 | 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, Vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, Vol. 7, Issue 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, Vol. 23, Issue 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, No. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, Issue 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.


[Fig. 2.7: The flowchart for extracting MASE. The music signal is divided into frames s_1[n], ..., s_I[n]; an NASE feature vector NASE_1[d], ..., NASE_I[d] is extracted from each frame; the DFT of each feature-value trajectory, followed by windowing/averaging, yields the modulation spectra M_1[m], ..., M_D[m], from which the contrasts and valleys are determined.]

Table 2.4 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0             | [0, 2)                           | [0, 0.33)
1             | [2, 4)                           | [0.33, 0.66)
2             | [4, 8)                           | [0.66, 1.32)
3             | [8, 16)                          | [1.32, 2.64)
4             | [16, 32)                         | [2.64, 5.28)
5             | [32, 64)                         | [5.28, 10.56)
6             | [64, 128)                        | [10.56, 21.12)
7             | [128, 256)                       | [21.12, 42.24]
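The octave-spaced partition in the table above can be generated programmatically. A minimal sketch, assuming a modulation-frequency resolution of 0.165 Hz per index (consistent with the listed intervals); the function name is illustrative:

```python
def modulation_subbands(n_bands=8, hz_per_index=0.165):
    """Octave-spaced modulation subbands: index ranges [0,2), [2,4), [4,8), ...
    Returns a list of ((lo_idx, hi_idx), (lo_hz, hi_hz)) pairs."""
    bands = []
    lo, hi = 0, 2
    for _ in range(n_bands):
        bands.append(((lo, hi), (lo * hz_per_index, hi * hz_per_index)))
        lo, hi = hi, hi * 2          # each subsequent band doubles in width
    return bands

bands = modulation_subbands()
# bands[0] covers indices [0, 2), i.e. [0, 0.33) Hz; bands[7] covers [128, 256)
```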

2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(l,j) \qquad (40)$$

$$\sigma_{MSC\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(l,j)-\mu_{MSC\text{-}row}^{MFCC}(l)\right)^2\right)^{1/2} \qquad (41)$$

$$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(l,j) \qquad (42)$$

$$\sigma_{MSV\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(l,j)-\mu_{MSV\text{-}row}^{MFCC}(l)\right)^2\right)^{1/2} \qquad (43)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$\mathbf f_{row}^{MFCC} = \big[\mu_{MSC\text{-}row}^{MFCC}(0),\ \sigma_{MSC\text{-}row}^{MFCC}(0),\ \mu_{MSV\text{-}row}^{MFCC}(0),\ \sigma_{MSV\text{-}row}^{MFCC}(0),\ \ldots,\ \mu_{MSC\text{-}row}^{MFCC}(L-1),\ \sigma_{MSC\text{-}row}^{MFCC}(L-1),\ \mu_{MSV\text{-}row}^{MFCC}(L-1),\ \sigma_{MSV\text{-}row}^{MFCC}(L-1)\big]^T \qquad (44)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(l,j) \qquad (45)$$

$$\sigma_{MSC\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(l,j)-\mu_{MSC\text{-}col}^{MFCC}(j)\right)^2\right)^{1/2} \qquad (46)$$

$$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(l,j) \qquad (47)$$

$$\sigma_{MSV\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(l,j)-\mu_{MSV\text{-}col}^{MFCC}(j)\right)^2\right)^{1/2} \qquad (48)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf f_{col}^{MFCC} = \big[\mu_{MSC\text{-}col}^{MFCC}(0),\ \sigma_{MSC\text{-}col}^{MFCC}(0),\ \mu_{MSV\text{-}col}^{MFCC}(0),\ \sigma_{MSV\text{-}col}^{MFCC}(0),\ \ldots,\ \mu_{MSC\text{-}col}^{MFCC}(J-1),\ \sigma_{MSC\text{-}col}^{MFCC}(J-1),\ \mu_{MSV\text{-}col}^{MFCC}(J-1),\ \sigma_{MSV\text{-}col}^{MFCC}(J-1)\big]^T \qquad (49)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4L + 4J) is obtained:

$$\mathbf f^{MFCC} = \big[(\mathbf f_{row}^{MFCC})^T,\ (\mathbf f_{col}^{MFCC})^T\big]^T \qquad (50)$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L + 4J. That is, the overall feature dimension of SMMFCC is 80 + 32 = 112.
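The row- and column-wise aggregation of Eqs. (40)-(50) amounts to taking the mean and (population) standard deviation of the MSC and MSV matrices along each axis and interleaving the results. A minimal NumPy sketch; the function name is illustrative, and the interleaving order follows the feature-vector definitions above:

```python
import numpy as np

def aggregate(msc, msv):
    """msc, msv: (L, J) matrices of modulation spectral contrasts/valleys.
    Returns the (4L + 4J)-dim vector of row- and column-wise means and stds."""
    def stats(mat, axis):
        return mat.mean(axis=axis), mat.std(axis=axis)  # population std, as in Eq. (41)

    # Row-based: statistics across modulation frequency (axis 1) -> 4L values,
    # interleaved per row as [mu_MSC, sigma_MSC, mu_MSV, sigma_MSV].
    row = np.column_stack(stats(msc, 1) + stats(msv, 1)).ravel()
    # Column-based: statistics across feature dimension (axis 0) -> 4J values.
    col = np.column_stack(stats(msc, 0) + stats(msv, 0)).ravel()
    return np.concatenate([row, col])

L, J = 20, 8                                   # MFCC rows, modulation subbands
f = aggregate(np.ones((L, J)), np.ones((L, J)))
assert f.shape == (4 * L + 4 * J,)             # 112, the SMMFCC dimension
```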

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(d,j) \qquad (51)$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(d,j)-\mu_{MSC\text{-}row}^{OSC}(d)\right)^2\right)^{1/2} \qquad (52)$$

$$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(d,j) \qquad (53)$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(d,j)-\mu_{MSV\text{-}row}^{OSC}(d)\right)^2\right)^{1/2} \qquad (54)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf f_{row}^{OSC} = \big[\mu_{MSC\text{-}row}^{OSC}(0),\ \sigma_{MSC\text{-}row}^{OSC}(0),\ \mu_{MSV\text{-}row}^{OSC}(0),\ \sigma_{MSV\text{-}row}^{OSC}(0),\ \ldots,\ \mu_{MSC\text{-}row}^{OSC}(D-1),\ \sigma_{MSC\text{-}row}^{OSC}(D-1),\ \mu_{MSV\text{-}row}^{OSC}(D-1),\ \sigma_{MSV\text{-}row}^{OSC}(D-1)\big]^T \qquad (55)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(d,j) \qquad (56)$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(d,j)-\mu_{MSC\text{-}col}^{OSC}(j)\right)^2\right)^{1/2} \qquad (57)$$

$$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(d,j) \qquad (58)$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(d,j)-\mu_{MSV\text{-}col}^{OSC}(j)\right)^2\right)^{1/2} \qquad (59)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf f_{col}^{OSC} = \big[\mu_{MSC\text{-}col}^{OSC}(0),\ \sigma_{MSC\text{-}col}^{OSC}(0),\ \mu_{MSV\text{-}col}^{OSC}(0),\ \sigma_{MSV\text{-}col}^{OSC}(0),\ \ldots,\ \mu_{MSC\text{-}col}^{OSC}(J-1),\ \sigma_{MSC\text{-}col}^{OSC}(J-1),\ \mu_{MSV\text{-}col}^{OSC}(J-1),\ \sigma_{MSV\text{-}col}^{OSC}(J-1)\big]^T \qquad (60)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D + 4J) is obtained:

$$\mathbf f^{OSC} = \big[(\mathbf f_{row}^{OSC})^T,\ (\mathbf f_{col}^{OSC})^T\big]^T \qquad (61)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(d,j) \qquad (62)$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(d,j)-\mu_{MSC\text{-}row}^{NASE}(d)\right)^2\right)^{1/2} \qquad (63)$$

$$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(d,j) \qquad (64)$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(d,j)-\mu_{MSV\text{-}row}^{NASE}(d)\right)^2\right)^{1/2} \qquad (65)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf f_{row}^{NASE} = \big[\mu_{MSC\text{-}row}^{NASE}(0),\ \sigma_{MSC\text{-}row}^{NASE}(0),\ \mu_{MSV\text{-}row}^{NASE}(0),\ \sigma_{MSV\text{-}row}^{NASE}(0),\ \ldots,\ \mu_{MSC\text{-}row}^{NASE}(D-1),\ \sigma_{MSC\text{-}row}^{NASE}(D-1),\ \mu_{MSV\text{-}row}^{NASE}(D-1),\ \sigma_{MSV\text{-}row}^{NASE}(D-1)\big]^T \qquad (66)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(d,j) \qquad (67)$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(d,j)-\mu_{MSC\text{-}col}^{NASE}(j)\right)^2\right)^{1/2} \qquad (68)$$

$$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(d,j) \qquad (69)$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(d,j)-\mu_{MSV\text{-}col}^{NASE}(j)\right)^2\right)^{1/2} \qquad (70)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf f_{col}^{NASE} = \big[\mu_{MSC\text{-}col}^{NASE}(0),\ \sigma_{MSC\text{-}col}^{NASE}(0),\ \mu_{MSV\text{-}col}^{NASE}(0),\ \sigma_{MSV\text{-}col}^{NASE}(0),\ \ldots,\ \mu_{MSC\text{-}col}^{NASE}(J-1),\ \sigma_{MSC\text{-}col}^{NASE}(J-1),\ \mu_{MSV\text{-}col}^{NASE}(J-1),\ \sigma_{MSV\text{-}col}^{NASE}(J-1)\big]^T \qquad (71)$$

If the row-based and column-based modulation spectral feature vectors are combined, a larger feature vector of size (4D + 4J) is obtained:

$$\mathbf f^{NASE} = \big[(\mathbf f_{row}^{NASE})^T,\ (\mathbf f_{col}^{NASE})^T\big]^T \qquad (72)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMASE is 76 + 32 = 108.

[Fig. 2.8: The row-based modulation spectral features. For each feature dimension (row) of the MSC/MSV matrices within a texture window, the mean μ_row and standard deviation σ_row are computed across the modulation-frequency axis.]

[Fig. 2.9: The column-based modulation spectral features. For each modulation subband (column) of the MSC/MSV matrices, the mean μ_col and standard deviation σ_col are computed across the feature-dimension axis.]

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$$\bar{\mathbf f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf f_{c,n} \qquad (73)$$

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector f̂_c:

$$\hat f_c(m) = \frac{\bar f_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \qquad 1 \le c \le C \qquad (74)$$

where C is the number of classes, f̂_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{\max}(m) = \max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m) \qquad (75)$$

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
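The per-genre averaging and min-max normalization of Eqs. (73)-(75) can be sketched as follows. This is a minimal NumPy illustration (the function name and dictionary layout are illustrative); it assumes f_max(m) > f_min(m) in every dimension:

```python
import numpy as np

def normalize(train_features):
    """train_features: dict mapping genre -> (N_c, M) array of feature vectors.
    Returns min-max normalized per-genre representative vectors (Eqs. 73-75)."""
    all_vecs = np.vstack(list(train_features.values()))
    f_min = all_vecs.min(axis=0)              # Eq. (75): per-dimension minimum
    f_max = all_vecs.max(axis=0)              # Eq. (75): per-dimension maximum
    reps = {}
    for genre, vecs in train_features.items():
        f_bar = vecs.mean(axis=0)             # Eq. (73): class average
        reps[genre] = (f_bar - f_min) / (f_max - f_min)   # Eq. (74)
    return reps, f_min, f_max

# Tiny illustrative input: two genres, two feature dimensions
train = {"classical": np.array([[0.0, 2.0], [2.0, 4.0]]),
         "rock":      np.array([[4.0, 0.0]])}
reps, f_min, f_max = normalize(train)
```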

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$$\mathbf S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf x_{c,n}-\bar{\mathbf x}_c)(\mathbf x_{c,n}-\bar{\mathbf x}_c)^T \qquad (76)$$

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf S_B = \sum_{c=1}^{C} N_c\, (\bar{\mathbf x}_c-\bar{\mathbf x})(\bar{\mathbf x}_c-\bar{\mathbf x})^T \qquad (77)$$

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(\mathbf A) = \mathrm{tr}\big((\mathbf A^T \mathbf S_W \mathbf A)^{-1}(\mathbf A^T \mathbf S_B \mathbf A)\big) \qquad (78)$$

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

$$\mathbf x_w = (\mathbf\Phi\mathbf\Lambda^{-1/2})^T \mathbf x \qquad (79)$$

It can be shown that the whitened within-class scatter matrix S_{W,w} = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_{B,w} = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B,w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

$$\mathbf A_{WLDA} = \mathbf\Phi\mathbf\Lambda^{-1/2}\mathbf\Psi \qquad (80)$$

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf y = \mathbf A_{WLDA}^T\, \mathbf x \qquad (81)$$
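The whitened LDA procedure of Eqs. (76)-(80) can be sketched in NumPy as follows. This is a minimal illustration (function name and toy data are illustrative); it assumes S_W is nonsingular, i.e. all of its eigenvalues are strictly positive:

```python
import numpy as np

def whitened_lda(X, y, n_classes):
    """X: (N, H) training vectors; y: (N,) integer labels in [0, n_classes).
    Returns the (H, C-1) whitened LDA transformation matrix of Eq. (80)."""
    x_bar = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(n_classes):
        Xc = X[y == c]
        xc_bar = Xc.mean(axis=0)
        Dc = Xc - xc_bar
        Sw += Dc.T @ Dc                                  # Eq. (76)
        d = (xc_bar - x_bar)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)                    # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                        # Sw Phi = Phi Lam
    W = Phi @ np.diag(lam ** -0.5)                       # whitening: Phi Lam^{-1/2}
    Sb_w = W.T @ Sb @ W                                  # whitened between-class scatter
    evals, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals)[::-1][: n_classes - 1]     # (C-1) largest eigenvalues
    return W @ Psi[:, order]                             # Eq. (80)

# Toy usage: three Gaussian classes in 5 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3.0 * c, 1.0, size=(40, 5)) for c in range(3)])
labels = np.repeat(np.arange(3), 40)
A = whitened_lda(X, labels, 3)     # shape (5, 2): h = C - 1
```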

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$$\bar{\mathbf y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf y_{c,n} \qquad (82)$$

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to y:

$$s = \arg\min_{1\le c\le C} d(\mathbf y,\ \bar{\mathbf y}_c) \qquad (83)$$
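The nearest centroid rule of Eqs. (82)-(83) reduces to a few lines of NumPy; a minimal sketch with illustrative toy centroids:

```python
import numpy as np

def classify(y_vec, centroids):
    """centroids: (C, h) matrix whose rows are the genre centroids of Eq. (82).
    Returns the index s of Eq. (83): the centroid nearest in Euclidean distance."""
    dists = np.linalg.norm(centroids - y_vec, axis=1)
    return int(np.argmin(dists))

# Toy centroids for three hypothetical genres in a 2-D transformed space
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
s = classify(np.array([0.9, 1.1]), centroids)   # nearest to centroid 1
```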

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16-bit stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1\le c\le C} P_c \cdot CA_c \qquad (84)$$

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
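Equation (84) is simply a count-weighted average of the per-class accuracies, with P_c estimated from the class frequencies of the test set. A small sketch; the per-class percentages below are the diagonal of the best confusion matrix reported later (Table 36(d)), so up to rounding the result reproduces the 85.32% figure:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): overall CA as the count-weighted average of per-class accuracy."""
    total = sum(class_counts)
    return sum((n / total) * ca for ca, n in zip(per_class_acc, class_counts))

counts = [320, 114, 26, 45, 102, 122]                   # ISMIR2004 test split
per_class = [93.75, 83.33, 76.92, 77.78, 77.45, 76.23]  # Table 36(d) diagonal (%)
ca = overall_accuracy(per_class, counts)                # ~85.32
```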

3.1 Comparison of row-based modulation spectral feature vectors

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set             | CA (%)
SMMFCC1                 | 77.50
SMOSC1                  | 79.15
SMASE1                  | 77.78
SMMFCC1+SMOSC1+SMASE1   | 84.64

Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Columns are the actual genres, rows the classified genres; the first matrix of each pair gives track counts, the second the column-wise percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        275        0        2        0        1       19
Electronic       0       91        0        1        7        6
Jazz             6        0       18        0        0        4
MetalPunk        2        3        0       36       20        4
PopRock          4       12        5        8       70       14
World           33        8        1        0        4       75
Total          320      114       26       45      102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.94      0.00     7.69     0.00     0.98    15.57
Electronic    0.00     79.82     0.00     2.22     6.86     4.92
Jazz          1.88      0.00    69.23     0.00     0.00     3.28
MetalPunk     0.63      2.63     0.00    80.00    19.61     3.28
PopRock       1.25     10.53    19.23    17.78    68.63    11.48
World        10.31      7.02     3.85     0.00     3.92    61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        292        1        1        0        2       10
Electronic       1       89        1        2       11       11
Jazz             4        0       19        1        1        6
MetalPunk        0        5        0       32       21        3
PopRock          0       13        3       10       61        8
World           23        6        2        0        6       84
Total          320      114       26       45      102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      91.25      0.88     3.85     0.00     1.96     8.20
Electronic    0.31     78.07     3.85     4.44    10.78     9.02
Jazz          1.25      0.00    73.08     2.22     0.98     4.92
MetalPunk     0.00      4.39     0.00    71.11    20.59     2.46
PopRock       0.00     11.40    11.54    22.22    59.80     6.56
World         7.19      5.26     7.69     0.00     5.88    68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        286        3        1        0        3       18
Electronic       0       87        1        1        9        5
Jazz             5        4       17        0        0        9
MetalPunk        0        4        1       36       18        4
PopRock          1       10        3        7       68       13
World           28        6        3        1        4       73
Total          320      114       26       45      102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      89.38      2.63     3.85     0.00     2.94    14.75
Electronic    0.00     76.32     3.85     2.22     8.82     4.10
Jazz          1.56      3.51    65.38     0.00     0.00     7.38
MetalPunk     0.00      3.51     3.85    80.00    17.65     3.28
PopRock       0.31      8.77    11.54    15.56    66.67    10.66
World         8.75      5.26    11.54     2.22     3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        1        0        0        9
Electronic       0       96        1        1        9        9
Jazz             2        1       21        0        0        1
MetalPunk        0        1        0       34        8        1
PopRock          1        9        2        9       80       16
World           17        7        1        1        5       86
Total          320      114       26       45      102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     3.85     0.00     0.00     7.38
Electronic    0.00     84.21     3.85     2.22     8.82     7.38
Jazz          0.63      0.88    80.77     0.00     0.00     0.82
MetalPunk     0.00      0.88     0.00    75.56     7.84     0.82
PopRock       0.31      7.89     7.69    20.00    78.43    13.11
World         5.31      6.14     3.85     2.22     4.90    70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based results. As with the row-based features, the combined feature vector achieves the best performance. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set             | CA (%)
SMMFCC2                 | 70.64
SMOSC2                  | 68.59
SMASE2                  | 71.74
SMMFCC2+SMOSC2+SMASE2   | 78.60

Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns are the actual genres, rows the classified genres; the first matrix of each pair gives track counts, the second the column-wise percentages.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1        1        0        6       22
Electronic       0       84        0        2        8        4
Jazz            13        1       19        1        2       19
MetalPunk        2        7        0       39       30        4
PopRock          0       11        3        3       47       19
World           33       10        3        0        9       54
Total          320      114       26       45      102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.00      0.88     3.85     0.00     5.88    18.03
Electronic    0.00     73.68     0.00     4.44     7.84     3.28
Jazz          4.06      0.88    73.08     2.22     1.96    15.57
MetalPunk     0.63      6.14     0.00    86.67    29.41     3.28
PopRock       0.00      9.65    11.54     6.67    46.08    15.57
World        10.31      8.77    11.54     0.00     8.82    44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2        0        0        3       33
Electronic       0       83        0        1        9        6
Jazz            17        1       20        0        6       20
MetalPunk        1        5        0       33       21        2
PopRock          0       17        4       10       51       10
World           40        6        2        1       12       51
Total          320      114       26       45      102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      81.88      1.75     0.00     0.00     2.94    27.05
Electronic    0.00     72.81     0.00     2.22     8.82     4.92
Jazz          5.31      0.88    76.92     0.00     5.88    16.39
MetalPunk     0.31      4.39     0.00    73.33    20.59     1.64
PopRock       0.00     14.91    15.38    22.22    50.00     8.20
World        12.50      5.26     7.69     2.22    11.76    41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0        0        0        2       29
Electronic       0       83        0        1        5        2
Jazz             9        3       17        1        2       15
MetalPunk        1        5        1       35       24        7
PopRock          2       13        1        8       57       15
World           31       10        7        0       12       54
Total          320      114       26       45      102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      86.56      0.00     0.00     0.00     1.96    23.77
Electronic    0.00     72.81     0.00     2.22     4.90     1.64
Jazz          2.81      2.63    65.38     2.22     1.96    12.30
MetalPunk     0.31      4.39     3.85    77.78    23.53     5.74
PopRock       0.63     11.40     3.85    17.78    55.88    12.30
World         9.69      8.77    26.92     0.00    11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5        0        0        3       18
Electronic       0       89        0        2        4        4
Jazz             2        3       19        0        1       10
MetalPunk        2        2        0       38       21        2
PopRock          0       12        5        4       61       11
World           27        3        2        1       12       77
Total          320      114       26       45      102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      90.31      4.39     0.00     0.00     2.94    14.75
Electronic    0.00     78.07     0.00     4.44     3.92     3.28
Jazz          0.63      2.63    73.08     0.00     0.98     8.20
MetalPunk     0.63      1.75     0.00    84.44    20.59     1.64
PopRock       0.00     10.53    19.23     8.89    59.80     9.02
World         8.44      2.63     7.69     2.22    11.76    63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set             | CA (%)
SMMFCC3                 | 80.38
SMOSC3                  | 81.34
SMASE3                  | 81.21
SMMFCC3+SMOSC3+SMASE3   | 85.32

Table 36 Confusion matrices of the combined row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns are the actual genres, rows the classified genres; the first matrix of each pair gives track counts, the second the column-wise percentages.

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        1        0        3       19
Electronic       0       86        0        1        7        5
Jazz             2        0       18        0        0        3
MetalPunk        1        4        0       35       18        2
PopRock          1       16        4        8       67       13
World           16        6        3        1        7       80
Total          320      114       26       45      102      122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     3.85     0.00     2.94    15.57
Electronic    0.00     75.44     0.00     2.22     6.86     4.10
Jazz          0.63      0.00    69.23     0.00     0.00     2.46
MetalPunk     0.31      3.51     0.00    77.78    17.65     1.64
PopRock       0.31     14.04    15.38    17.78    65.69    10.66
World         5.00      5.26    11.54     2.22     6.86    65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        0        0        1       13
Electronic       0       90        1        2        9        6
Jazz             0        0       21        0        0        4
MetalPunk        0        2        0       31       21        2
PopRock          0       11        3       10       64       10
World           20       11        1        2        7       87
Total          320      114       26       45      102      122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     0.00     0.00     0.98    10.66
Electronic    0.00     78.95     3.85     4.44     8.82     4.92
Jazz          0.00      0.00    80.77     0.00     0.00     3.28
MetalPunk     0.00      1.75     0.00    68.89    20.59     1.64
PopRock       0.00      9.65    11.54    22.22    62.75     8.20
World         6.25      9.65     3.85     4.44     6.86    71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2        1        0        0       17
Electronic       1       91        0        1        4        3
Jazz             0        2       19        0        0        5
MetalPunk        0        2        1       34       20        8
PopRock          2       13        4        8       71        8
World           21        4        1        2        7       81
Total          320      114       26       45      102      122

(c) SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      92.50      1.75     3.85     0.00     0.00    13.93
Electronic    0.31     79.82     0.00     2.22     3.92     2.46
Jazz          0.00      1.75    73.08     0.00     0.00     4.10
MetalPunk     0.00      1.75     3.85    75.56    19.61     6.56
PopRock       0.63     11.40    15.38    17.78    69.61     6.56
World         6.56      3.51     3.85     4.44     6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        0        0        0        8
Electronic       2       95        0        2        7        9
Jazz             1        1       20        0        0        0
MetalPunk        0        0        0       35       10        1
PopRock          1       10        3        7       79       11
World           16        6        3        1        6       93
Total          320      114       26       45      102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     0.00     0.00     0.00     6.56
Electronic    0.63     83.33     0.00     4.44     6.86     7.38
Jazz          0.31      0.88    76.92     0.00     0.00     0.00
MetalPunk     0.00      0.00     0.00    77.78     9.80     0.82
PopRock       0.31      8.77    11.54    15.56    77.45     9.02
World         5.00      5.26    11.54     2.22     5.88    76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37, we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) for each feature value

Feature Set             | MSCs & MSVs | MSE
SMMFCC1                 | 77.50       | 72.02
SMMFCC2                 | 70.64       | 69.82
SMMFCC3                 | 80.38       | 79.15
SMOSC1                  | 79.15       | 77.50
SMOSC2                  | 68.59       | 70.51
SMOSC3                  | 81.34       | 80.11
SMASE1                  | 77.78       | 76.41
SMASE2                  | 71.74       | 71.06
SMASE3                  | 81.21       | 79.15
SMMFCC1+SMOSC1+SMASE1   | 84.64       | 85.08
SMMFCC2+SMOSC2+SMASE2   | 78.60       | 79.01
SMMFCC3+SMOSC3+SMASE3   | 85.32       | 85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, Vol. 10, No. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE Int. Conf. on Multimedia & Expo, Vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. Int. Conf. on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, Vol. 7, No. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, Vol. 14, No. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, Vol. 15, No. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Trans. on Multimedia, Vol. 7, No. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th Int. Conf. on Digital Audio Effects, pp. 8-11, Sep. 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, Jun. 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 5, pp. 197-200, Mar. 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, Vol. 13, No. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, Nov. 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, Sep. 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, Vol. 23, No. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, No. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Trans. on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, Oct. 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), pp. 1085-1088, Jul. 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Trans. on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Trans. on Audio, Speech and Language Processing, Vol. 15, No. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, Vol. 65, No. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 119-139, 1997.

Page 38: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

32

each row (and each column) of the MSC and MSV matrices will be computed as the

feature values

2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u^{MFCC}_{MSC-row}(l) = (1/J) Σ_{j=0}^{J−1} MSC^{MFCC}(l, j)        (40)

σ^{MFCC}_{MSC-row}(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^{MFCC}(l, j) − u^{MFCC}_{MSC-row}(l) )^2 ]^{1/2}        (41)

u^{MFCC}_{MSV-row}(l) = (1/J) Σ_{j=0}^{J−1} MSV^{MFCC}(l, j)        (42)

σ^{MFCC}_{MSV-row}(l) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^{MFCC}(l, j) − u^{MFCC}_{MSV-row}(l) )^2 ]^{1/2}        (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f^{MFCC}_{row} = [ u^{MFCC}_{MSC-row}(0), σ^{MFCC}_{MSC-row}(0), u^{MFCC}_{MSV-row}(0), σ^{MFCC}_{MSV-row}(0), …, u^{MFCC}_{MSC-row}(L−1), σ^{MFCC}_{MSC-row}(L−1), u^{MFCC}_{MSV-row}(L−1), σ^{MFCC}_{MSV-row}(L−1) ]^T        (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{MFCC}_{MSC-col}(j) = (1/L) Σ_{l=0}^{L−1} MSC^{MFCC}(l, j)        (45)

σ^{MFCC}_{MSC-col}(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSC^{MFCC}(l, j) − u^{MFCC}_{MSC-col}(j) )^2 ]^{1/2}        (46)

u^{MFCC}_{MSV-col}(j) = (1/L) Σ_{l=0}^{L−1} MSV^{MFCC}(l, j)        (47)

σ^{MFCC}_{MSV-col}(j) = [ (1/L) Σ_{l=0}^{L−1} ( MSV^{MFCC}(l, j) − u^{MFCC}_{MSV-col}(j) )^2 ]^{1/2}        (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{MFCC}_{col} = [ u^{MFCC}_{MSC-col}(0), σ^{MFCC}_{MSC-col}(0), u^{MFCC}_{MSV-col}(0), σ^{MFCC}_{MSV-col}(0), …, u^{MFCC}_{MSC-col}(J−1), σ^{MFCC}_{MSC-col}(J−1), u^{MFCC}_{MSV-col}(J−1), σ^{MFCC}_{MSV-col}(J−1) ]^T        (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L + 4J) can be obtained:

f^{MFCC} = [ (f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T ]^T        (50)

In summary, the row-based MSC and MSV statistics form a vector of size 4L = 4×20 = 80, and the column-based statistics form a vector of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L + 4J; that is, the overall feature dimension of SMMFCC is 80 + 32 = 112.
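The statistical aggregation above can be sketched in a few lines of NumPy. This is an illustrative implementation only (not the thesis's code); the function name `aggregate_modulation_features` is hypothetical, and the entry ordering within the row-based part differs slightly from eq. (44) while carrying the same information.

```python
import numpy as np

def aggregate_modulation_features(msc, msv):
    """Statistical aggregation of MSC/MSV matrices (cf. eqs. 40-50).

    msc, msv: arrays of shape (L, J) -- L feature dimensions (rows),
    J modulation subbands (columns).  Returns a vector of length 4L + 4J.
    """
    row_parts = []
    for m in (msc, msv):
        # row-based: mean/std across the J modulation subbands (eqs. 40-43)
        row_parts.append(np.stack([m.mean(axis=1), m.std(axis=1)], axis=1).ravel())
    row_feat = np.concatenate(row_parts)             # length 4L (cf. eq. 44)

    col_parts = []
    for m in (msc, msv):
        # column-based: mean/std across the L feature dimensions (eqs. 45-48)
        col_parts.append(np.stack([m.mean(axis=0), m.std(axis=0)], axis=1).ravel())
    col_feat = np.concatenate(col_parts)             # length 4J (cf. eq. 49)

    return np.concatenate([row_feat, col_feat])      # length 4L + 4J (eq. 50)

# For MFCC (L = 20, J = 8) this yields the 112-dimensional SMMFCC vector.
feat = aggregate_modulation_features(np.random.rand(20, 8), np.random.rand(20, 8))
```

Note that `np.std` with its default `ddof=0` matches the 1/J (and 1/L) normalization in eqs. (41) and (46).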

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u^{OSC}_{MSC-row}(d) = (1/J) Σ_{j=0}^{J−1} MSC^{OSC}(d, j)        (51)

σ^{OSC}_{MSC-row}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^{OSC}(d, j) − u^{OSC}_{MSC-row}(d) )^2 ]^{1/2}        (52)

u^{OSC}_{MSV-row}(d) = (1/J) Σ_{j=0}^{J−1} MSV^{OSC}(d, j)        (53)

σ^{OSC}_{MSV-row}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^{OSC}(d, j) − u^{OSC}_{MSV-row}(d) )^2 ]^{1/2}        (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{OSC}_{row} = [ u^{OSC}_{MSC-row}(0), σ^{OSC}_{MSC-row}(0), u^{OSC}_{MSV-row}(0), σ^{OSC}_{MSV-row}(0), …, u^{OSC}_{MSC-row}(D−1), σ^{OSC}_{MSC-row}(D−1), u^{OSC}_{MSV-row}(D−1), σ^{OSC}_{MSV-row}(D−1) ]^T        (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{OSC}_{MSC-col}(j) = (1/D) Σ_{d=0}^{D−1} MSC^{OSC}(d, j)        (56)

σ^{OSC}_{MSC-col}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSC^{OSC}(d, j) − u^{OSC}_{MSC-col}(j) )^2 ]^{1/2}        (57)

u^{OSC}_{MSV-col}(j) = (1/D) Σ_{d=0}^{D−1} MSV^{OSC}(d, j)        (58)

σ^{OSC}_{MSV-col}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSV^{OSC}(d, j) − u^{OSC}_{MSV-col}(j) )^2 ]^{1/2}        (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{OSC}_{col} = [ u^{OSC}_{MSC-col}(0), σ^{OSC}_{MSC-col}(0), u^{OSC}_{MSV-col}(0), σ^{OSC}_{MSV-col}(0), …, u^{OSC}_{MSC-col}(J−1), σ^{OSC}_{MSC-col}(J−1), u^{OSC}_{MSV-col}(J−1), σ^{OSC}_{MSV-col}(J−1) ]^T        (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D + 4J) can be obtained:

f^{OSC} = [ (f^{OSC}_{row})^T, (f^{OSC}_{col})^T ]^T        (61)

In summary, the row-based MSC and MSV statistics are of size 4D = 4×20 = 80, and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u^{NASE}_{MSC-row}(d) = (1/J) Σ_{j=0}^{J−1} MSC^{NASE}(d, j)        (62)

σ^{NASE}_{MSC-row}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSC^{NASE}(d, j) − u^{NASE}_{MSC-row}(d) )^2 ]^{1/2}        (63)

u^{NASE}_{MSV-row}(d) = (1/J) Σ_{j=0}^{J−1} MSV^{NASE}(d, j)        (64)

σ^{NASE}_{MSV-row}(d) = [ (1/J) Σ_{j=0}^{J−1} ( MSV^{NASE}(d, j) − u^{NASE}_{MSV-row}(d) )^2 ]^{1/2}        (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f^{NASE}_{row} = [ u^{NASE}_{MSC-row}(0), σ^{NASE}_{MSC-row}(0), u^{NASE}_{MSV-row}(0), σ^{NASE}_{MSV-row}(0), …, u^{NASE}_{MSC-row}(D−1), σ^{NASE}_{MSC-row}(D−1), u^{NASE}_{MSV-row}(D−1), σ^{NASE}_{MSV-row}(D−1) ]^T        (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{NASE}_{MSC-col}(j) = (1/D) Σ_{d=0}^{D−1} MSC^{NASE}(d, j)        (67)

σ^{NASE}_{MSC-col}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSC^{NASE}(d, j) − u^{NASE}_{MSC-col}(j) )^2 ]^{1/2}        (68)

u^{NASE}_{MSV-col}(j) = (1/D) Σ_{d=0}^{D−1} MSV^{NASE}(d, j)        (69)

σ^{NASE}_{MSV-col}(j) = [ (1/D) Σ_{d=0}^{D−1} ( MSV^{NASE}(d, j) − u^{NASE}_{MSV-col}(j) )^2 ]^{1/2}        (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f^{NASE}_{col} = [ u^{NASE}_{MSC-col}(0), σ^{NASE}_{MSC-col}(0), u^{NASE}_{MSV-col}(0), σ^{NASE}_{MSV-col}(0), …, u^{NASE}_{MSC-col}(J−1), σ^{NASE}_{MSC-col}(J−1), u^{NASE}_{MSV-col}(J−1), σ^{NASE}_{MSV-col}(J−1) ]^T        (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D + 4J) can be obtained:

f^{NASE} = [ (f^{NASE}_{row})^T, (f^{NASE}_{col})^T ]^T        (72)

In summary, the row-based MSC and MSV statistics are of size 4D = 4×19 = 76, and the column-based statistics are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMASE is 76 + 32 = 108.

Fig. 2.8 The row-based modulation spectral feature vector: for each feature dimension, the mean and standard deviation of the MSC and MSV values are computed along the modulation-frequency axis of the texture window.

Fig. 2.9 The column-based modulation spectral feature vector: for each modulation subband, the mean and standard deviation of the MSC and MSV values are computed along the feature-dimension axis of the texture window.

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}        (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of the individual feature values may differ, a linear normalization is applied to get the normalized feature vector f̂_c:

f̂_c(m) = ( f̄_c(m) − f_min(m) ) / ( f_max(m) − f_min(m) ),   1 ≤ c ≤ C        (74)

where C is the number of classes, f̂_c(m) denotes the m-th normalized feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)        (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
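The min-max normalization of eqs. (74)-(75) can be sketched as follows. This is a hypothetical NumPy illustration (the function name `minmax_normalize` is mine, not the thesis's); it returns the per-feature extrema so the same mapping learned on the training set can later be applied to test vectors.

```python
import numpy as np

def minmax_normalize(train_vectors):
    """Linear (min-max) normalization of feature values (cf. eqs. 74-75).

    train_vectors: array of shape (N, M) -- N training feature vectors of
    dimension M.  Returns the normalized vectors plus the per-feature
    min/max needed to normalize unseen vectors with the same mapping.
    """
    f_min = train_vectors.min(axis=0)                    # eq. (75), per-feature minimum
    f_max = train_vectors.max(axis=0)                    # eq. (75), per-feature maximum
    span = np.where(f_max > f_min, f_max - f_min, 1.0)   # guard against zero range
    normalized = (train_vectors - f_min) / span          # eq. (74)
    return normalized, f_min, f_max

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
Xn, lo, hi = minmax_normalize(X)
```

After normalization, every feature dimension of the training set spans exactly [0, 1].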

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among the various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T        (76)

where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T        (77)

where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) )        (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{−1/2}:

w = (ΦΛ^{−1/2})^T x        (79)

It can be shown that the whitened within-class scatter matrix S_Ww = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_Bw = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_Bw. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ        (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x        (81)
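The whitened LDA procedure of eqs. (76)-(81) can be sketched with NumPy's symmetric eigendecomposition. This is an illustrative implementation under my own naming (`whitened_lda` is hypothetical); the small eigenvalue floor is a numerical guard not discussed in the text.

```python
import numpy as np

def whitened_lda(X, labels, n_components):
    """Whitened LDA transform (cf. eqs. 76-81) -- a sketch, not the thesis code.

    X: (N, H) training matrix; labels: (N,) class ids.
    Returns A of shape (H, n_components) so that y = A.T @ x (eq. 81).
    """
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter, eq. (76)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)              # between-class scatter, eq. (77)
    # Whitening: Sw Phi = Phi Lam  ->  W = Phi Lam^{-1/2}   (eq. 79)
    lam, phi = np.linalg.eigh(Sw)
    lam = np.maximum(lam, 1e-10)               # numerical guard against zero eigenvalues
    W = phi @ np.diag(lam ** -0.5)
    Sb_w = W.T @ Sb @ W                        # whitened between-class scatter
    evals, evecs = np.linalg.eigh(Sb_w)
    order = np.argsort(evals)[::-1]            # sort eigenvalues in decreasing order
    Psi = evecs[:, order[:n_components]]       # top (C-1) eigenvectors
    return W @ Psi                             # A_WLDA = Phi Lam^{-1/2} Psi, eq. (80)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.0 * i, 1.0, size=(30, 5)) for i in range(3)])
y_lab = np.repeat([0, 1, 2], 30)
A = whitened_lda(X, y_lab, 2)                  # C = 3 classes -> at most 2 components
```

`np.linalg.eigh` is appropriate here because both scatter matrices are symmetric; it returns eigenvalues in ascending order, hence the explicit descending sort.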

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}        (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, ȳ_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to y:

s = argmin_{1≤c≤C} d(y, ȳ_c)        (83)
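The nearest centroid classifier of eqs. (82)-(83) reduces to a few lines. The following is an illustrative sketch (function names are mine, not the thesis's) operating on already LDA-transformed vectors.

```python
import numpy as np

def train_centroids(Y, labels):
    """Representative vector per genre: the class centroid (cf. eq. 82)."""
    classes = np.unique(labels)
    centroids = np.stack([Y[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify(y, classes, centroids):
    """Nearest-centroid decision under Euclidean distance (cf. eq. 83)."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return classes[np.argmin(dists)]

# Toy data: two well-separated classes in a 2-D transformed space.
Y = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
cls, cen = train_centroids(Y, labels)
```

A test vector is then assigned to the genre whose centroid lies closest.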

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = Σ_{1≤c≤C} P_c · CA_c        (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
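Equation (84) is a class-prior-weighted average of the per-class accuracies. A minimal sketch (illustrative only; the per-class accuracies below are the diagonal of Table 3.6(d) and the class counts are the test-set sizes quoted above):

```python
import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    """Overall accuracy CA = sum_c P_c * CA_c  (cf. eq. 84).

    P_c is estimated as the fraction of test tracks belonging to genre c.
    """
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()                 # class priors P_c
    return float(np.dot(p, per_class_acc))    # weighted sum of CA_c

# ISMIR2004 test-set class sizes and the per-class accuracies of the
# proposed combined feature set (Table 3.6(d) diagonal, as fractions).
counts = [320, 114, 26, 45, 102, 122]
acc = [0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623]
ca = overall_accuracy(acc, counts)
```

With these values, CA evaluates to approximately 0.8532, matching the 85.32% reported for the proposed method.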

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and that the combined feature vector performs best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) of the row-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix gives track counts (columns: actual genre; rows: classified genre) and the second gives the corresponding column-wise percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0      2           0         1     19
Electronic        0          91      0           1         7      6
Jazz              6           0     18           0         0      4
Metal/Punk        2           3      0          36        20      4
Pop/Rock          4          12      5           8        70     14
World            33           8      1           0         4     75
Total           320         114     26          45       102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1      1           0         2     10
Electronic        1          89      1           2        11     11
Jazz              4           0     19           1         1      6
Metal/Punk        0           5      0          32        21      3
Pop/Rock          0          13      3          10        61      8
World            23           6      2           0         6     84
Total           320         114     26          45       102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3      1           0         3     18
Electronic        0          87      1           1         9      5
Jazz              5           4     17           0         0      9
Metal/Punk        0           4      1          36        18      4
Pop/Rock          1          10      3           7        68     13
World            28           6      3           1         4     73
Total           320         114     26          45       102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0      1           0         0      9
Electronic        0          96      1           1         9      9
Jazz              2           1     21           0         0      1
Metal/Punk        0           1      0          34         8      1
Pop/Rock          1           9      2           9        80     16
World            17           7      1           1         5     86
Total           320         114     26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) of the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix gives track counts (columns: actual genre; rows: classified genre) and the second gives the corresponding column-wise percentages.

(a) SMMFCC2 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1      1           0         6     22
Electronic        0          84      0           2         8      4
Jazz             13           1     19           1         2     19
Metal/Punk        2           7      0          39        30      4
Pop/Rock          0          11      3           3        47     19
World            33          10      3           0         9     54
Total           320         114     26          45       102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2      0           0         3     33
Electronic        0          83      0           1         9      6
Jazz             17           1     20           0         6     20
Metal/Punk        1           5      0          33        21      2
Pop/Rock          0          17      4          10        51     10
World            40           6      2           1        12     51
Total           320         114     26          45       102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0      0           0         2     29
Electronic        0          83      0           1         5      2
Jazz              9           3     17           1         2     15
Metal/Punk        1           5      1          35        24      7
Pop/Rock          2          13      1           8        57     15
World            31          10      7           0        12     54
Total           320         114     26          45       102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5      0           0         3     18
Electronic        0          89      0           2         4      4
Jazz              2           3     19           0         1     10
Metal/Punk        2           2      0          38        21      2
Pop/Rock          0          12      5           4        61     11
World            27           3      2           1        12     77
Total           320         114     26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) of the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix gives track counts (columns: actual genre; rows: classified genre) and the second gives the corresponding column-wise percentages.

(a) SMMFCC3 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2      1           0         3     19
Electronic        0          86      0           1         7      5
Jazz              2           0     18           0         0      3
Metal/Punk        1           4      0          35        18      2
Pop/Rock          1          16      4           8        67     13
World            16           6      3           1         7     80
Total           320         114     26          45       102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0      0           0         1     13
Electronic        0          90      1           2         9      6
Jazz              0           0     21           0         0      4
Metal/Punk        0           2      0          31        21      2
Pop/Rock          0          11      3          10        64     10
World            20          11      1           2         7     87
Total           320         114     26          45       102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2      1           0         0     17
Electronic        1          91      0           1         4      3
Jazz              0           2     19           0         0      5
Metal/Punk        0           2      1          34        20      8
Pop/Rock          2          13      4           8        71      8
World            21           4      1           2         7     81
Total           320         114     26          45       102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2      0           0         0      8
Electronic        2          95      0           2         7      9
Jazz              1           1     20           0         0      0
Metal/Punk        0           0      0          35        10      1
Pop/Rock          1          10      3           7        79     11
World            16           6      3           1         6     93
Total           320         114     26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs achieves better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSCs & MSVs and the modulation subband energy (MSE) for each feature value

Feature Set                     MSCs & MSVs     MSE
SMMFCC1                         77.50           72.02
SMMFCC2                         70.64           69.82
SMMFCC3                         80.38           79.15
SMOSC1                          79.15           77.50
SMOSC2                          68.59           70.51
SMOSC3                          81.34           80.11
SMASE1                          77.78           76.41
SMASE2                          71.74           71.06
SMASE3                          81.21           79.15
SMMFCC1+SMOSC1+SMASE1           84.64           85.08
SMMFCC2+SMOSC2+SMASE2           78.60           79.01
SMMFCC3+SMOSC3+SMASE3           85.32           85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features has been proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proc. of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proc. of the IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proc. of Int. Conf. on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proc. of the 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proc. of the 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proc. of the 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Trans. on Multimedia, 7 (6) (2005) 1028-1035.

[13] J. Jose Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, 2003, pp. 8-11.

[14] J. G. A. Barbedo, A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007 (2006) 1-12.

[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 197-200.

[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. of Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. on Speech and Audio Processing, 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Trans. on Signal Processing, 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," Proc. of the IEEE Int. Conf. on Multimedia and Expo (ICME), 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Trans. on Speech and Audio Processing, 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-668.

[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Trans. on Audio, Speech and Language Processing, 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proc. of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.

Page 39: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

33

$$\sigma_{\text{MSV-col}}^{MFCC}(j)=\left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(l,j)-u_{\text{MSV-col}}^{MFCC}(j)\right)^{2}\right)^{1/2} \tag{48}$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{MFCC}=[u_{\text{MSC-col}}^{MFCC}(0),\,\sigma_{\text{MSC-col}}^{MFCC}(0),\,u_{\text{MSV-col}}^{MFCC}(0),\,\sigma_{\text{MSV-col}}^{MFCC}(0),\,\ldots,\,u_{\text{MSC-col}}^{MFCC}(J-1),\,\sigma_{\text{MSC-col}}^{MFCC}(J-1),\,u_{\text{MSV-col}}^{MFCC}(J-1),\,\sigma_{\text{MSV-col}}^{MFCC}(J-1)]^{T} \tag{49}$$

If the row-based and the column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) can be obtained:

$$\mathbf{f}^{MFCC}=[(\mathbf{f}_{row}^{MFCC})^{T}\;(\mathbf{f}_{col}^{MFCC})^{T}]^{T} \tag{50}$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4L+4J; that is, the overall feature dimension of SMMFCC is 80+32 = 112.
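The statistical aggregation above (and its OSC/NASE analogues below) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the thesis' implementation; the function and variable names are invented, and the MSC/MSV matrices are filled with random values only to demonstrate the shapes.

```python
import numpy as np

def aggregate(msc, msv):
    """Statistically aggregate MSC and MSV matrices of shape (D, J):
    D feature dimensions (rows) x J modulation subbands (columns).
    Returns the row-based (4D) and column-based (4J) feature vectors."""
    def stats(m, axis):
        # np.std defaults to the population form (1/N), matching the equations.
        return np.mean(m, axis=axis), np.std(m, axis=axis)
    # Row-based: mean/std across the J modulation subbands.
    u_msc_r, s_msc_r = stats(msc, axis=1)
    u_msv_r, s_msv_r = stats(msv, axis=1)
    # Interleave per dimension: [u_MSC(d), s_MSC(d), u_MSV(d), s_MSV(d), ...]
    f_row = np.column_stack([u_msc_r, s_msc_r, u_msv_r, s_msv_r]).ravel()
    # Column-based: mean/std across the D feature dimensions.
    u_msc_c, s_msc_c = stats(msc, axis=0)
    u_msv_c, s_msv_c = stats(msv, axis=0)
    f_col = np.column_stack([u_msc_c, s_msc_c, u_msv_c, s_msv_c]).ravel()
    return f_row, f_col

msc = np.random.rand(20, 8)   # e.g. 20 MFCC dimensions x 8 modulation subbands
msv = np.random.rand(20, 8)
f_row, f_col = aggregate(msc, msv)
combined = np.concatenate([f_row, f_col])  # 4L + 4J = 80 + 32 = 112 values
print(f_row.shape, f_col.shape, combined.shape)
```

With L = 20 and J = 8 this reproduces the 112-dimensional SMMFCC vector; the same function applied to a 19-row NASE matrix yields the 108-dimensional SMASE vector.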

2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$u_{\text{MSC-row}}^{OSC}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSC^{OSC}(d,j) \tag{51}$$

$$\sigma_{\text{MSC-row}}^{OSC}(d)=\left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(d,j)-u_{\text{MSC-row}}^{OSC}(d)\right)^{2}\right)^{1/2} \tag{52}$$

$$u_{\text{MSV-row}}^{OSC}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSV^{OSC}(d,j) \tag{53}$$

$$\sigma_{\text{MSV-row}}^{OSC}(d)=\left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(d,j)-u_{\text{MSV-row}}^{OSC}(d)\right)^{2}\right)^{1/2} \tag{54}$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{OSC}=[u_{\text{MSC-row}}^{OSC}(0),\,\sigma_{\text{MSC-row}}^{OSC}(0),\,u_{\text{MSV-row}}^{OSC}(0),\,\sigma_{\text{MSV-row}}^{OSC}(0),\,\ldots,\,u_{\text{MSC-row}}^{OSC}(D-1),\,\sigma_{\text{MSC-row}}^{OSC}(D-1),\,u_{\text{MSV-row}}^{OSC}(D-1),\,\sigma_{\text{MSV-row}}^{OSC}(D-1)]^{T} \tag{55}$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{\text{MSC-col}}^{OSC}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSC^{OSC}(d,j) \tag{56}$$

$$\sigma_{\text{MSC-col}}^{OSC}(j)=\left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(d,j)-u_{\text{MSC-col}}^{OSC}(j)\right)^{2}\right)^{1/2} \tag{57}$$

$$u_{\text{MSV-col}}^{OSC}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSV^{OSC}(d,j) \tag{58}$$

$$\sigma_{\text{MSV-col}}^{OSC}(j)=\left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(d,j)-u_{\text{MSV-col}}^{OSC}(j)\right)^{2}\right)^{1/2} \tag{59}$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{OSC}=[u_{\text{MSC-col}}^{OSC}(0),\,\sigma_{\text{MSC-col}}^{OSC}(0),\,u_{\text{MSV-col}}^{OSC}(0),\,\sigma_{\text{MSV-col}}^{OSC}(0),\,\ldots,\,u_{\text{MSC-col}}^{OSC}(J-1),\,\sigma_{\text{MSC-col}}^{OSC}(J-1),\,u_{\text{MSV-col}}^{OSC}(J-1),\,\sigma_{\text{MSV-col}}^{OSC}(J-1)]^{T} \tag{60}$$

If the row-based and the column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{OSC}=[(\mathbf{f}_{row}^{OSC})^{T}\;(\mathbf{f}_{col}^{OSC})^{T}]^{T} \tag{61}$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$u_{\text{MSC-row}}^{NASE}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSC^{NASE}(d,j) \tag{62}$$

$$\sigma_{\text{MSC-row}}^{NASE}(d)=\left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(d,j)-u_{\text{MSC-row}}^{NASE}(d)\right)^{2}\right)^{1/2} \tag{63}$$

$$u_{\text{MSV-row}}^{NASE}(d)=\frac{1}{J}\sum_{j=0}^{J-1}MSV^{NASE}(d,j) \tag{64}$$

$$\sigma_{\text{MSV-row}}^{NASE}(d)=\left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(d,j)-u_{\text{MSV-row}}^{NASE}(d)\right)^{2}\right)^{1/2} \tag{65}$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{NASE}=[u_{\text{MSC-row}}^{NASE}(0),\,\sigma_{\text{MSC-row}}^{NASE}(0),\,u_{\text{MSV-row}}^{NASE}(0),\,\sigma_{\text{MSV-row}}^{NASE}(0),\,\ldots,\,u_{\text{MSC-row}}^{NASE}(D-1),\,\sigma_{\text{MSC-row}}^{NASE}(D-1),\,u_{\text{MSV-row}}^{NASE}(D-1),\,\sigma_{\text{MSV-row}}^{NASE}(D-1)]^{T} \tag{66}$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{\text{MSC-col}}^{NASE}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSC^{NASE}(d,j) \tag{67}$$

$$\sigma_{\text{MSC-col}}^{NASE}(j)=\left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(d,j)-u_{\text{MSC-col}}^{NASE}(j)\right)^{2}\right)^{1/2} \tag{68}$$

$$u_{\text{MSV-col}}^{NASE}(j)=\frac{1}{D}\sum_{d=0}^{D-1}MSV^{NASE}(d,j) \tag{69}$$

$$\sigma_{\text{MSV-col}}^{NASE}(j)=\left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(d,j)-u_{\text{MSV-col}}^{NASE}(j)\right)^{2}\right)^{1/2} \tag{70}$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{NASE}=[u_{\text{MSC-col}}^{NASE}(0),\,\sigma_{\text{MSC-col}}^{NASE}(0),\,u_{\text{MSV-col}}^{NASE}(0),\,\sigma_{\text{MSV-col}}^{NASE}(0),\,\ldots,\,u_{\text{MSC-col}}^{NASE}(J-1),\,\sigma_{\text{MSC-col}}^{NASE}(J-1),\,u_{\text{MSV-col}}^{NASE}(J-1),\,\sigma_{\text{MSV-col}}^{NASE}(J-1)]^{T} \tag{71}$$

If the row-based and the column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) can be obtained:

$$\mathbf{f}^{NASE}=[(\mathbf{f}_{row}^{NASE})^{T}\;(\mathbf{f}_{col}^{NASE})^{T}]^{T} \tag{72}$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.

Fig 2.8 The row-based modulation spectral features: each row d of the MSC and MSV matrices (feature dimension versus modulation frequency over the texture window) is summarized by its mean u_d^row and standard deviation σ_d^row

Fig 2.9 The column-based modulation spectral features: each column j of the MSC and MSV matrices is summarized by its mean u_j^col and standard deviation σ_j^col

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$$\bar{\mathbf{f}}_{c}=\frac{1}{N_{c}}\sum_{n=1}^{N_{c}}\mathbf{f}_{c,n} \tag{73}$$

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

$$\hat{f}_{c}(m)=\frac{\bar{f}_{c}(m)-f_{min}(m)}{f_{max}(m)-f_{min}(m)},\quad 1\le c\le C \tag{74}$$

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{max}(m)=\max_{1\le c\le C,\;1\le j\le N_{c}}f_{c,j}(m),\qquad f_{min}(m)=\min_{1\le c\le C,\;1\le j\le N_{c}}f_{c,j}(m) \tag{75}$$

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
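The linear (min-max) normalization above can be sketched as follows. This is a minimal illustration with invented names, not the thesis' code; the guard against constant feature dimensions is an added assumption.

```python
import numpy as np

def minmax_normalize(train_vectors):
    """Linearly normalize each feature dimension to [0, 1] using the
    minimum/maximum over all training vectors.
    train_vectors: array of shape (N, M) -- N music pieces, M features."""
    f_min = train_vectors.min(axis=0)
    f_max = train_vectors.max(axis=0)
    # Guard against a constant dimension (f_max == f_min), which Eq. (74)
    # leaves undefined; this guard is an assumption, not from the thesis.
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (train_vectors - f_min) / span, f_min, f_max

X = np.array([[2.0, 10.0], [4.0, 30.0], [3.0, 20.0]])
Xn, f_min, f_max = minmax_normalize(X)
print(Xn)  # each column now spans [0, 1]
```

The saved f_min and f_max would then be reused to normalize test vectors with the same mapping.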

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$$\mathbf{S}_{W}=\sum_{c=1}^{C}\sum_{n=1}^{N_{c}}(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_{c})(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_{c})^{T} \tag{76}$$

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_{B}=\sum_{c=1}^{C}N_{c}(\bar{\mathbf{x}}_{c}-\bar{\mathbf{x}})(\bar{\mathbf{x}}_{c}-\bar{\mathbf{x}})^{T} \tag{77}$$

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

$$J_{F}(\mathbf{A})=\mathrm{tr}\left((\mathbf{A}^{T}\mathbf{S}_{W}\mathbf{A})^{-1}(\mathbf{A}^{T}\mathbf{S}_{B}\mathbf{A})\right) \tag{78}$$

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

$$\mathbf{w}=(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{x} \tag{79}$$

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

$$\mathbf{A}_{WLDA}=\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi} \tag{80}$$

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf{y}=\mathbf{A}_{WLDA}^{T}\mathbf{x} \tag{81}$$
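The whitening-then-LDA procedure can be sketched with NumPy's symmetric eigendecomposition. This is an illustrative sketch under the definitions above, not the thesis' implementation; the function name, the random toy data, and the small regularization term added to the eigenvalues are assumptions.

```python
import numpy as np

def whitened_lda(X, y, h):
    """Whitened LDA sketch: whiten with S_W, then keep the leading
    eigenvectors of the whitened between-class scatter.
    X: (N, H) training vectors; y: class labels; h <= C - 1."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H)); Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                            # within-class scatter
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)   # between-class scatter
    lam, Phi = np.linalg.eigh(Sw)                 # Sw = Phi diag(lam) Phi^T
    W = Phi @ np.diag(1.0 / np.sqrt(lam + 1e-10)) # whitening matrix Phi Lam^(-1/2)
    Sb_w = W.T @ Sb @ W                           # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:h]]     # h largest eigenvectors
    return W @ Psi                                # A_WLDA = Phi Lam^(-1/2) Psi

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
A = whitened_lda(X, y, h=1)
Y = X @ A   # per-vector projection y = A^T x
print(Y.shape)
```

On this two-class toy data the single projected dimension separates the classes cleanly, which is what the Fisher criterion is designed to achieve.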

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as the column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

$$\bar{\mathbf{y}}_{c}=\frac{1}{N_{c}}\sum_{n=1}^{N_{c}}\mathbf{y}_{c,n} \tag{82}$$

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to y:

$$s=\arg\min_{1\le c\le C}d(\mathbf{y},\bar{\mathbf{y}}_{c}) \tag{83}$$
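The nearest centroid decision rule can be sketched as follows. This is a minimal illustration with invented names and toy 2-D data, not the thesis' code.

```python
import numpy as np

def nearest_centroid_predict(Y_train, labels, y_test):
    """Nearest-centroid classification: represent each class by the mean
    of its (LDA-transformed) training vectors and pick the class whose
    centroid has minimum Euclidean distance to the test vector."""
    classes = np.unique(labels)
    centroids = np.array([Y_train[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(centroids - y_test, axis=1)
    return classes[np.argmin(dists)]

Y_train = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.8]])
labels = np.array([0, 0, 1, 1])
print(nearest_centroid_predict(Y_train, labels, np.array([1.9, 2.2])))  # -> 1
```

In the actual system, Y_train would hold the whitened-LDA-transformed training vectors and the labels would be the six genre indices.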

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA=\sum_{1\le c\le C}P_{c}\cdot CA_{c} \tag{84}$$

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
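The weighted overall accuracy of Eq. (84) can be computed directly from the test-set class sizes quoted above. The per-class accuracies used here are illustrative placeholder values, not results from the thesis.

```python
# Overall accuracy weighted by class priors P_c, using the ISMIR2004
# test-set class sizes quoted above.
n_test = {"Classical": 320, "Electronic": 114, "Jazz/Blues": 26,
          "Metal/Punk": 45, "Rock/Pop": 102, "World": 122}
# Hypothetical per-class accuracies CA_c (illustrative values only):
ca = {"Classical": 0.94, "Electronic": 0.84, "Jazz/Blues": 0.81,
      "Metal/Punk": 0.76, "Rock/Pop": 0.78, "World": 0.70}
total = sum(n_test.values())                                # 729 test tracks
overall = sum(n_test[c] / total * ca[c] for c in n_test)    # CA = sum_c P_c * CA_c
print(round(overall, 4))
```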

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC1                        77.50
SMOSC1                         79.15
SMASE1                         77.78
SMMFCC1+SMOSC1+SMASE1          84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Rows denote the classified genre and columns the true genre; for each feature set, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         275        0        2        0          1        19
Electronic        0       91        0        1          7         6
Jazz              6        0       18        0          0         4
Metal/Punk        2        3        0       36         20         4
Rock/Pop          4       12        5        8         70        14
World            33        8        1        0          4        75
Total           320      114       26       45        102       122

(a) SMMFCC1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        85.94     0.00     7.69     0.00       0.98    15.57
Electronic      0.00    79.82     0.00     2.22       6.86     4.92
Jazz            1.88     0.00    69.23     0.00       0.00     3.28
Metal/Punk      0.63     2.63     0.00    80.00      19.61     3.28
Rock/Pop        1.25    10.53    19.23    17.78      68.63    11.48
World          10.31     7.02     3.85     0.00       3.92    61.48

(b) SMOSC1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         292        1        1        0          2        10
Electronic        1       89        1        2         11        11
Jazz              4        0       19        1          1         6
Metal/Punk        0        5        0       32         21         3
Rock/Pop          0       13        3       10         61         8
World            23        6        2        0          6        84
Total           320      114       26       45        102       122

(b) SMOSC1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        91.25     0.88     3.85     0.00       1.96     8.20
Electronic      0.31    78.07     3.85     4.44      10.78     9.02
Jazz            1.25     0.00    73.08     2.22       0.98     4.92
Metal/Punk      0.00     4.39     0.00    71.11      20.59     2.46
Rock/Pop        0.00    11.40    11.54    22.22      59.80     6.56
World           7.19     5.26     7.69     0.00       5.88    68.85

(c) SMASE1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         286        3        1        0          3        18
Electronic        0       87        1        1          9         5
Jazz              5        4       17        0          0         9
Metal/Punk        0        4        1       36         18         4
Rock/Pop          1       10        3        7         68        13
World            28        6        3        1          4        73
Total           320      114       26       45        102       122

(c) SMASE1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        89.38     2.63     3.85     0.00       2.94    14.75
Electronic      0.00    76.32     3.85     2.22       8.82     4.10
Jazz            1.56     3.51    65.38     0.00       0.00     7.38
Metal/Punk      0.00     3.51     3.85    80.00      17.65     3.28
Rock/Pop        0.31     8.77    11.54    15.56      66.67    10.66
World           8.75     5.26    11.54     2.22       3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         300        0        1        0          0         9
Electronic        0       96        1        1          9         9
Jazz              2        1       21        0          0         1
Metal/Punk        0        1        0       34          8         1
Rock/Pop          1        9        2        9         80        16
World            17        7        1        1          5        86
Total           320      114       26       45        102       122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        93.75     0.00     3.85     0.00       0.00     7.38
Electronic      0.00    84.21     3.85     2.22       8.82     7.38
Jazz            0.63     0.88    80.77     0.00       0.00     0.82
Metal/Punk      0.00     0.88     0.00    75.56       7.84     0.82
Rock/Pop        0.31     7.89     7.69    20.00      78.43    13.11
World           5.31     6.14     3.85     2.22       4.90    70.49
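As a small sanity check (a script written for this text, not from the thesis), the per-class accuracies and the overall CA of Eq. (84) can be recovered from the count matrix of Table 3.2 (d), since the column sums are the class sizes and the diagonal holds the correctly classified tracks:

```python
import numpy as np

# Confusion counts of Table 3.2 (d): rows = classified genre,
# columns = true genre, column sums = test-set class sizes.
conf = np.array([
    [300,  0,  1,  0,  0,  9],   # Classic
    [  0, 96,  1,  1,  9,  9],   # Electronic
    [  2,  1, 21,  0,  0,  1],   # Jazz
    [  0,  1,  0, 34,  8,  1],   # Metal/Punk
    [  1,  9,  2,  9, 80, 16],   # Rock/Pop
    [ 17,  7,  1,  1,  5, 86],   # World
])
class_sizes = conf.sum(axis=0)               # [320, 114, 26, 45, 102, 122]
per_class_ca = np.diag(conf) / class_sizes   # CA_c: the diagonal of the % matrix
overall = conf.trace() / conf.sum()          # equals sum_c P_c * CA_c (Eq. 84)
print(np.round(per_class_ca * 100, 2))
print(round(overall * 100, 2))               # 84.64, the value reported in Table 3.1
```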

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. Consistent with the row-based results, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                        70.64
SMOSC2                         68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2          78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Rows denote the classified genre and columns the true genre; for each feature set, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         272        1        1        0          6        22
Electronic        0       84        0        2          8         4
Jazz             13        1       19        1          2        19
Metal/Punk        2        7        0       39         30         4
Rock/Pop          0       11        3        3         47        19
World            33       10        3        0          9        54
Total           320      114       26       45        102       122

(a) SMMFCC2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        85.00     0.88     3.85     0.00       5.88    18.03
Electronic      0.00    73.68     0.00     4.44       7.84     3.28
Jazz            4.06     0.88    73.08     2.22       1.96    15.57
Metal/Punk      0.63     6.14     0.00    86.67      29.41     3.28
Rock/Pop        0.00     9.65    11.54     6.67      46.08    15.57
World          10.31     8.77    11.54     0.00       8.82    44.26

(b) SMOSC2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         262        2        0        0          3        33
Electronic        0       83        0        1          9         6
Jazz             17        1       20        0          6        20
Metal/Punk        1        5        0       33         21         2
Rock/Pop          0       17        4       10         51        10
World            40        6        2        1         12        51
Total           320      114       26       45        102       122

(b) SMOSC2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        81.88     1.75     0.00     0.00       2.94    27.05
Electronic      0.00    72.81     0.00     2.22       8.82     4.92
Jazz            5.31     0.88    76.92     0.00       5.88    16.39
Metal/Punk      0.31     4.39     0.00    73.33      20.59     1.64
Rock/Pop        0.00    14.91    15.38    22.22      50.00     8.20
World          12.50     5.26     7.69     2.22      11.76    41.80

(c) SMASE2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         277        0        0        0          2        29
Electronic        0       83        0        1          5         2
Jazz              9        3       17        1          2        15
Metal/Punk        1        5        1       35         24         7
Rock/Pop          2       13        1        8         57        15
World            31       10        7        0         12        54
Total           320      114       26       45        102       122

(c) SMASE2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        86.56     0.00     0.00     0.00       1.96    23.77
Electronic      0.00    72.81     0.00     2.22       4.90     1.64
Jazz            2.81     2.63    65.38     2.22       1.96    12.30
Metal/Punk      0.31     4.39     3.85    77.78      23.53     5.74
Rock/Pop        0.63    11.40     3.85    17.78      55.88    12.30
World           9.69     8.77    26.92     0.00      11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         289        5        0        0          3        18
Electronic        0       89        0        2          4         4
Jazz              2        3       19        0          1        10
Metal/Punk        2        2        0       38         21         2
Rock/Pop          0       12        5        4         61        11
World            27        3        2        1         12        77
Total           320      114       26       45        102       122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        90.31     4.39     0.00     0.00       2.94    14.75
Electronic      0.00    78.07     0.00     4.44       3.92     3.28
Jazz            0.63     2.63    73.08     0.00       0.98     8.20
Metal/Punk      0.63     1.75     0.00    84.44      20.59     1.64
Rock/Pop        0.00    10.53    19.23     8.89      59.80     9.02
World           8.44     2.63     7.69     2.22      11.76    63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Rows denote the classified genre and columns the true genre; for each feature set, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         300        2        1        0          3        19
Electronic        0       86        0        1          7         5
Jazz              2        0       18        0          0         3
Metal/Punk        1        4        0       35         18         2
Rock/Pop          1       16        4        8         67        13
World            16        6        3        1          7        80
Total           320      114       26       45        102       122

(a) SMMFCC3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        93.75     1.75     3.85     0.00       2.94    15.57
Electronic      0.00    75.44     0.00     2.22       6.86     4.10
Jazz            0.63     0.00    69.23     0.00       0.00     2.46
Metal/Punk      0.31     3.51     0.00    77.78      17.65     1.64
Rock/Pop        0.31    14.04    15.38    17.78      65.69    10.66
World           5.00     5.26    11.54     2.22       6.86    65.57

(b) SMOSC3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         300        0        0        0          1        13
Electronic        0       90        1        2          9         6
Jazz              0        0       21        0          0         4
Metal/Punk        0        2        0       31         21         2
Rock/Pop          0       11        3       10         64        10
World            20       11        1        2          7        87
Total           320      114       26       45        102       122

(b) SMOSC3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        93.75     0.00     0.00     0.00       0.98    10.66
Electronic      0.00    78.95     3.85     4.44       8.82     4.92
Jazz            0.00     0.00    80.77     0.00       0.00     3.28
Metal/Punk      0.00     1.75     0.00    68.89      20.59     1.64
Rock/Pop        0.00     9.65    11.54    22.22      62.75     8.20
World           6.25     9.65     3.85     4.44       6.86    71.31

(c) SMASE3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         296        2        1        0          0        17
Electronic        1       91        0        1          4         3
Jazz              0        2       19        0          0         5
Metal/Punk        0        2        1       34         20         8
Rock/Pop          2       13        4        8         71         8
World            21        4        1        2          7        81
Total           320      114       26       45        102       122

(c) SMASE3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        92.50     1.75     3.85     0.00       0.00    13.93
Electronic      0.31    79.82     0.00     2.22       3.92     2.46
Jazz            0.00     1.75    73.08     0.00       0.00     4.10
Metal/Punk      0.00     1.75     3.85    75.56      19.61     6.56
Rock/Pop        0.63    11.40    15.38    17.78      69.61     6.56
World           6.56     3.51     3.85     4.44       6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic         300        2        0        0          0         8
Electronic        2       95        0        2          7         9
Jazz              1        1       20        0          0         0
Metal/Punk        0        0        0       35         10         1
Rock/Pop          1       10        3        7         79        11
World            16        6        3        1          6        93
Total           320      114       26       45        102       122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
              Classic  Electronic  Jazz  Metal/Punk  Rock/Pop  World
Classic        93.75     1.75     0.00     0.00       0.00     6.56
Electronic      0.63    83.33     0.00     4.44       6.86     7.38
Jazz            0.31     0.88    76.92     0.00       0.00     0.00
Metal/Punk      0.00     0.00     0.00    77.78       9.80     0.82
Rock/Pop        0.31     8.77    11.54    15.56      77.45     9.02
World           5.00     5.26    11.54     2.22       5.88    76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC, and likewise for OSC and NASE.

Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation spectral energy (MSE) as the feature values

Feature Set                    MSCs & MSVs    MSE
SMMFCC1                           77.50      72.02
SMMFCC2                           70.64      69.82
SMMFCC3                           80.38      79.15
SMOSC1                            79.15      77.50
SMOSC2                            68.59      70.51
SMOSC3                            81.34      80.11
SMASE1                            77.78      76.41
SMASE2                            71.74      71.06
SMASE3                            81.21      79.15
SMMFCC1+SMOSC1+SMASE1             84.64      85.08
SMMFCC2+SMOSC2+SMASE2             78.60      79.01
SMMFCC3+SMOSC3+SMASE3             85.32      85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, Vol. 10, No. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, Vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, Vol. 7, No. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, Vol. 14, No. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, Vol. 15, No. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, Vol. 7, No. 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, Vol. 13, No. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, Vol. 23, No. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, No. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, No. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, Vol. 65, No. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 119-139, 1997.

Page 40: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

34

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{\text{row}}^{OSC} = [u_{\text{MSC-row}}^{OSC}(0), \sigma_{\text{MSC-row}}^{OSC}(0), u_{\text{MSV-row}}^{OSC}(0), \sigma_{\text{MSV-row}}^{OSC}(0), \ldots, u_{\text{MSC-row}}^{OSC}(D-1), \sigma_{\text{MSC-row}}^{OSC}(D-1), u_{\text{MSV-row}}^{OSC}(D-1), \sigma_{\text{MSV-row}}^{OSC}(D-1)]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{\text{MSC-col}}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \text{MSC}^{OSC}(j, d)   (56)

\sigma_{\text{MSC-col}}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \text{MSC}^{OSC}(j, d) - u_{\text{MSC-col}}^{OSC}(j) \right)^2 \right)^{1/2}   (57)

u_{\text{MSV-col}}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \text{MSV}^{OSC}(j, d)   (58)

\sigma_{\text{MSV-col}}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \text{MSV}^{OSC}(j, d) - u_{\text{MSV-col}}^{OSC}(j) \right)^2 \right)^{1/2}   (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{\text{col}}^{OSC} = [u_{\text{MSC-col}}^{OSC}(0), \sigma_{\text{MSC-col}}^{OSC}(0), u_{\text{MSV-col}}^{OSC}(0), \sigma_{\text{MSV-col}}^{OSC}(0), \ldots, u_{\text{MSC-col}}^{OSC}(J-1), \sigma_{\text{MSC-col}}^{OSC}(J-1), u_{\text{MSV-col}}^{OSC}(J-1), \sigma_{\text{MSV-col}}^{OSC}(J-1)]^T   (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) can be obtained:

f^{OSC} = [(f_{\text{row}}^{OSC})^T, (f_{\text{col}}^{OSC})^T]^T   (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4 × 20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4 × 8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMOSC is 80 + 32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_{\text{MSC-row}}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \text{MSC}^{NASE}(j, d)   (62)

\sigma_{\text{MSC-row}}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \text{MSC}^{NASE}(j, d) - u_{\text{MSC-row}}^{NASE}(d) \right)^2 \right)^{1/2}   (63)

u_{\text{MSV-row}}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} \text{MSV}^{NASE}(j, d)   (64)

\sigma_{\text{MSV-row}}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( \text{MSV}^{NASE}(j, d) - u_{\text{MSV-row}}^{NASE}(d) \right)^2 \right)^{1/2}   (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{\text{row}}^{NASE} = [u_{\text{MSC-row}}^{NASE}(0), \sigma_{\text{MSC-row}}^{NASE}(0), u_{\text{MSV-row}}^{NASE}(0), \sigma_{\text{MSV-row}}^{NASE}(0), \ldots, u_{\text{MSC-row}}^{NASE}(D-1), \sigma_{\text{MSC-row}}^{NASE}(D-1), u_{\text{MSV-row}}^{NASE}(D-1), \sigma_{\text{MSV-row}}^{NASE}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{\text{MSC-col}}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \text{MSC}^{NASE}(j, d)   (67)

\sigma_{\text{MSC-col}}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \text{MSC}^{NASE}(j, d) - u_{\text{MSC-col}}^{NASE}(j) \right)^2 \right)^{1/2}   (68)

u_{\text{MSV-col}}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} \text{MSV}^{NASE}(j, d)   (69)

\sigma_{\text{MSV-col}}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( \text{MSV}^{NASE}(j, d) - u_{\text{MSV-col}}^{NASE}(j) \right)^2 \right)^{1/2}   (70)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{\text{col}}^{NASE} = [u_{\text{MSC-col}}^{NASE}(0), \sigma_{\text{MSC-col}}^{NASE}(0), u_{\text{MSV-col}}^{NASE}(0), \sigma_{\text{MSV-col}}^{NASE}(0), \ldots, u_{\text{MSC-col}}^{NASE}(J-1), \sigma_{\text{MSC-col}}^{NASE}(J-1), u_{\text{MSV-col}}^{NASE}(J-1), \sigma_{\text{MSV-col}}^{NASE}(J-1)]^T   (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D + 4J) can be obtained:

f^{NASE} = [(f_{\text{row}}^{NASE})^T, (f_{\text{col}}^{NASE})^T]^T   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4 × 19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4 × 8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J. That is, the overall feature dimension of SMASE is 76 + 32 = 108.
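The statistical aggregation above can be sketched in NumPy. This is an illustrative sketch, not the author's implementation: the function name and the matrix orientation (feature dimension D by modulation subband J) are assumptions. NumPy's default population standard deviation (ddof = 0) matches the (1/J)Σ and (1/D)Σ forms of the equations.

```python
import numpy as np

def aggregate_modulation_features(msc, msv):
    """Statistical aggregation of MSC/MSV matrices, per the row/column equations.

    msc, msv : arrays of shape (D, J) -- D feature dimensions, J modulation subbands.
    Returns the concatenated row-based (4D) and column-based (4J) feature vector.
    """
    # Row-based statistics: mean/std over the J modulation subbands, one set per dimension d.
    row_stats = np.stack([msc.mean(axis=1), msc.std(axis=1),
                          msv.mean(axis=1), msv.std(axis=1)], axis=1)
    # Column-based statistics: mean/std over the D feature dimensions, one set per subband j.
    col_stats = np.stack([msc.mean(axis=0), msc.std(axis=0),
                          msv.mean(axis=0), msv.std(axis=0)], axis=1)
    # Interleave the four statistics per index, then concatenate: length 4D + 4J.
    return np.concatenate([row_stats.ravel(), col_stats.ravel()])
```

With D = 20 and J = 8 (the OSC case) the result has length 4·20 + 4·8 = 112; with D = 19 (the NASE case) it has length 108.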

Fig. 2.8 The row-based modulation spectral features: for each feature dimension d, the mean u_row(d) and standard deviation σ_row(d) are computed along the modulation frequency (texture window) axis of the MSC and MSV matrices.

Fig. 2.9 The column-based modulation spectral features: for each modulation subband j, the mean u_col(j) and standard deviation σ_col(j) are computed along the feature dimension axis of the MSC and MSV matrices.

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)},  1 ≤ c ≤ C   (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th normalized representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature value over all training music signals:

f_{\max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m),  f_{\min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
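The linear normalization of Eqs. (74)-(75) can be sketched as follows. This is a minimal illustration, with a guard against constant features that the text does not discuss; the per-feature minima and maxima must be kept and reused to normalize test vectors.

```python
import numpy as np

def minmax_normalize(train_features):
    """Linear (min-max) normalization over the training set, per Eqs. (74)-(75).

    train_features : array of shape (N, M) -- N training vectors, M feature values.
    Returns the normalized features plus the per-feature minima/maxima.
    """
    f_min = train_features.min(axis=0)   # f_min(m) over all training signals
    f_max = train_features.max(axis=0)   # f_max(m) over all training signals
    # Guard: if a feature is constant, divide by 1 instead of 0 (assumption, not in the text).
    span = np.where(f_max > f_min, f_max - f_min, 1.0)
    return (train_features - f_min) / span, f_min, f_max
```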

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among the music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = \mathrm{tr}\left( (A^T S_W A)^{-1} (A^T S_B A) \right)   (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

x_w = (ΦΛ^{-1/2})^T x   (79)

It can be shown that the whitened within-class scatter matrix S_{W_w} = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_{B_w} = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B_w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{-1/2} Ψ   (80)

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x   (81)
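The whitened LDA procedure of Eqs. (76)-(80) can be sketched as below. This is a sketch under the assumption that S_W is nonsingular (a small eigenvalue floor, not in the text, guards the degenerate case); the function name is illustrative.

```python
import numpy as np

def whitened_lda(X, y):
    """Whitened LDA transformation matrix, a sketch of Eqs. (76)-(80).

    X : (N, H) training feature vectors; y : (N,) integer class labels 0..C-1.
    Returns A_WLDA of shape (H, C-1); project with Y = X @ A_WLDA (Eq. (81)).
    """
    H = X.shape[1]
    classes = np.unique(y)
    x_bar = X.mean(axis=0)                      # mean of all training vectors
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        xc_bar = Xc.mean(axis=0)
        Sw += (Xc - xc_bar).T @ (Xc - xc_bar)   # within-class scatter, Eq. (76)
        d = (xc_bar - x_bar)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)           # between-class scatter, Eq. (77)
    # Whitening: Sw Phi = Phi Lam, then transform by Phi Lam^{-1/2} (Eq. (79)).
    lam, Phi = np.linalg.eigh(Sw)
    lam = np.maximum(lam, 1e-10)                # floor tiny eigenvalues (assumption)
    W = Phi @ np.diag(lam ** -0.5)
    Sb_w = W.T @ Sb @ W                         # whitened between-class scatter
    evals, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals)[::-1][:len(classes) - 1]  # top C-1 eigenvectors
    return W @ Psi[:, order]                    # A_WLDA = Phi Lam^{-1/2} Psi, Eq. (80)
```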

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
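The nearest centroid decision of Eqs. (82)-(83) can be sketched as follows; function names are illustrative assumptions.

```python
import numpy as np

def genre_centroids(Y, labels):
    """Representative vector per genre: mean of transformed training vectors, Eq. (82)."""
    return np.stack([Y[labels == c].mean(axis=0) for c in np.unique(labels)])

def nearest_centroid_classify(y_vec, centroids):
    """Nearest centroid decision of Eq. (83): index of the genre whose
    representative vector has minimum Euclidean distance to y_vec."""
    return int(np.argmin(np.linalg.norm(centroids - y_vec, axis=1)))
```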

Chapter 3
Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. The music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
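Eq. (84) amounts to a weighted average of the per-class accuracies, with weights given by the class proportions in the test set. A minimal sketch (the function name is illustrative):

```python
def overall_accuracy(per_class_acc, class_counts):
    """Weighted overall accuracy, Eq. (84): CA = sum_c P_c * CA_c.

    per_class_acc : CA_c for each genre.
    class_counts  : number of test tracks per genre, from which the
                    appearance probability P_c = N_c / N is estimated.
    """
    total = sum(class_counts)
    return sum((n / total) * ca for ca, n in zip(per_class_acc, class_counts))
```

For example, with the class sizes of this test set (320, 114, 26, 45, 102, 122), a high accuracy on Classical tracks dominates the weighted sum.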

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix gives track counts and the second the corresponding percentages of each actual-genre column; columns are actual genres, rows are predicted genres.

(a) SMMFCC1, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         275          0    2         0       1    19
Electronic        0         91    0         1       7     6
Jazz              6          0   18         0       0     4
MetalPunk         2          3    0        36      20     4
PopRock           4         12    5         8      70    14
World            33          8    1         0       4    75
Total           320        114   26        45     102   122

(a) SMMFCC1, %
Classic       85.94   0.00   7.69   0.00   0.98  15.57
Electronic     0.00  79.82   0.00   2.22   6.86   4.92
Jazz           1.88   0.00  69.23   0.00   0.00   3.28
MetalPunk      0.63   2.63   0.00  80.00  19.61   3.28
PopRock        1.25  10.53  19.23  17.78  68.63  11.48
World         10.31   7.02   3.85   0.00   3.92  61.48

(b) SMOSC1, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         292          1    1         0       2    10
Electronic        1         89    1         2      11    11
Jazz              4          0   19         1       1     6
MetalPunk         0          5    0        32      21     3
PopRock           0         13    3        10      61     8
World            23          6    2         0       6    84
Total           320        114   26        45     102   122

(b) SMOSC1, %
Classic       91.25   0.88   3.85   0.00   1.96   8.20
Electronic     0.31  78.07   3.85   4.44  10.78   9.02
Jazz           1.25   0.00  73.08   2.22   0.98   4.92
MetalPunk      0.00   4.39   0.00  71.11  20.59   2.46
PopRock        0.00  11.40  11.54  22.22  59.80   6.56
World          7.19   5.26   7.69   0.00   5.88  68.85

(c) SMASE1, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         286          3    1         0       3    18
Electronic        0         87    1         1       9     5
Jazz              5          4   17         0       0     9
MetalPunk         0          4    1        36      18     4
PopRock           1         10    3         7      68    13
World            28          6    3         1       4    73
Total           320        114   26        45     102   122

(c) SMASE1, %
Classic       89.38   2.63   3.85   0.00   2.94  14.75
Electronic     0.00  76.32   3.85   2.22   8.82   4.10
Jazz           1.56   3.51  65.38   0.00   0.00   7.38
MetalPunk      0.00   3.51   3.85  80.00  17.65   3.28
PopRock        0.31   8.77  11.54  15.56  66.67  10.66
World          8.75   5.26  11.54   2.22   3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         300          0    1         0       0     9
Electronic        0         96    1         1       9     9
Jazz              2          1   21         0       0     1
MetalPunk         0          1    0        34       8     1
PopRock           1          9    2         9      80    16
World            17          7    1         1       5    86
Total           320        114   26        45     102   122

(d) SMMFCC1+SMOSC1+SMASE1, %
Classic       93.75   0.00   3.85   0.00   0.00   7.38
Electronic     0.00  84.21   3.85   2.22   8.82   7.38
Jazz           0.63   0.88  80.77   0.00   0.00   0.82
MetalPunk      0.00   0.88   0.00  75.56   7.84   0.82
PopRock        0.31   7.89   7.69  20.00  78.43  13.11
World          5.31   6.14   3.85   2.22   4.90  70.49
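The percentage matrices shown with each confusion matrix are obtained by normalizing every count column by its per-genre total; in particular, the diagonal of the normalized matrix gives the per-class accuracies CA_c used in Eq. (84). A minimal sketch (names are illustrative):

```python
import numpy as np

def per_class_accuracy(confusion):
    """Per-class accuracies from a count confusion matrix.

    confusion : (C, C) array where confusion[p, a] counts tracks of actual
    class a predicted as class p, so each column sums to that genre's
    test-track total. Returns the diagonal divided by the column sums.
    """
    return np.diag(confusion) / confusion.sum(axis=0)
```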

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix gives track counts and the second the corresponding percentages of each actual-genre column; columns are actual genres, rows are predicted genres.

(a) SMMFCC2, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         272          1    1         0       6    22
Electronic        0         84    0         2       8     4
Jazz             13          1   19         1       2    19
MetalPunk         2          7    0        39      30     4
PopRock           0         11    3         3      47    19
World            33         10    3         0       9    54
Total           320        114   26        45     102   122

(a) SMMFCC2, %
Classic       85.00   0.88   3.85   0.00   5.88  18.03
Electronic     0.00  73.68   0.00   4.44   7.84   3.28
Jazz           4.06   0.88  73.08   2.22   1.96  15.57
MetalPunk      0.63   6.14   0.00  86.67  29.41   3.28
PopRock        0.00   9.65  11.54   6.67  46.08  15.57
World         10.31   8.77  11.54   0.00   8.82  44.26

(b) SMOSC2, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         262          2    0         0       3    33
Electronic        0         83    0         1       9     6
Jazz             17          1   20         0       6    20
MetalPunk         1          5    0        33      21     2
PopRock           0         17    4        10      51    10
World            40          6    2         1      12    51
Total           320        114   26        45     102   122

(b) SMOSC2, %
Classic       81.88   1.75   0.00   0.00   2.94  27.05
Electronic     0.00  72.81   0.00   2.22   8.82   4.92
Jazz           5.31   0.88  76.92   0.00   5.88  16.39
MetalPunk      0.31   4.39   0.00  73.33  20.59   1.64
PopRock        0.00  14.91  15.38  22.22  50.00   8.20
World         12.50   5.26   7.69   2.22  11.76  41.80

(c) SMASE2, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         277          0    0         0       2    29
Electronic        0         83    0         1       5     2
Jazz              9          3   17         1       2    15
MetalPunk         1          5    1        35      24     7
PopRock           2         13    1         8      57    15
World            31         10    7         0      12    54
Total           320        114   26        45     102   122

(c) SMASE2, %
Classic       86.56   0.00   0.00   0.00   1.96  23.77
Electronic     0.00  72.81   0.00   2.22   4.90   1.64
Jazz           2.81   2.63  65.38   2.22   1.96  12.30
MetalPunk      0.31   4.39   3.85  77.78  23.53   5.74
PopRock        0.63  11.40   3.85  17.78  55.88  12.30
World          9.69   8.77  26.92   0.00  11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         289          5    0         0       3    18
Electronic        0         89    0         2       4     4
Jazz              2          3   19         0       1    10
MetalPunk         2          2    0        38      21     2
PopRock           0         12    5         4      61    11
World            27          3    2         1      12    77
Total           320        114   26        45     102   122

(d) SMMFCC2+SMOSC2+SMASE2, %
Classic       90.31   4.39   0.00   0.00   2.94  14.75
Electronic     0.00  78.07   0.00   4.44   3.92   3.28
Jazz           0.63   2.63  73.08   0.00   0.98   8.20
MetalPunk      0.63   1.75   0.00  84.44  20.59   1.64
PopRock        0.00  10.53  19.23   8.89  59.80   9.02
World          8.44   2.63   7.69   2.22  11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix gives track counts and the second the corresponding percentages of each actual-genre column; columns are actual genres, rows are predicted genres.

(a) SMMFCC3, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         300          2    1         0       3    19
Electronic        0         86    0         1       7     5
Jazz              2          0   18         0       0     3
MetalPunk         1          4    0        35      18     2
PopRock           1         16    4         8      67    13
World            16          6    3         1       7    80
Total           320        114   26        45     102   122

(a) SMMFCC3, %
Classic       93.75   1.75   3.85   0.00   2.94  15.57
Electronic     0.00  75.44   0.00   2.22   6.86   4.10
Jazz           0.63   0.00  69.23   0.00   0.00   2.46
MetalPunk      0.31   3.51   0.00  77.78  17.65   1.64
PopRock        0.31  14.04  15.38  17.78  65.69  10.66
World          5.00   5.26  11.54   2.22   6.86  65.57

(b) SMOSC3, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         300          0    0         0       1    13
Electronic        0         90    1         2       9     6
Jazz              0          0   21         0       0     4
MetalPunk         0          2    0        31      21     2
PopRock           0         11    3        10      64    10
World            20         11    1         2       7    87
Total           320        114   26        45     102   122

(b) SMOSC3, %
Classic       93.75   0.00   0.00   0.00   0.98  10.66
Electronic     0.00  78.95   3.85   4.44   8.82   4.92
Jazz           0.00   0.00  80.77   0.00   0.00   3.28
MetalPunk      0.00   1.75   0.00  68.89  20.59   1.64
PopRock        0.00   9.65  11.54  22.22  62.75   8.20
World          6.25   9.65   3.85   4.44   6.86  71.31

(c) SMASE3, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         296          2    1         0       0    17
Electronic        1         91    0         1       4     3
Jazz              0          2   19         0       0     5
MetalPunk         0          2    1        34      20     8
PopRock           2         13    4         8      71     8
World            21          4    1         2       7    81
Total           320        114   26        45     102   122

(c) SMASE3, %
Classic       92.50   1.75   3.85   0.00   0.00  13.93
Electronic     0.31  79.82   0.00   2.22   3.92   2.46
Jazz           0.00   1.75  73.08   0.00   0.00   4.10
MetalPunk      0.00   1.75   3.85  75.56  19.61   6.56
PopRock        0.63  11.40  15.38  17.78  69.61   6.56
World          6.56   3.51   3.85   4.44   6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3, counts
            Classic Electronic Jazz MetalPunk PopRock World
Classic         300          2    0         0       0     8
Electronic        2         95    0         2       7     9
Jazz              1          1   20         0       0     0
MetalPunk         0          0    0        35      10     1
PopRock           1         10    3         7      79    11
World            16          6    3         1       6    93
Total           320        114   26        45     102   122

(d) SMMFCC3+SMOSC3+SMASE3, %
Classic       93.75   1.75   0.00   0.00   0.00   6.56
Electronic     0.63  83.33   0.00   4.44   6.86   7.38
Jazz           0.31   0.88  76.92   0.00   0.00   0.00
MetalPunk      0.00   0.00   0.00  77.78   9.80   0.82
PopRock        0.31   8.77  11.54  15.56  77.45   9.02
World          5.00   5.26  11.54   2.22   5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 compares the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC (and analogously for OSC and NASE).

Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus modulation spectral energy (MSE) as the feature values

Feature Set                     MSCs & MSVs    MSE
SMMFCC1                         77.50          72.02
SMMFCC2                         70.64          69.82
SMMFCC3                         80.38          79.15
SMOSC1                          79.15          77.50
SMOSC2                          68.59          70.51
SMOSC3                          81.34          80.11
SMASE1                          77.78          76.41
SMASE2                          71.74          71.06
SMASE3                          81.21          79.15
SMMFCC1+SMOSC1+SMASE1           84.64          85.08
SMMFCC2+SMOSC2+SMASE2           78.60          79.01
SMMFCC3+SMOSC3+SMASE3           85.32          85.19

Chapter 4
Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of the ACM Conference on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, Vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Transactions on Multimedia, Vol. 7, No. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, Vol. 14, No. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, Vol. 7, No. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," Proceedings of the 6th International Conference on Digital Audio Effects, Sep. 2003, pp. 8-11.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, Mar. 2005, pp. 197-200.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representations," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Morris, and J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 2, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," Proceedings of IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, Nov. 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, Sep. 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, Vol. 23, No. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, No. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, Oct. 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Jul. 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, May 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop on Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, Vol. 65, No. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 119-139, 1997.

Page 41: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

35

rived from the d-th (0 le d lt D) row of

ows

feature dimension of SMOSC is 80+32 = 112

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values de

the MSC and MSV matrices of MASE can be computed as foll

)(1)(1

0summinus

=minusrowMSC =

J

j

NASENASE djMSCJ

du (62)

( 2⎟⎟minus NAS

wMSCu (63) )))((1)(21

1

0 ⎠

⎞⎜⎜⎝

⎛= sum

minus

=minusminus

J

j

Ero

NASENASErowMSC ddjMSC

Jdσ

)(1)(1

0summinus

=minus =

J

j

NASENASErowMSV djMSV

Jdu (64)

))() 2⎟⎟minus

NASErowMSV du (65) ((1)(

211

0 ⎠

⎞⎜⎜⎝

⎛minus= sum

minus

=minus

J

j

NASENASErowMSV djMSV

Jdσ

Thus the row-based modulation spectral feature vector of a music track is of size 4D

and can be represented as

)1 Tminusminusminus minusminusminus DDuD NASErowMSV

NASErowMSVrow σ

(66)

Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)

column of the MSC and MSV matrices can be computed as follows

)]1( )1( )1( (

)0( )0( )0( )0([

minus

=

minus

minusminusminusminus

Du

uuNASEMSC

NASErowMSC

NASErowMSV

NASErowMSV

NASErowMSC

NASErowMSC

NASErow

σ

σσ Lf

)(1)(1

0summinus

=minuscolMSC =

D

d

NASENASE djMSCD

ju (67)

))( 2 ⎟⎠

minus minusNASE

colMSC ju (68) )((1)(211

0

⎞⎜⎝

⎛= sum

minus

=minus

D

d

NASENASEcolMSC ljMSC

Djσ

)(1)(1

0summinus

=minus =

D

d

NASENASEcolMSV djMSV

Dju (69)

))() 2 ⎟⎠

minus minusNASE

colMSV ju (70) ((1)(211

0

⎞⎜⎝

⎛= sum

minus

=minus

D

d

NASENASEcolMSV djMSV

Djσ

36

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

Tminusminusminus minusminusminus JJuJ NASEcolMSV

NASEcolMSV

NASEcolMSC σσ

(71)

If the row-based modulation spectral feature vector and column-based

modulation spectral feature vector are combined together a larger feature vector of

d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the

SC r M is

)]1( )1( )1( )1(

)0( )0( )0( )0([

minus

=

minus

minusminusminusminus

Ju

uuNASE

colMSC

NASEcolMSV

NASEcolMSV

NASEcolMSC

NASEcolMSC

NASEcol σσ Lf

size (4D+4J) can be obtained

f NASE= [( NASErowf )T ( NASE

colf )T]T (72)

In summary the row-base

column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the

row-based modulation spectral feature vector and column-based modulation spectral

feature vector will result in a feature vector of length 4L+4J That is the overall

feature dimension of SMASE is 76+32 = 108

37

MSC(1 2) MSV(1 2)

MSC(2 2)MSV(2 2)

MSC(J 2)MSV(J 2)

MSC(2 D) MSV(2 D)

row

row

2

2

σ

μ

Fig 28 the row-based modulation spectral

Fig 29 the column-based modulation spectral

MSC(1D) MSV(1D)

MSC(1 1) MSV(1 1)

MSC(2 1)MSV(2 1)

MSC(J D) MSV(J D)

MSC(J 1)MSV(J 1)

rowD

rowD

σ

μ

row

row

1

1

σ

μ

Modulation Frequency

Texture Window Feature

Dimension

MSC(1D) MSV(1D)

MSC(1 2) MSV(1 2)

MSC(1 1) MSV(1 1)

MSC(2 D) MSV(2 D)

MSC(2 2)MSV(2 2)

MSC(2 1)MSV(2 1)

MSC(J D) MSV(J D)

MSC(J 2) MSV(J 2)

MSC(J 1) MSV(J 1)

Modulation Frequency

Feature Dimension

Texture Window

col

col

1

1

σ

μcol

col

2

2

σ

μ

colJ

colJ

σ

μ

38

216 Feature Vector Normalization

In the training phase, the representative feature vector of a specific music genre is derived by averaging the feature vectors of all training music signals belonging to that genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n} (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature value over all training music signals:

f_{\max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m), \quad f_{\min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m) (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
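Eqs. (73)-(75) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the thesis; `normalize_representatives` is a hypothetical helper name, and the training set is assumed to be stored as one (N, M) array with integer genre labels.

```python
import numpy as np

def normalize_representatives(train_vectors, labels, num_classes):
    """Per-genre representative vectors (Eq. 73) followed by the
    min-max normalization of Eqs. (74)-(75).

    train_vectors: (N, M) array of all training feature vectors;
    labels: length-N integer genre indices.
    """
    f_min = train_vectors.min(axis=0)   # f_min(m), Eq. (75)
    f_max = train_vectors.max(axis=0)   # f_max(m), Eq. (75)
    reps = np.array([train_vectors[labels == c].mean(axis=0)
                     for c in range(num_classes)])        # Eq. (73)
    return (reps - f_min) / (f_max - f_min)               # Eq. (74)

X = np.random.rand(40, 6)                 # 40 training tracks, 6 feature values
labels = np.repeat(np.arange(4), 10)      # 4 genres, 10 tracks each
reps = normalize_representatives(X, labels, 4)
print(reps.shape)                         # (4, 6); every entry lies in [0, 1]
```

Since each per-genre mean lies between the column-wise minimum and maximum, the normalized representative values always fall in [0, 1].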

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr((A^T S_W A)^{-1}(A^T S_B A)) (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

x_w = (ΦΛ^{-1/2})^T x (79)

It can be shown that the whitened within-class scatter matrix S_{W,w} = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_{B,w} = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B,w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = ΦΛ^{-1/2} Ψ (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x (81)
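The whitened LDA procedure of Eqs. (76)-(80) can be sketched as follows. This is an illustrative implementation under the assumptions that S_W is nonsingular (N much larger than H) and that `numpy.linalg.eigh` is used for both symmetric eigendecompositions; `whitened_lda` is a hypothetical helper name.

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """Whitened LDA (Eqs. 76-80), a sketch.

    X: (N, H) training matrix; y: integer class labels in [0, num_classes).
    Assumes the within-class scatter matrix S_W is nonsingular.
    Returns A_WLDA of shape (H, num_classes - 1).
    """
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        dev = Xc - mc
        Sw += dev.T @ dev                       # Eq. (76)
        g = (mc - mean_all)[:, None]
        Sb += len(Xc) * (g @ g.T)               # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)               # S_W Phi = Phi Lambda
    W = Phi @ np.diag(lam ** -0.5)              # whitening map Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                         # whitened between-class scatter
    vals, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(vals)[::-1][:num_classes - 1]]  # top C-1 eigenvectors
    return W @ Psi                              # A_WLDA, Eq. (80)

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 5))
X[20:40] += 2.0   # shift class means so S_B is non-trivial
X[40:] -= 2.0
y = np.repeat(np.arange(3), 20)
A = whitened_lda(X, y, 3)
print(A.shape)    # (5, 2): maps H = 5 dimensions down to C - 1 = 2
```

Each feature vector is then reduced by y = A^T x as in Eq. (81); a useful sanity check is that A^T S_W A equals the identity matrix, which is exactly the whitening property.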

23 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n} (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c) (83)
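Eqs. (82)-(83) amount to a nearest-centroid rule, sketched below. This is an illustrative sketch with hypothetical helper names (`genre_centroids`, `nearest_centroid`), not code from the thesis.

```python
import numpy as np

def genre_centroids(Y, labels, num_classes):
    """Eq. (82): per-genre mean of the whitened-LDA-transformed vectors."""
    return np.array([Y[labels == c].mean(axis=0) for c in range(num_classes)])

def nearest_centroid(y_vec, centroids):
    """Eq. (83): index of the representative vector with minimum
    Euclidean distance to y_vec."""
    return int(np.argmin(np.linalg.norm(centroids - y_vec, axis=1)))

centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(nearest_centroid(np.array([0.9, 1.2]), centroids))   # 1
```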

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
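Because P_c = N_c / N, Eq. (84) reduces to the trace of the confusion matrix divided by the total number of test tracks. A sketch in Python (illustrative, not from the thesis), using the counts of Table 32(d) as a check:

```python
import numpy as np

def overall_accuracy(conf):
    """Eq. (84): CA = sum_c P_c * CA_c.

    conf: confusion matrix whose columns are the actual genres and
    whose rows are the classified genres, as in Tables 32/34/36.
    """
    col_totals = conf.sum(axis=0)            # N_c tracks per actual genre
    per_class = np.diag(conf) / col_totals   # CA_c
    p = col_totals / conf.sum()              # P_c
    return float((p * per_class).sum())      # equals trace(conf) / N

# Counts of Table 32(d), SMMFCC1+SMOSC1+SMASE1:
conf = np.array([[300,  0,  1,  0,  0,  9],
                 [  0, 96,  1,  1,  9,  9],
                 [  2,  1, 21,  0,  0,  1],
                 [  0,  1,  0, 34,  8,  1],
                 [  1,  9,  2,  9, 80, 16],
                 [ 17,  7,  1,  1,  5, 86]])
print(round(overall_accuracy(conf) * 100, 2))   # 84.64, matching Table 31
```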

31 Comparison of row-based modulation spectral feature vectors

Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64

Table 32 Confusion matrices of the row-based modulation spectral feature vectors (rows: classified genre; columns: actual genre): (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Pop/Rock          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Pop/Rock          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Pop/Rock          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Pop/Rock          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49


32 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, however, the combined feature vector again performs best. Table 34 shows the corresponding confusion matrices.

Table 33 Averaged classification accuracy (CA, %) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 34 Confusion matrices of the column-based modulation spectral feature vectors (rows: classified genre; columns: actual genre): (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1     1           0         6     22
Electronic        0          84     0           2         8      4
Jazz             13           1    19           1         2     19
Metal/Punk        2           7     0          39        30      4
Pop/Rock          0          11     3           3        47     19
World            33          10     3           0         9     54
Total           320         114    26          45       102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2     0           0         3     33
Electronic        0          83     0           1         9      6
Jazz             17           1    20           0         6     20
Metal/Punk        1           5     0          33        21      2
Pop/Rock          0          17     4          10        51     10
World            40           6     2           1        12     51
Total           320         114    26          45       102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0     0           0         2     29
Electronic        0          83     0           1         5      2
Jazz              9           3    17           1         2     15
Metal/Punk        1           5     1          35        24      7
Pop/Rock          2          13     1           8        57     15
World            31          10     7           0        12     54
Total           320         114    26          45       102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5     0           0         3     18
Electronic        0          89     0           2         4      4
Jazz              2           3    19           0         1     10
Metal/Punk        2           2     0          38        21      2
Pop/Rock          0          12     5           4        61     11
World            27           3     2           1        12     77
Total           320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

33 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 31 and 33, we can see that each combined feature vector achieves better classification performance than its individual row-based or column-based counterpart. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 36 Confusion matrices of the combined row-based and column-based modulation spectral feature vectors (rows: classified genre; columns: actual genre): (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix gives track counts and the second the corresponding percentages.

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     1           0         3     19
Electronic        0          86     0           1         7      5
Jazz              2           0    18           0         0      3
Metal/Punk        1           4     0          35        18      2
Pop/Rock          1          16     4           8        67     13
World            16           6     3           1         7     80
Total           320         114    26          45       102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     0           0         1     13
Electronic        0          90     1           2         9      6
Jazz              0           0    21           0         0      4
Metal/Punk        0           2     0          31        21      2
Pop/Rock          0          11     3          10        64     10
World            20          11     1           2         7     87
Total           320         114    26          45       102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2     1           0         0     17
Electronic        1          91     0           1         4      3
Jazz              0           2    19           0         0      5
Metal/Punk        0           2     1          34        20      8
Pop/Rock          2          13     4           8        71      8
World            21           4     1           2         7     81
Total           320         114    26          45       102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     0           0         0      8
Electronic        2          95     0           2         7      9
Jazz              1           1    20           0         0      0
Metal/Punk        0           0     0          35        10      1
Pop/Rock          1          10     3           7        79     11
World            16           6     3           1         6     93
Total           320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value, whereas we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus modulation spectral energy (MSE) as the feature values

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech, and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.
[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, September 2003, pp. 8-11.
[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007 (2006) 1-12.
[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 197-200.
[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histograms in audio and symbolic music information retrieval, Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), July 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, May 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech, and Language Processing 15 (4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.

Page 42: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

36

Thus the column-based modulation spectral feature vector of a music track is of size

4J and can be represented as

Tminusminusminus minusminusminus JJuJ NASEcolMSV

NASEcolMSV

NASEcolMSC σσ

(71)

If the row-based modulation spectral feature vector and column-based

modulation spectral feature vector are combined together a larger feature vector of

d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the

SC r M is

)]1( )1( )1( )1(

)0( )0( )0( )0([

minus

=

minus

minusminusminusminus

Ju

uuNASE

colMSC

NASEcolMSV

NASEcolMSV

NASEcolMSC

NASEcolMSC

NASEcol σσ Lf

size (4D+4J) can be obtained

f NASE= [( NASErowf )T ( NASE

colf )T]T (72)

In summary the row-base

column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the

row-based modulation spectral feature vector and column-based modulation spectral

feature vector will result in a feature vector of length 4L+4J That is the overall

feature dimension of SMASE is 76+32 = 108

37

MSC(1 2) MSV(1 2)

MSC(2 2)MSV(2 2)

MSC(J 2)MSV(J 2)

MSC(2 D) MSV(2 D)

row

row

2

2

σ

μ

Fig 28 the row-based modulation spectral

Fig 29 the column-based modulation spectral

MSC(1D) MSV(1D)

MSC(1 1) MSV(1 1)

MSC(2 1)MSV(2 1)

MSC(J D) MSV(J D)

MSC(J 1)MSV(J 1)

rowD

rowD

σ

μ

row

row

1

1

σ

μ

Modulation Frequency

Texture Window Feature

Dimension

MSC(1D) MSV(1D)

MSC(1 2) MSV(1 2)

MSC(1 1) MSV(1 1)

MSC(2 D) MSV(2 D)

MSC(2 2)MSV(2 2)

MSC(2 1)MSV(2 1)

MSC(J D) MSV(J D)

MSC(J 2) MSV(J 2)

MSC(J 1) MSV(J 1)

Modulation Frequency

Feature Dimension

Texture Window

col

col

1

1

σ

μcol

col

2

2

σ

μ

colJ

colJ

σ

μ

38

216 Feature Vector Normalization

In the training phase the representative feature vector for a specific music genre

is derived by averaging the feature vectors for the whole set of training music signals

of the same genre

sum=

=cN

nnc

cc N 1

1 ff (73)

where denotes the feature vector of the n-th music signal belonging to the c-th

music genre

ncf

cf is the representative feature vector for the c-th music genre and Nc

is the number of training music signals belonging to the c-th music genre Since the

dynamic ranges for variant feature values may be different a linear normalization is

applied to get the normalized feature vector cf

)()()()()(ˆ

minmax

min

mfmfmfmfmf c

c minusminus

= Cc lele1 (74)

where C is the number of classes denotes the m-th feature value of the c-th

representative feature vector and denote respectively the

maximum and minimum of the m-th feature values of all training music signals

)(ˆ mfc

)(max mf )(min mf

(75) )(min)(

)(max)(

11min

11max

mfmf

mfmf

cjNjCc

cjNjCc

c

c

lelelele

lelelele

=

=

where denotes the m-th feature value of the j-th training music piece

belonging to the c-th music genre

)(mfcj

22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification

39

accuracy at a lower dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximize the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let SW and SB denote the within-class scatter matrix and between-class scatter

matrix respectively The within-class scatter matrix is defined as

)()(1

T

1sumsum

= =

minusminus=C

ccnc

N

ncnc

c

xxxxSW (76)

where xcn is the n-th feature vector labeled as class c cx is the mean vector of class

c C is the total number of music classes and Nc is the number of training vectors

labeled as class c The between-class scatter matrix is given by

))((1

Tsum=

minusminus=C

ccccN xxxxSB (77)

where x is the mean vector of all training vectors The most widely used

transformation matrix is a linear mapping that maximizes the so-called Fisher

criterion JF defined as the ratio of between-class scatter to within-class scatter

(78) ))()(()( T1T ASAASAA BWminus= trJ F

From the above equation we can see that LDA tries to find a transformation matrix

that maximizes the ratio of between-class scatter to within-class scatter in a

lower-dimensional space In this study a whitening procedure is intergrated with LDA

transformation such that the multivariate normal distribution of the set of training

vectors becomes a spherical one [23] First the eigenvalues and corresponding

eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the

40

orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the

corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then

whitening transformed by ΦΛ-12

(79) )( T21 xΦΛx minus=w

It can be shown that the whitened within-class scatter matrix

derived from all the whitened training vectors will

become an identity matrix I Thus the whitened between-class scatter matrix

contains all the discriminative information A

transformation matrix Ψ can be determined by finding the eigenvectors of

Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors

corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the

transformation matrix Ψ Finally the optimal whitened LDA transformation matrix

A

)()( 21T21 minusminus= ΦΛSΦΛS WWw

)()( 21T21 minusminus= ΦΛSΦΛS BBw

wBS

WLDA is defined as

(80) WLDA ΨΦΛA 21minus=

AWLDA will be employed to transform each H-dimensional feature vector to be a lower

h-dimensional vector Let x denote the H-dimensional feature vector the reduced

h-dimensional feature vector can be computed by

(81) T xAy WLDA=

23 Music Genre Classification Phase

In the classification phase the row-based as well as column-based modulation

spectral feature vectors are first extracted from each input music track The same

linear normalization process is applied to each feature value The normalized feature

vector is then transformed to be a lower-dimensional feature vector by using the

whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA

41

transformed feature vector In this study the nearest centroid classifier is used for

music genre classification For the c-th (1 le c le C) music genre the centroid of

whitened LDA transformed feature vectors of all training music tracks labeled as the

c-th music genre is regarded as its representative feature vector

sum=

=cN

nnc

cc N 1

1 yy (82)

where ycn denotes the whitened LDA transformed feature vector of the n-th music

track labeled as the c-th music genre cy is the representative feature vector of the

c-th music genre and Nc is the number of training music tracks labeled as the c-th

music genre The distance between two feature vectors is measured by Euclidean

distance Thus the subject code s that denotes the identified music genre is

determined by finding the representative feature vector that has minimum Euclidean

distance to y

) (minarg1

cCc

ds yylele

= (83)

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]

was used for performance comparison The database consists of 1458 music tracks in

which 729 music tracks are used for training and the other 729 tracks for testing The

audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this

study each MP3 audio file is first converted into raw digital audio before

classification These music tracks are classified into six classes (that is C = 6)

Classical Electronic JazzBlue MetalPunk RockPop and World In summary the

42

music tracks used for trainingtesting include 320320 tracks of Classical 115114

tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102

tracks of RockPop and 122122 tracks of World music genre

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

(84) 1sumlele

sdot=Cc

cc CAPCA

where Pc is the probability of appearance of the c-th music genre CAc is the

classification accuracy for the c-th music genre

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1

denote respectively the row-based modulation spectral feature vector derived form

modulation spectral analysis of MFCC OSC and NASE From table 31 we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1

and the combined feature vector performs the best Table 32 show the corresponding

confusion matrices

Table 31 Averaged classification accuracy (CA ) for row-based modulation

Feature Set CA

SMMFCC1 7750 SMOSC1 7915 SMASE1 7778

SMMFCC1+SMOSC1+SMASE1 8464

43

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Rows are the classified genres and columns the actual genres; each cell gives the number of tracks, with the percentage of the column total in parentheses.

(a) SMMFCC1
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    275 (85.94)    0 ( 0.00)    2 ( 7.69)    0 ( 0.00)    1 ( 0.98)   19 (15.57)
Electronic     0 ( 0.00)   91 (79.82)    0 ( 0.00)    1 ( 2.22)    7 ( 6.86)    6 ( 4.92)
Jazz/Blues     6 ( 1.88)    0 ( 0.00)   18 (69.23)    0 ( 0.00)    0 ( 0.00)    4 ( 3.28)
Metal/Punk     2 ( 0.63)    3 ( 2.63)    0 ( 0.00)   36 (80.00)   20 (19.61)    4 ( 3.28)
Rock/Pop       4 ( 1.25)   12 (10.53)    5 (19.23)    8 (17.78)   70 (68.63)   14 (11.48)
World         33 (10.31)    8 ( 7.02)    1 ( 3.85)    0 ( 0.00)    4 ( 3.92)   75 (61.48)
Total        320          114           26           45          102          122

(b) SMOSC1
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    292 (91.25)    1 ( 0.88)    1 ( 3.85)    0 ( 0.00)    2 ( 1.96)   10 ( 8.20)
Electronic     1 ( 0.31)   89 (78.07)    1 ( 3.85)    2 ( 4.44)   11 (10.78)   11 ( 9.02)
Jazz/Blues     4 ( 1.25)    0 ( 0.00)   19 (73.08)    1 ( 2.22)    1 ( 0.98)    6 ( 4.92)
Metal/Punk     0 ( 0.00)    5 ( 4.39)    0 ( 0.00)   32 (71.11)   21 (20.59)    3 ( 2.46)
Rock/Pop       0 ( 0.00)   13 (11.40)    3 (11.54)   10 (22.22)   61 (59.80)    8 ( 6.56)
World         23 ( 7.19)    6 ( 5.26)    2 ( 7.69)    0 ( 0.00)    6 ( 5.88)   84 (68.85)
Total        320          114           26           45          102          122

(c) SMASE1
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    286 (89.38)    3 ( 2.63)    1 ( 3.85)    0 ( 0.00)    3 ( 2.94)   18 (14.75)
Electronic     0 ( 0.00)   87 (76.32)    1 ( 3.85)    1 ( 2.22)    9 ( 8.82)    5 ( 4.10)
Jazz/Blues     5 ( 1.56)    4 ( 3.51)   17 (65.38)    0 ( 0.00)    0 ( 0.00)    9 ( 7.38)
Metal/Punk     0 ( 0.00)    4 ( 3.51)    1 ( 3.85)   36 (80.00)   18 (17.65)    4 ( 3.28)
Rock/Pop       1 ( 0.31)   10 ( 8.77)    3 (11.54)    7 (15.56)   68 (66.67)   13 (10.66)
World         28 ( 8.75)    6 ( 5.26)    3 (11.54)    1 ( 2.22)    4 ( 3.92)   73 (59.84)
Total        320          114           26           45          102          122

(d) SMMFCC1+SMOSC1+SMASE1
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    300 (93.75)    0 ( 0.00)    1 ( 3.85)    0 ( 0.00)    0 ( 0.00)    9 ( 7.38)
Electronic     0 ( 0.00)   96 (84.21)    1 ( 3.85)    1 ( 2.22)    9 ( 8.82)    9 ( 7.38)
Jazz/Blues     2 ( 0.63)    1 ( 0.88)   21 (80.77)    0 ( 0.00)    0 ( 0.00)    1 ( 0.82)
Metal/Punk     0 ( 0.00)    1 ( 0.88)    0 ( 0.00)   34 (75.56)    8 ( 7.84)    1 ( 0.82)
Rock/Pop       1 ( 0.31)    9 ( 7.89)    2 ( 7.69)    9 (20.00)   80 (78.43)   16 (13.11)
World         17 ( 5.31)    7 ( 6.14)    1 ( 3.85)    1 ( 2.22)    5 ( 4.90)   86 (70.49)
Total        320          114           26           45          102          122


3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE, respectively. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. Consistent with the row-based result, however, the combined feature vector again performs the best. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                        70.64
SMOSC2                         68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2          78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Rows are the classified genres and columns the actual genres; each cell gives the number of tracks, with the percentage of the column total in parentheses.

(a) SMMFCC2
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    272 (85.00)    1 ( 0.88)    1 ( 3.85)    0 ( 0.00)    6 ( 5.88)   22 (18.03)
Electronic     0 ( 0.00)   84 (73.68)    0 ( 0.00)    2 ( 4.44)    8 ( 7.84)    4 ( 3.28)
Jazz/Blues    13 ( 4.06)    1 ( 0.88)   19 (73.08)    1 ( 2.22)    2 ( 1.96)   19 (15.57)
Metal/Punk     2 ( 0.63)    7 ( 6.14)    0 ( 0.00)   39 (86.67)   30 (29.41)    4 ( 3.28)
Rock/Pop       0 ( 0.00)   11 ( 9.65)    3 (11.54)    3 ( 6.67)   47 (46.08)   19 (15.57)
World         33 (10.31)   10 ( 8.77)    3 (11.54)    0 ( 0.00)    9 ( 8.82)   54 (44.26)
Total        320          114           26           45          102          122

(b) SMOSC2
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    262 (81.88)    2 ( 1.75)    0 ( 0.00)    0 ( 0.00)    3 ( 2.94)   33 (27.05)
Electronic     0 ( 0.00)   83 (72.81)    0 ( 0.00)    1 ( 2.22)    9 ( 8.82)    6 ( 4.92)
Jazz/Blues    17 ( 5.31)    1 ( 0.88)   20 (76.92)    0 ( 0.00)    6 ( 5.88)   20 (16.39)
Metal/Punk     1 ( 0.31)    5 ( 4.39)    0 ( 0.00)   33 (73.33)   21 (20.59)    2 ( 1.64)
Rock/Pop       0 ( 0.00)   17 (14.91)    4 (15.38)   10 (22.22)   51 (50.00)   10 ( 8.20)
World         40 (12.50)    6 ( 5.26)    2 ( 7.69)    1 ( 2.22)   12 (11.76)   51 (41.80)
Total        320          114           26           45          102          122

(c) SMASE2
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    277 (86.56)    0 ( 0.00)    0 ( 0.00)    0 ( 0.00)    2 ( 1.96)   29 (23.77)
Electronic     0 ( 0.00)   83 (72.81)    0 ( 0.00)    1 ( 2.22)    5 ( 4.90)    2 ( 1.64)
Jazz/Blues     9 ( 2.81)    3 ( 2.63)   17 (65.38)    1 ( 2.22)    2 ( 1.96)   15 (12.30)
Metal/Punk     1 ( 0.31)    5 ( 4.39)    1 ( 3.85)   35 (77.78)   24 (23.53)    7 ( 5.74)
Rock/Pop       2 ( 0.63)   13 (11.40)    1 ( 3.85)    8 (17.78)   57 (55.88)   15 (12.30)
World         31 ( 9.69)   10 ( 8.77)    7 (26.92)    0 ( 0.00)   12 (11.76)   54 (44.26)
Total        320          114           26           45          102          122

(d) SMMFCC2+SMOSC2+SMASE2
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    289 (90.31)    5 ( 4.39)    0 ( 0.00)    0 ( 0.00)    3 ( 2.94)   18 (14.75)
Electronic     0 ( 0.00)   89 (78.07)    0 ( 0.00)    2 ( 4.44)    4 ( 3.92)    4 ( 3.28)
Jazz/Blues     2 ( 0.63)    3 ( 2.63)   19 (73.08)    0 ( 0.00)    1 ( 0.98)   10 ( 8.20)
Metal/Punk     2 ( 0.63)    2 ( 1.75)    0 ( 0.00)   38 (84.44)   21 (20.59)    2 ( 1.64)
Rock/Pop       0 ( 0.00)   12 (10.53)    5 (19.23)    4 ( 8.89)   61 (59.80)   11 ( 9.02)
World         27 ( 8.44)    3 ( 2.63)    2 ( 7.69)    1 ( 2.22)   12 (11.76)   77 (63.11)
Total        320          114           26           45          102          122

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote the combined feature vectors of MFCC, OSC, and NASE, respectively. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Rows are the classified genres and columns the actual genres; each cell gives the number of tracks, with the percentage of the column total in parentheses.

(a) SMMFCC3
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    300 (93.75)    2 ( 1.75)    1 ( 3.85)    0 ( 0.00)    3 ( 2.94)   19 (15.57)
Electronic     0 ( 0.00)   86 (75.44)    0 ( 0.00)    1 ( 2.22)    7 ( 6.86)    5 ( 4.10)
Jazz/Blues     2 ( 0.63)    0 ( 0.00)   18 (69.23)    0 ( 0.00)    0 ( 0.00)    3 ( 2.46)
Metal/Punk     1 ( 0.31)    4 ( 3.51)    0 ( 0.00)   35 (77.78)   18 (17.65)    2 ( 1.64)
Rock/Pop       1 ( 0.31)   16 (14.04)    4 (15.38)    8 (17.78)   67 (65.69)   13 (10.66)
World         16 ( 5.00)    6 ( 5.26)    3 (11.54)    1 ( 2.22)    7 ( 6.86)   80 (65.57)
Total        320          114           26           45          102          122

(b) SMOSC3
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    300 (93.75)    0 ( 0.00)    0 ( 0.00)    0 ( 0.00)    1 ( 0.98)   13 (10.66)
Electronic     0 ( 0.00)   90 (78.95)    1 ( 3.85)    2 ( 4.44)    9 ( 8.82)    6 ( 4.92)
Jazz/Blues     0 ( 0.00)    0 ( 0.00)   21 (80.77)    0 ( 0.00)    0 ( 0.00)    4 ( 3.28)
Metal/Punk     0 ( 0.00)    2 ( 1.75)    0 ( 0.00)   31 (68.89)   21 (20.59)    2 ( 1.64)
Rock/Pop       0 ( 0.00)   11 ( 9.65)    3 (11.54)   10 (22.22)   64 (62.75)   10 ( 8.20)
World         20 ( 6.25)   11 ( 9.65)    1 ( 3.85)    2 ( 4.44)    7 ( 6.86)   87 (71.31)
Total        320          114           26           45          102          122

(c) SMASE3
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    296 (92.50)    2 ( 1.75)    1 ( 3.85)    0 ( 0.00)    0 ( 0.00)   17 (13.93)
Electronic     1 ( 0.31)   91 (79.82)    0 ( 0.00)    1 ( 2.22)    4 ( 3.92)    3 ( 2.46)
Jazz/Blues     0 ( 0.00)    2 ( 1.75)   19 (73.08)    0 ( 0.00)    0 ( 0.00)    5 ( 4.10)
Metal/Punk     0 ( 0.00)    2 ( 1.75)    1 ( 3.85)   34 (75.56)   20 (19.61)    8 ( 6.56)
Rock/Pop       2 ( 0.63)   13 (11.40)    4 (15.38)    8 (17.78)   71 (69.61)    8 ( 6.56)
World         21 ( 6.56)    4 ( 3.51)    1 ( 3.85)    2 ( 4.44)    7 ( 6.86)   81 (66.39)
Total        320          114           26           45          102          122

(d) SMMFCC3+SMOSC3+SMASE3
             Classical     Electronic   Jazz/Blues   Metal/Punk   Rock/Pop     World
Classical    300 (93.75)    2 ( 1.75)    0 ( 0.00)    0 ( 0.00)    0 ( 0.00)    8 ( 6.56)
Electronic     2 ( 0.63)   95 (83.33)    0 ( 0.00)    2 ( 4.44)    7 ( 6.86)    9 ( 7.38)
Jazz/Blues     1 ( 0.31)    1 ( 0.88)   20 (76.92)    0 ( 0.00)    0 ( 0.00)    0 ( 0.00)
Metal/Punk     0 ( 0.00)    0 ( 0.00)    0 ( 0.00)   35 (77.78)   10 ( 9.80)    1 ( 0.82)
Rock/Pop       1 ( 0.31)   10 ( 8.77)    3 (11.54)    7 (15.56)   79 (77.45)   11 ( 9.02)
World         16 ( 5.00)    6 ( 5.26)    3 (11.54)    1 ( 2.22)    6 ( 5.88)   93 (76.23)
Total        320          114           26           45          102          122

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC, respectively.
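The difference between the two kinds of subband features can be sketched as follows. This toy function contrasts the conventional subband energy with a simple peak/valley pair per subband; the exact MSC/MSV definitions used in this thesis are given in Chapter 2, and the magnitudes and band edges below are made up for illustration.

```python
# Toy contrast between the conventional subband energy feature and a
# simple peak/valley pair per modulation subband (a stand-in for MSC/MSV).
def subband_features(mag, bands):
    """mag: modulation-spectrum magnitudes; bands: (lo, hi) index pairs."""
    feats = []
    for lo, hi in bands:
        sub = mag[lo:hi]
        energy = sum(x * x for x in sub)   # conventional feature value
        peak, valley = max(sub), min(sub)  # contrast/valley style pair
        feats.append((energy, peak, valley))
    return feats

print(subband_features([4.0, 1.0, 3.0, 0.5], [(0, 2), (2, 4)]))
# [(17.0, 4.0, 1.0), (9.25, 3.0, 0.5)]
```

The peak/valley pair keeps the spread of the subband rather than collapsing it to a single energy, which is the intuition behind preferring MSCs and MSVs.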


Table 3.7 Comparison of the averaged classification accuracy (%) obtained with the MSC & MSV features and with the modulation subband energy (MSE) as the feature values

Feature Set                    MSCs & MSVs    MSE
SMMFCC1                        77.50          72.02
SMMFCC2                        70.64          69.82
SMMFCC3                        80.38          79.15
SMOSC1                         79.15          77.50
SMOSC2                         68.59          70.51
SMOSC3                         81.34          80.11
SMASE1                         77.78          76.41
SMASE2                         71.74          71.06
SMASE3                         81.21          79.15
SMMFCC1+SMOSC1+SMASE1          84.64          85.08
SMMFCC2+SMOSC2+SMASE2          78.60          79.01
SMMFCC3+SMOSC3+SMASE3          85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features has been proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of the ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


Fig. 2.8 The row-based modulation spectral feature vector. The MSC(j, d) and MSV(j, d) values of a texture window form a grid indexed by modulation frequency (1 ≤ j ≤ J) and feature dimension (1 ≤ d ≤ D); for each feature dimension d, the mean μ_d^row and standard deviation σ_d^row are computed along the modulation-frequency axis.

Fig. 2.9 The column-based modulation spectral feature vector. Over the same MSC/MSV grid, for each modulation subband j the mean μ_j^col and standard deviation σ_j^col are computed along the feature-dimension axis.


2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n},  (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},  1 \le c \le C,  (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th normalized representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values over all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m),
f_{min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m),  (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
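The min-max normalization of Eqs. (74)-(75) can be sketched in a few lines of Python; feature vectors are plain lists here, and the numbers are illustrative only, not real spectral features.

```python
# Sketch of the linear (min-max) normalization of Eqs. (74)-(75): each
# feature dimension m is scaled by the min/max observed over all training
# feature vectors.
def minmax_stats(train):
    """Per-dimension (min, max) over all training feature vectors."""
    dims = range(len(train[0]))
    lo = [min(v[m] for v in train) for m in dims]
    hi = [max(v[m] for v in train) for m in dims]
    return lo, hi

def normalize(vec, lo, hi):
    # Guard against a constant dimension (hi == lo) to avoid division by zero.
    return [(x - l) / (h - l) if h > l else 0.0
            for x, l, h in zip(vec, lo, hi)]

train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]   # made-up training vectors
lo, hi = minmax_stats(train)
print(normalize([2.0, 30.0], lo, hi))             # [0.5, 1.0]
```

Each normalized feature value then lies in [0, 1] for vectors inside the training range, which keeps dimensions with large dynamic ranges from dominating the distance computation.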

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T,  (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T,  (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr((A^T S_W A)^{-1} (A^T S_B A)).  (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = Φ Λ. Each training vector x is then whitening transformed by Φ Λ^{-1/2}:

w = (Φ Λ^{-1/2})^T x.  (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (Φ Λ^{-1/2})^T S_W (Φ Λ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (Φ Λ^{-1/2})^T S_B (Φ Λ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = Φ Λ^{-1/2} Ψ.  (80)

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x.  (81)
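The whitening-plus-LDA recipe of Eqs. (76)-(81) can be sketched with NumPy (assumed available). The Gaussian toy data, class count, and dimensions below are made up for illustration and stand in for the real music feature vectors.

```python
import numpy as np

# Synthetic 3-class, 4-dimensional data standing in for real feature vectors.
rng = np.random.default_rng(0)
C, Nc, H = 3, 40, 4
class_means = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, size=(Nc, H)) for m in class_means])
labels = np.repeat(np.arange(C), Nc)

mean_all = X.mean(axis=0)
Sw = np.zeros((H, H))
Sb = np.zeros((H, H))
for c in range(C):
    Xc = X[labels == c]
    d = Xc - Xc.mean(axis=0)
    Sw += d.T @ d                               # Eq. (76)
    diff = Xc.mean(axis=0) - mean_all
    Sb += Nc * np.outer(diff, diff)             # Eq. (77)

lam, Phi = np.linalg.eigh(Sw)                   # Sw = Phi diag(lam) Phi^T
W = Phi @ np.diag(lam ** -0.5)                  # whitening matrix Phi Lambda^{-1/2}
Sb_w = W.T @ Sb @ W                             # whitened between-class scatter
ev, Psi = np.linalg.eigh(Sb_w)
Psi = Psi[:, ::-1][:, :C - 1]                   # eigenvectors of the C-1 largest eigenvalues
A_wlda = W @ Psi                                # Eq. (80)
Y = X @ A_wlda                                  # Eq. (81), row-vector convention
print(Y.shape)                                  # (120, 2)
```

By construction, A_wlda^T S_W A_wlda is the (C−1)-dimensional identity, i.e. the within-class scatter is whitened, so all remaining discriminative information sits in the between-class directions kept by Ψ.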

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n},  (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c).  (83)
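The nearest centroid rule of Eqs. (82)-(83) amounts to a few lines of code; this minimal sketch uses made-up 2-D vectors in place of real whitened-LDA outputs.

```python
# Minimal nearest-centroid classifier (Eqs. 82-83) with Euclidean distance.
def centroid(vectors):
    """Eq. (82): componentwise mean of a class's training vectors."""
    dims = range(len(vectors[0]))
    return [sum(v[m] for v in vectors) / len(vectors) for m in dims]

def classify(y, centroids):
    """Eq. (83): index of the centroid nearest to y (squared distance
    suffices, since the square root preserves ordering)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centroids)), key=lambda c: dist2(y, centroids[c]))

# Toy training vectors for two classes (stand-ins for transformed features).
train = {0: [[0.0, 0.1], [0.2, -0.1]], 1: [[2.0, 2.1], [1.8, 1.9]]}
cents = [centroid(train[c]) for c in sorted(train)]
print(classify([1.9, 2.0], cents))   # 1
```

A test vector is simply assigned the genre whose representative (centroid) vector lies closest in the transformed space.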

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before

study each MP3 audio file is first converted into raw digital audio before

classification These music tracks are classified into six classes (that is C = 6)

Classical Electronic JazzBlue MetalPunk RockPop and World In summary the

42

music tracks used for trainingtesting include 320320 tracks of Classical 115114

tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102

tracks of RockPop and 122122 tracks of World music genre

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

(84) 1sumlele

sdot=Cc

cc CAPCA

where Pc is the probability of appearance of the c-th music genre CAc is the

classification accuracy for the c-th music genre

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1

denote respectively the row-based modulation spectral feature vector derived form

modulation spectral analysis of MFCC OSC and NASE From table 31 we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1

and the combined feature vector performs the best Table 32 show the corresponding

confusion matrices

Table 31 Averaged classification accuracy (CA ) for row-based modulation

Feature Set CA

SMMFCC1 7750 SMOSC1 7915 SMASE1 7778

SMMFCC1+SMOSC1+SMASE1 8464

43

Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6

Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14

World 33 8 1 0 4 75 Total 320 114 26 45 102 122

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492

Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148

World 1031 702 385 000 392 6148

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11

Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8

World 23 6 2 0 6 84 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902

Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656

World 719 526 769 000 588 6885

44

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5

Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13

World 28 6 3 1 4 73 Total 320 114 26 45 102 122

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410

Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066

World 875 526 1154 222 392 5984

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9

Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16

World 17 7 1 1 5 86 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738

Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311

World 531 614 385 222 490 7049

45

32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2

denote respectively the column-based modulation spectral feature vector derived form

modulation spectral analysis of MFCC OSC and NASE From table 31 we can see

that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2

which is different from the row-based With the same result the combined feature

vector also get the best performance Table 34 show the corresponding confusion

matrices

Table 33 Averaged classification accuracy (CA ) for row-based modulation

Feature Set CA

SMMFCC2 7064 SMOSC2 6859 SMASE2 7174

SMMFCC2+SMOSC2+SMASE2 7860

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4

Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19

World 33 10 3 0 9 54 Total 320 114 26 45 102 122

46

(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803

Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557

MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557

World 1031 877 1154 000 882 4426

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6

Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10

World 40 6 2 1 12 51 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492

Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820

World 1250 526 769 222 1176 4180

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2

Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15

World 31 10 7 0 12 54 Total 320 114 26 45 102 122

47

(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377

Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230

MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230

World 969 877 2692 000 1176 4426

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4

Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11

World 27 3 2 1 12 77 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328

Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902

World 844 263 769 222 1176 6311

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of

row-based and column-based modulation spectral feature vectors SMMFCC3

SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC

OSC and NASE Comparing this table with Table31 and Table33 we can see that

the combined feature vector will get a better classification performance than each

individual row-based or column-based feature vector Especially the proposed

48

method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of

8532 Table 36 shows the corresponding confusion matrices

Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation

Feature Set CA

SMMFCC3 8038 SMOSC3 8134 SMASE3 8121

SMMFCC3+SMOSC3+SMASE3 8532

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector

(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5

Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13

World 16 6 3 1 7 80 Total 320 114 26 45 102 122

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410

Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066

World 500 526 1154 222 686 6557

49

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6

Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10

World 20 11 1 2 7 87 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492

Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820

World 625 965 385 444 686 7131

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3

Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8

World 21 4 1 2 7 81 Total 320 114 26 45 102 122

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246

Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656

World 656 351 385 444 686 6639

50

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9

Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11

World 16 6 3 1 6 93 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738

Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902

World 500 526 1154 222 588 7623

Conventional methods use the energy of each modulation subband as the

feature value However we use the modulation spectral contrasts (MSCs) and

modulation spectral valleys (MSVs) computed from each modulation subband as

the feature value Table 37 shows the classification results of these two

approaches From Table 37 we can see that the using MSCs and MSVs have

better performance than the conventional method when row-based and

column-based modulation spectral feature vectors are combined In this table

SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based

column-based and combined feature vectors derived from modulation spectral

analysis of MFCC

51

Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value

Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915

SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre

Classification Contest

52

References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of

musical audio signalsrdquo Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664

53

[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical

genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of

IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200

March 2005

[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo

Journal of new musical research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and

Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis

modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139

Page 44: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

38

2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n}    (73)

where \mathbf{f}_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{\mathbf{f}}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector \hat{\mathbf{f}}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{\max}(m) = \max_{1\le c\le C,\; 1\le j\le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1\le c\le C,\; 1\le j\le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
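As a concrete illustration, the per-genre averaging of Eq. (73) and the linear min-max normalization of Eqs. (74)-(75) can be sketched in a few lines of NumPy; the function and variable names here are illustrative, not from the thesis:

```python
import numpy as np

def normalized_genre_centroids(features, labels, num_classes):
    """Per-genre mean feature vectors (Eq. 73), linearly normalized with
    the global per-dimension min/max (Eqs. 74-75).
    features: (N, M) array of feature vectors for N training signals,
    labels:   length-N array of genre indices in [0, num_classes)."""
    f_min = features.min(axis=0)   # Eq. (75): minimum over all training signals
    f_max = features.max(axis=0)   # Eq. (75): maximum over all training signals
    centroids = np.stack([features[labels == c].mean(axis=0)
                          for c in range(num_classes)])        # Eq. (73)
    return (centroids - f_min) / (f_max - f_min)               # Eq. (74)
```

The same global f_min/f_max computed on the training set would also be applied to each test feature vector, as described in the classification phase.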

2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf{x}_{c,n}-\bar{\mathbf{x}}_c)(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_c)^{\mathrm{T}}    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{\mathbf{x}}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c-\bar{\mathbf{x}})(\bar{\mathbf{x}}_c-\bar{\mathbf{x}})^{\mathrm{T}}    (77)

where \bar{\mathbf{x}} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(\mathbf{A}) = \mathrm{tr}\big((\mathbf{A}^{\mathrm{T}}\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^{\mathrm{T}}\mathbf{S}_B\mathbf{A})\big)    (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

\mathbf{x}_w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{\mathrm{T}}\mathbf{x}    (79)

It can be shown that the whitened within-class scatter matrix \mathbf{S}_{Ww} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{\mathrm{T}}\mathbf{S}_W(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix \mathbf{S}_{Bw} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{\mathrm{T}}\mathbf{S}_B(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{Bw}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

\mathbf{A}_{\mathrm{WLDA}} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi}    (80)

A_{WLDA} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector y can be computed by

\mathbf{y} = \mathbf{A}_{\mathrm{WLDA}}^{\mathrm{T}}\mathbf{x}    (81)
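The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows. This is an illustrative NumPy implementation under the assumption that S_W is nonsingular; the function name is hypothetical:

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """Whitened LDA transform (Eqs. 76-81): whiten with the
    eigendecomposition of S_W, then keep the (C-1) leading
    eigenvectors of the whitened between-class scatter."""
    overall_mean = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                                   # Eq. (76)
        Sb += len(Xc) * np.outer(mc - overall_mean, mc - overall_mean)  # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                # Sw Phi = Phi Lambda
    W = Phi @ np.diag(1.0 / np.sqrt(lam))        # whitening matrix Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                          # whitened between-class scatter
    evals, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals)[::-1][:num_classes - 1]   # (C-1) largest eigenvalues
    return W @ Psi[:, order]                     # Eq. (80); apply as y = A.T @ x (Eq. 81)
```

In practice, regularizing S_W (e.g., adding a small multiple of the identity) may be needed when the training set is small relative to the feature dimension.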

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n}    (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{\mathbf{y}}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to y:

s = \arg\min_{1\le c\le C} d(\mathbf{y}, \bar{\mathbf{y}}_c)    (83)
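The nearest-centroid decision rule of Eqs. (82)-(83) amounts to a few lines of code; this minimal sketch uses illustrative names:

```python
import numpy as np

def classify_genre(y_vec, centroids):
    """Nearest centroid classifier (Eq. 83): return the index of the
    representative vector (row of `centroids`, Eq. 82) closest to
    y_vec in Euclidean distance."""
    distances = np.linalg.norm(np.asarray(centroids) - y_vec, axis=1)
    return int(np.argmin(distances))
```

Here row c of `centroids` holds the genre centroid of Eq. (82), computed from the whitened-LDA-transformed training vectors of class c.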

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1\le c\le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
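Eq. (84) is simply a class-prior-weighted average of the per-genre accuracies, with P_c = N_c / N estimated from the class sizes. As a check, weighting the per-genre accuracies of the combined row-based feature vector by the test-set class sizes reproduces the reported overall accuracy of 84.64%:

```python
def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): CA = sum_c P_c * CA_c with P_c = N_c / N,
    i.e. per-class accuracies weighted by class priors."""
    total = sum(class_counts)
    return sum(n / total * acc
               for acc, n in zip(per_class_acc, class_counts))

# Per-genre accuracies (%) of SMMFCC1+SMOSC1+SMASE1 on the test set,
# weighted by the class sizes 320/114/26/45/102/122 -> about 84.64%.
acc = overall_accuracy([93.75, 84.21, 80.77, 75.56, 78.43, 70.49],
                       [320, 114, 26, 45, 102, 122])
```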

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (%) for each row-based modulation spectral feature vector

Feature Set              CA (%)
SMMFCC1                   77.50
SMOSC1                    79.15
SMASE1                    77.78
SMMFCC1+SMOSC1+SMASE1     84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature vector, the upper matrix gives track counts and the lower matrix gives per-column percentages; columns correspond to the actual genre and rows to the classified genre.

(a) SMMFCC1 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         275           0      2           0         1     19
  Electronic        0          91      0           1         7      6
  Jazz              6           0     18           0         0      4
  Metal/Punk        2           3      0          36        20      4
  Pop/Rock          4          12      5           8        70     14
  World            33           8      1           0         4     75
  Total           320         114     26          45       102    122

(a) SMMFCC1 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       85.94        0.00   7.69        0.00      0.98  15.57
  Electronic     0.00       79.82   0.00        2.22      6.86   4.92
  Jazz           1.88        0.00  69.23        0.00      0.00   3.28
  Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
  Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
  World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         292           1      1           0         2     10
  Electronic        1          89      1           2        11     11
  Jazz              4           0     19           1         1      6
  Metal/Punk        0           5      0          32        21      3
  Pop/Rock          0          13      3          10        61      8
  World            23           6      2           0         6     84
  Total           320         114     26          45       102    122

(b) SMOSC1 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       91.25        0.88   3.85        0.00      1.96   8.20
  Electronic     0.31       78.07   3.85        4.44     10.78   9.02
  Jazz           1.25        0.00  73.08        2.22      0.98   4.92
  Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
  Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
  World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         286           3      1           0         3     18
  Electronic        0          87      1           1         9      5
  Jazz              5           4     17           0         0      9
  Metal/Punk        0           4      1          36        18      4
  Pop/Rock          1          10      3           7        68     13
  World            28           6      3           1         4     73
  Total           320         114     26          45       102    122

(c) SMASE1 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       89.38        2.63   3.85        0.00      2.94  14.75
  Electronic     0.00       76.32   3.85        2.22      8.82   4.10
  Jazz           1.56        3.51  65.38        0.00      0.00   7.38
  Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
  Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
  World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         300           0      1           0         0      9
  Electronic        0          96      1           1         9      9
  Jazz              2           1     21           0         0      1
  Metal/Punk        0           1      0          34         8      1
  Pop/Rock          1           9      2           9        80     16
  World            17           7      1           1         5     86
  Total           320         114     26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       93.75        0.00   3.85        0.00      0.00   7.38
  Electronic     0.00       84.21   3.85        2.22      8.82   7.38
  Jazz           0.63        0.88  80.77        0.00      0.00   0.82
  Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
  Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
  World          5.31        6.14   3.85        2.22      4.90  70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 gives better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based feature vectors, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (%) for each column-based modulation spectral feature vector

Feature Set              CA (%)
SMMFCC2                   70.64
SMOSC2                    68.59
SMASE2                    71.74
SMMFCC2+SMOSC2+SMASE2     78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature vector, the upper matrix gives track counts and the lower matrix gives per-column percentages; columns correspond to the actual genre and rows to the classified genre.

(a) SMMFCC2 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         272           1      1           0         6     22
  Electronic        0          84      0           2         8      4
  Jazz             13           1     19           1         2     19
  Metal/Punk        2           7      0          39        30      4
  Pop/Rock          0          11      3           3        47     19
  World            33          10      3           0         9     54
  Total           320         114     26          45       102    122

(a) SMMFCC2 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       85.00        0.88   3.85        0.00      5.88  18.03
  Electronic     0.00       73.68   0.00        4.44      7.84   3.28
  Jazz           4.06        0.88  73.08        2.22      1.96  15.57
  Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
  Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
  World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         262           2      0           0         3     33
  Electronic        0          83      0           1         9      6
  Jazz             17           1     20           0         6     20
  Metal/Punk        1           5      0          33        21      2
  Pop/Rock          0          17      4          10        51     10
  World            40           6      2           1        12     51
  Total           320         114     26          45       102    122

(b) SMOSC2 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       81.88        1.75   0.00        0.00      2.94  27.05
  Electronic     0.00       72.81   0.00        2.22      8.82   4.92
  Jazz           5.31        0.88  76.92        0.00      5.88  16.39
  Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
  Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
  World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         277           0      0           0         2     29
  Electronic        0          83      0           1         5      2
  Jazz              9           3     17           1         2     15
  Metal/Punk        1           5      1          35        24      7
  Pop/Rock          2          13      1           8        57     15
  World            31          10      7           0        12     54
  Total           320         114     26          45       102    122

(c) SMASE2 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       86.56        0.00   0.00        0.00      1.96  23.77
  Electronic     0.00       72.81   0.00        2.22      4.90   1.64
  Jazz           2.81        2.63  65.38        2.22      1.96  12.30
  Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
  Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
  World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         289           5      0           0         3     18
  Electronic        0          89      0           2         4      4
  Jazz              2           3     19           0         1     10
  Metal/Punk        2           2      0          38        21      2
  Pop/Rock          0          12      5           4        61     11
  World            27           3      2           1        12     77
  Total           320         114     26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       90.31        4.39   0.00        0.00      2.94  14.75
  Electronic     0.00       78.07   0.00        4.44      3.92   3.28
  Jazz           0.63        2.63  73.08        0.00      0.98   8.20
  Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
  Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
  World          8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that the combined feature vectors yield better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (%) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set              CA (%)
SMMFCC3                   80.38
SMOSC3                    81.34
SMASE3                    81.21
SMMFCC3+SMOSC3+SMASE3     85.32

Table 3.6 Confusion matrices of the combined row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature vector, the upper matrix gives track counts and the lower matrix gives per-column percentages; columns correspond to the actual genre and rows to the classified genre.

(a) SMMFCC3 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         300           2      1           0         3     19
  Electronic        0          86      0           1         7      5
  Jazz              2           0     18           0         0      3
  Metal/Punk        1           4      0          35        18      2
  Pop/Rock          1          16      4           8        67     13
  World            16           6      3           1         7     80
  Total           320         114     26          45       102    122

(a) SMMFCC3 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       93.75        1.75   3.85        0.00      2.94  15.57
  Electronic     0.00       75.44   0.00        2.22      6.86   4.10
  Jazz           0.63        0.00  69.23        0.00      0.00   2.46
  Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
  Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
  World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         300           0      0           0         1     13
  Electronic        0          90      1           2         9      6
  Jazz              0           0     21           0         0      4
  Metal/Punk        0           2      0          31        21      2
  Pop/Rock          0          11      3          10        64     10
  World            20          11      1           2         7     87
  Total           320         114     26          45       102    122

(b) SMOSC3 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       93.75        0.00   0.00        0.00      0.98  10.66
  Electronic     0.00       78.95   3.85        4.44      8.82   4.92
  Jazz           0.00        0.00  80.77        0.00      0.00   3.28
  Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
  Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
  World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         296           2      1           0         0     17
  Electronic        1          91      0           1         4      3
  Jazz              0           2     19           0         0      5
  Metal/Punk        0           2      1          34        20      8
  Pop/Rock          2          13      4           8        71      8
  World            21           4      1           2         7     81
  Total           320         114     26          45       102    122

(c) SMASE3 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       92.50        1.75   3.85        0.00      0.00  13.93
  Electronic     0.31       79.82   0.00        2.22      3.92   2.46
  Jazz           0.00        1.75  73.08        0.00      0.00   4.10
  Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
  Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
  World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic         300           2      0           0         0      8
  Electronic        2          95      0           2         7      9
  Jazz              1           1     20           0         0      0
  Metal/Punk        0           0      0          35        10      1
  Pop/Rock          1          10      3           7        79     11
  World            16           6      3           1         6     93
  Total           320         114     26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
              Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
  Classic       93.75        1.75   0.00        0.00      0.00   6.56
  Electronic     0.63       83.33   0.00        4.44      6.86   7.38
  Jazz           0.31        0.88  76.92        0.00      0.00   0.00
  Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
  Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
  World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional energy-based features when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as feature values

Feature Set              MSCs & MSVs    MSE
SMMFCC1                        77.50  72.02
SMMFCC2                        70.64  69.82
SMMFCC3                        80.38  79.15
SMOSC1                         79.15  77.50
SMOSC2                         68.59  70.51
SMOSC3                         81.34  80.11
SMASE1                         77.78  76.41
SMASE2                         71.74  71.06
SMASE3                         81.21  79.15
SMMFCC1+SMOSC1+SMASE1          84.64  85.08
SMMFCC2+SMOSC2+SMASE2          78.60  79.01
SMMFCC3+SMOSC3+SMASE3          85.32  85.19
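For intuition, the difference between subband energy and peak/valley features can be sketched as follows. This is an illustrative reading of MSC/MSV as the per-subband peak and valley of the modulation spectrum of one feature-value trajectory; the thesis's exact MSC/MSV definitions and subband edges may differ, and the function name is hypothetical:

```python
import numpy as np

def modulation_contrast_valley(feature_traj, frame_rate, band_edges):
    """Per-subband peak (MSC-like) and valley (MSV-like) values of the
    modulation spectrum of one feature-value trajectory.
    band_edges: modulation-frequency boundaries in Hz (the thesis uses
    logarithmically spaced modulation subbands)."""
    spectrum = np.abs(np.fft.rfft(feature_traj))      # modulation spectrum
    freqs = np.fft.rfftfreq(len(feature_traj), d=1.0 / frame_rate)
    peaks, valleys = [], []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        peaks.append(band.max() if band.size else 0.0)
        valleys.append(band.min() if band.size else 0.0)
    return np.array(peaks), np.array(valleys)
```

Whereas the subband energy collapses each band to a single sum, keeping both the peak and the valley preserves the contrast within the band, which Table 3.7 suggests is the more discriminative description for the combined feature vectors.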

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, issue 6, pp. 1028-1035, Dec. 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, issue 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, issue 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.

Page 45: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

39

accuracy at a lower dimensional feature vector space LDA deals with the

discrimination between various classes rather than the representation of all classes

The objective of LDA is to minimize the within-class distance while maximize the

between-class distance In LDA an optimal transformation matrix that maps an

H-dimensional feature space to an h-dimensional space (h le H) has to be found in

order to provide higher discriminability among various music classes

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

    S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

    S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

    J_F(A) = \mathrm{tr}\left( (A^T S_W A)^{-1} (A^T S_B A) \right)    (78)

From the above equation, we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

    x_w = (\Phi \Lambda^{-1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_{Ww} = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, becomes the identity matrix I. Thus, the whitened between-class scatter matrix S_{Bw} = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{Bw}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

    A_{WLDA} = \Phi \Lambda^{-1/2} \Psi    (80)

A_{WLDA} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

    y = A_{WLDA}^T x    (81)
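The whitening-plus-LDA pipeline of Eqs. (76)-(81) can be sketched as follows. This is a minimal NumPy illustration under our own naming and toy data, not the thesis implementation:

```python
import numpy as np

def whitened_lda(X, labels):
    """Whitened LDA transform A_WLDA of Eqs. (76)-(80).

    X: (N, H) training matrix; labels: (N,) class ids 0..C-1.
    Returns A_WLDA with h = C - 1 columns.
    """
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        D = Xc - mc
        Sw += D.T @ D                                             # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)    # Eq. (77)
    # Whitening: Sw Phi = Phi Lambda, so W = Phi Lambda^{-1/2} makes
    # the whitened within-class scatter the identity matrix.
    lam, Phi = np.linalg.eigh(Sw)
    W = Phi @ np.diag(lam ** -0.5)
    # Whitened between-class scatter carries all discriminative information.
    Sb_w = W.T @ Sb @ W
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][: len(classes) - 1]]
    return W @ Psi                                                # Eq. (80)

# Toy usage: project 4-D vectors from three classes down to C - 1 = 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (30, 4)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 30)
A = whitened_lda(X, y)
Z = X @ A            # Eq. (81): reduced feature vectors
print(Z.shape)       # (90, 2)
```

Note that `eigh` is used because both scatter matrices are symmetric; the sketch assumes S_W is positive definite (enough training vectors per dimension), as the thesis implicitly does.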

2.3 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

    \bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector with minimum Euclidean distance to y:

    s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
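The nearest centroid rule of Eqs. (82)-(83) can be sketched as follows (illustrative code with made-up 2-D transformed features; names are ours):

```python
import numpy as np

def fit_centroids(Y, labels):
    """Eq. (82): one mean vector per class in the transformed space."""
    return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(y, centroids):
    """Eq. (83): pick the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(y - centroids[c]))

# Toy usage with C = 3 genres and 2-D whitened-LDA features.
Y = np.array([[0.0, 0.0], [0.2, 0.1],   # class 0
              [5.0, 5.0], [5.1, 4.9],   # class 1
              [9.0, 0.0], [9.2, 0.3]])  # class 2
labels = np.array([0, 0, 1, 1, 2, 2])
cents = fit_centroids(Y, labels)
print(classify(np.array([4.8, 5.2]), cents))  # 1
```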

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.

Since the music tracks are not equally distributed across classes, the overall accuracy of correctly classified genres is evaluated as follows:

    CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
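When P_c is taken as the empirical class frequency N_c / N, the weighted sum of Eq. (84) reduces to the overall fraction of correctly classified tracks. As a quick check, using the test-set class sizes above and the diagonal counts of the combined-feature confusion matrix of Table 3.2(d):

```python
# Eq. (84): overall accuracy as the class-prior-weighted mean of per-class accuracies.
counts = [320, 114, 26, 45, 102, 122]   # test tracks per genre (Classical .. World)
correct = [300, 96, 21, 34, 80, 86]     # correctly classified tracks per genre

total = sum(counts)
# (n / total) * (k / n) simplifies to k / total, i.e. plain overall accuracy.
ca = sum((n / total) * (k / n) for n, k in zip(counts, correct))
print(f"{ca:.2%}")  # 84.64%
```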

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote, respectively, the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and that the combined feature vector performs best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

    Feature Set                    CA
    SMMFCC1                      77.50%
    SMOSC1                       79.15%
    SMASE1                       77.78%
    SMMFCC1+SMOSC1+SMASE1        84.64%

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. Rows give the classified genre and columns the actual genre; for each feature set, the first matrix lists track counts and the second the corresponding per-genre percentages.

(a) SMMFCC1 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      275          0           2           0          1       19
  Electronic       0         91           0           1          7        6
  Jazz/Blues       6          0          18           0          0        4
  Metal/Punk       2          3           0          36         20        4
  Rock/Pop         4         12           5           8         70       14
  World           33          8           1           0          4       75
  Total          320        114          26          45        102      122

(a) SMMFCC1 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     85.94       0.00        7.69        0.00       0.98    15.57
  Electronic     0.00      79.82        0.00        2.22       6.86     4.92
  Jazz/Blues     1.88       0.00       69.23        0.00       0.00     3.28
  Metal/Punk     0.63       2.63        0.00       80.00      19.61     3.28
  Rock/Pop       1.25      10.53       19.23       17.78      68.63    11.48
  World         10.31       7.02        3.85        0.00       3.92    61.48

(b) SMOSC1 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      292          1           1           0          2       10
  Electronic       1         89           1           2         11       11
  Jazz/Blues       4          0          19           1          1        6
  Metal/Punk       0          5           0          32         21        3
  Rock/Pop         0         13           3          10         61        8
  World           23          6           2           0          6       84
  Total          320        114          26          45        102      122

(b) SMOSC1 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     91.25       0.88        3.85        0.00       1.96     8.20
  Electronic     0.31      78.07        3.85        4.44      10.78     9.02
  Jazz/Blues     1.25       0.00       73.08        2.22       0.98     4.92
  Metal/Punk     0.00       4.39        0.00       71.11      20.59     2.46
  Rock/Pop       0.00      11.40       11.54       22.22      59.80     6.56
  World          7.19       5.26        7.69        0.00       5.88    68.85

(c) SMASE1 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      286          3           1           0          3       18
  Electronic       0         87           1           1          9        5
  Jazz/Blues       5          4          17           0          0        9
  Metal/Punk       0          4           1          36         18        4
  Rock/Pop         1         10           3           7         68       13
  World           28          6           3           1          4       73
  Total          320        114          26          45        102      122

(c) SMASE1 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     89.38       2.63        3.85        0.00       2.94    14.75
  Electronic     0.00      76.32        3.85        2.22       8.82     4.10
  Jazz/Blues     1.56       3.51       65.38        0.00       0.00     7.38
  Metal/Punk     0.00       3.51        3.85       80.00      17.65     3.28
  Rock/Pop       0.31       8.77       11.54       15.56      66.67    10.66
  World          8.75       5.26       11.54        2.22       3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      300          0           1           0          0        9
  Electronic       0         96           1           1          9        9
  Jazz/Blues       2          1          21           0          0        1
  Metal/Punk       0          1           0          34          8        1
  Rock/Pop         1          9           2           9         80       16
  World           17          7           1           1          5       86
  Total          320        114          26          45        102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     93.75       0.00        3.85        0.00       0.00     7.38
  Electronic     0.00      84.21        3.85        2.22       8.82     7.38
  Jazz/Blues     0.63       0.88       80.77        0.00       0.00     0.82
  Metal/Punk     0.00       0.88        0.00       75.56       7.84     0.82
  Rock/Pop       0.31       7.89        7.69       20.00      78.43    13.11
  World          5.31       6.14        3.85        2.22       4.90    70.49
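The percentage matrices are simply the count matrices normalized by the per-genre column totals (the Total row). A small sketch regenerating the percentages of Table 3.2(d) from its counts:

```python
import numpy as np

# Rows: classified-as genre; columns: actual genre (Table 3.2(d) counts).
counts = np.array([
    [300,  0,  1,  0,  0,  9],
    [  0, 96,  1,  1,  9,  9],
    [  2,  1, 21,  0,  0,  1],
    [  0,  1,  0, 34,  8,  1],
    [  1,  9,  2,  9, 80, 16],
    [ 17,  7,  1,  1,  5, 86],
])

# Normalize each actual-genre column so that it sums to 100%.
percent = 100.0 * counts / counts.sum(axis=0)
print(np.round(percent[0, 0], 2))  # 93.75 (Classical classified as Classical)
```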

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote, respectively, the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

    Feature Set                    CA
    SMMFCC2                      70.64%
    SMOSC2                       68.59%
    SMASE2                       71.74%
    SMMFCC2+SMOSC2+SMASE2        78.60%

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Rows give the classified genre and columns the actual genre; for each feature set, the first matrix lists track counts and the second the corresponding per-genre percentages.

(a) SMMFCC2 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      272          1           1           0          6       22
  Electronic       0         84           0           2          8        4
  Jazz/Blues      13          1          19           1          2       19
  Metal/Punk       2          7           0          39         30        4
  Rock/Pop         0         11           3           3         47       19
  World           33         10           3           0          9       54
  Total          320        114          26          45        102      122

(a) SMMFCC2 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     85.00       0.88        3.85        0.00       5.88    18.03
  Electronic     0.00      73.68        0.00        4.44       7.84     3.28
  Jazz/Blues     4.06       0.88       73.08        2.22       1.96    15.57
  Metal/Punk     0.63       6.14        0.00       86.67      29.41     3.28
  Rock/Pop       0.00       9.65       11.54        6.67      46.08    15.57
  World         10.31       8.77       11.54        0.00       8.82    44.26

(b) SMOSC2 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      262          2           0           0          3       33
  Electronic       0         83           0           1          9        6
  Jazz/Blues      17          1          20           0          6       20
  Metal/Punk       1          5           0          33         21        2
  Rock/Pop         0         17           4          10         51       10
  World           40          6           2           1         12       51
  Total          320        114          26          45        102      122

(b) SMOSC2 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     81.88       1.75        0.00        0.00       2.94    27.05
  Electronic     0.00      72.81        0.00        2.22       8.82     4.92
  Jazz/Blues     5.31       0.88       76.92        0.00       5.88    16.39
  Metal/Punk     0.31       4.39        0.00       73.33      20.59     1.64
  Rock/Pop       0.00      14.91       15.38       22.22      50.00     8.20
  World         12.50       5.26        7.69        2.22      11.76    41.80

(c) SMASE2 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      277          0           0           0          2       29
  Electronic       0         83           0           1          5        2
  Jazz/Blues       9          3          17           1          2       15
  Metal/Punk       1          5           1          35         24        7
  Rock/Pop         2         13           1           8         57       15
  World           31         10           7           0         12       54
  Total          320        114          26          45        102      122

(c) SMASE2 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     86.56       0.00        0.00        0.00       1.96    23.77
  Electronic     0.00      72.81        0.00        2.22       4.90     1.64
  Jazz/Blues     2.81       2.63       65.38        2.22       1.96    12.30
  Metal/Punk     0.31       4.39        3.85       77.78      23.53     5.74
  Rock/Pop       0.63      11.40        3.85       17.78      55.88    12.30
  World          9.69       8.77       26.92        0.00      11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      289          5           0           0          3       18
  Electronic       0         89           0           2          4        4
  Jazz/Blues       2          3          19           0          1       10
  Metal/Punk       2          2           0          38         21        2
  Rock/Pop         0         12           5           4         61       11
  World           27          3           2           1         12       77
  Total          320        114          26          45        102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     90.31       4.39        0.00        0.00       2.94    14.75
  Electronic     0.00      78.07        0.00        4.44       3.92     3.28
  Jazz/Blues     0.63       2.63       73.08        0.00       0.98     8.20
  Metal/Punk     0.63       1.75        0.00       84.44      20.59     1.64
  Rock/Pop       0.00      10.53       19.23        8.89      59.80     9.02
  World          8.44       2.63        7.69        2.22      11.76    63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote, respectively, the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that each combined feature vector achieves better classification performance than its individual row-based or column-based counterpart. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

    Feature Set                    CA
    SMMFCC3                      80.38%
    SMOSC3                       81.34%
    SMASE3                       81.21%
    SMMFCC3+SMOSC3+SMASE3        85.32%

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Rows give the classified genre and columns the actual genre; for each feature set, the first matrix lists track counts and the second the corresponding per-genre percentages.

(a) SMMFCC3 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      300          2           1           0          3       19
  Electronic       0         86           0           1          7        5
  Jazz/Blues       2          0          18           0          0        3
  Metal/Punk       1          4           0          35         18        2
  Rock/Pop         1         16           4           8         67       13
  World           16          6           3           1          7       80
  Total          320        114          26          45        102      122

(a) SMMFCC3 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     93.75       1.75        3.85        0.00       2.94    15.57
  Electronic     0.00      75.44        0.00        2.22       6.86     4.10
  Jazz/Blues     0.63       0.00       69.23        0.00       0.00     2.46
  Metal/Punk     0.31       3.51        0.00       77.78      17.65     1.64
  Rock/Pop       0.31      14.04       15.38       17.78      65.69    10.66
  World          5.00       5.26       11.54        2.22       6.86    65.57

(b) SMOSC3 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      300          0           0           0          1       13
  Electronic       0         90           1           2          9        6
  Jazz/Blues       0          0          21           0          0        4
  Metal/Punk       0          2           0          31         21        2
  Rock/Pop         0         11           3          10         64       10
  World           20         11           1           2          7       87
  Total          320        114          26          45        102      122

(b) SMOSC3 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     93.75       0.00        0.00        0.00       0.98    10.66
  Electronic     0.00      78.95        3.85        4.44       8.82     4.92
  Jazz/Blues     0.00       0.00       80.77        0.00       0.00     3.28
  Metal/Punk     0.00       1.75        0.00       68.89      20.59     1.64
  Rock/Pop       0.00       9.65       11.54       22.22      62.75     8.20
  World          6.25       9.65        3.85        4.44       6.86    71.31

(c) SMASE3 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      296          2           1           0          0       17
  Electronic       1         91           0           1          4        3
  Jazz/Blues       0          2          19           0          0        5
  Metal/Punk       0          2           1          34         20        8
  Rock/Pop         2         13           4           8         71        8
  World           21          4           1           2          7       81
  Total          320        114          26          45        102      122

(c) SMASE3 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     92.50       1.75        3.85        0.00       0.00    13.93
  Electronic     0.31      79.82        0.00        2.22       3.92     2.46
  Jazz/Blues     0.00       1.75       73.08        0.00       0.00     4.10
  Metal/Punk     0.00       1.75        3.85       75.56      19.61     6.56
  Rock/Pop       0.63      11.40       15.38       17.78      69.61     6.56
  World          6.56       3.51        3.85        4.44       6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical      300          2           0           0          0        8
  Electronic       2         95           0           2          7        9
  Jazz/Blues       1          1          20           0          0        0
  Metal/Punk       0          0           0          35         10        1
  Rock/Pop         1         10           3           7         79       11
  World           16          6           3           1          6       93
  Total          320        114          26          45        102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
              Classical  Electronic  Jazz/Blues  Metal/Punk  Rock/Pop  World
  Classical     93.75       1.75        0.00        0.00       0.00     6.56
  Electronic     0.63      83.33        0.00        4.44       6.86     7.38
  Jazz/Blues     0.31       0.88       76.92        0.00       0.00     0.00
  Metal/Punk     0.00       0.00        0.00       77.78       9.80     0.82
  Rock/Pop       0.31       8.77       11.54       15.56      77.45     9.02
  World          5.00       5.26       11.54        2.22       5.88    76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy using MSCs & MSVs versus subband energy (MSE) as feature values

    Feature Set                  MSCs & MSVs     MSE
    SMMFCC1                        77.50%      72.02%
    SMMFCC2                        70.64%      69.82%
    SMMFCC3                        80.38%      79.15%
    SMOSC1                         79.15%      77.50%
    SMOSC2                         68.59%      70.51%
    SMOSC3                         81.34%      80.11%
    SMASE1                         77.78%      76.41%
    SMASE2                         71.74%      71.06%
    SMASE3                         81.21%      79.15%
    SMMFCC1+SMOSC1+SMASE1          84.64%      85.08%
    SMMFCC2+SMOSC2+SMASE2          78.60%      79.01%
    SMMFCC3+SMOSC3+SMASE3          85.32%      85.19%

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. Int. Conf. on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. 2006 IEEE Int. Conf. on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. 2004 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


column-based modulation spectral feature vectors are combined In this table

SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based

column-based and combined feature vectors derived from modulation spectral

analysis of MFCC

51

Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value

Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915

SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre

Classification Contest

52

References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of

musical audio signalsrdquo Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664

53

[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical

genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of

IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200

March 2005

[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo

Journal of new musical research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and

Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis

modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139

Page 47: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

41

transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{\mathbf{y}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{y}_{c,n}    (82)

where \mathbf{y}_{c,n} denotes the whitened LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{\mathbf{y}}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has minimum Euclidean distance to \mathbf{y}:

s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c)    (83)
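The nearest-centroid rule above is straightforward to implement. The sketch below is our illustration (NumPy-based; names such as `genre_centroids` are ours, not the thesis's): it averages the whitened LDA-transformed training vectors per genre, then classifies a test vector by the smallest Euclidean distance.

```python
import numpy as np

def genre_centroids(train_feats, labels, num_genres):
    # Representative vector of each genre: the mean of the whitened
    # LDA-transformed training vectors labeled with that genre (Eq. (82)).
    return np.array([train_feats[labels == c].mean(axis=0)
                     for c in range(num_genres)])

def classify(y, centroids):
    # Identified genre: index of the centroid with minimum Euclidean
    # distance to the test vector y (Eq. (83)).
    dists = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(dists))
```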

Chapter 3

Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
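Because P_c is the fraction of test tracks belonging to genre c, the weighted sum in Eq. (84) equals the total number of correctly classified tracks divided by the total number of test tracks. A small sketch (ours, for illustration) computes it from a confusion matrix whose columns are true genres and whose rows are classified genres, matching the layout of the confusion tables in this chapter:

```python
import numpy as np

def overall_accuracy(confusion):
    # confusion[i, j]: number of tracks of true genre j classified as genre i.
    totals = confusion.sum(axis=0)               # tracks per true genre
    per_class = np.diag(confusion) / totals      # CA_c for each genre
    weights = totals / totals.sum()              # P_c for each genre
    return float((weights * per_class).sum())    # CA = sum_c P_c * CA_c
```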

3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.

Table 3.1 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64

Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix lists track counts and the second the corresponding percentages; columns are the true genres and rows the classified genres.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Pop/Rock          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Pop/Rock          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Pop/Rock          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Pop/Rock          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. Consistent with the row-based results, however, the combined feature vector again performs the best. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix lists track counts and the second the corresponding percentages; columns are the true genres and rows the classified genres.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1     1           0         6     22
Electronic        0          84     0           2         8      4
Jazz             13           1    19           1         2     19
Metal/Punk        2           7     0          39        30      4
Pop/Rock          0          11     3           3        47     19
World            33          10     3           0         9     54
Total           320         114    26          45       102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2     0           0         3     33
Electronic        0          83     0           1         9      6
Jazz             17           1    20           0         6     20
Metal/Punk        1           5     0          33        21      2
Pop/Rock          0          17     4          10        51     10
World            40           6     2           1        12     51
Total           320         114    26          45       102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0     0           0         2     29
Electronic        0          83     0           1         5      2
Jazz              9           3    17           1         2     15
Metal/Punk        1           5     1          35        24      7
Pop/Rock          2          13     1           8        57     15
World            31          10     7           0        12     54
Total           320         114    26          45       102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5     0           0         3     18
Electronic        0          89     0           2         4      4
Jazz              2           3    19           0         1     10
Metal/Punk        2           2     0          38        21      2
Pop/Rock          0          12     5           4        61     11
World            27           3     2           1        12     77
Total           320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32

Table 3.6 Confusion matrices of the combined row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix lists track counts and the second the corresponding percentages; columns are the true genres and rows the classified genres.

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     1           0         3     19
Electronic        0          86     0           1         7      5
Jazz              2           0    18           0         0      3
Metal/Punk        1           4     0          35        18      2
Pop/Rock          1          16     4           8        67     13
World            16           6     3           1         7     80
Total           320         114    26          45       102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     0           0         1     13
Electronic        0          90     1           2         9      6
Jazz              0           0    21           0         0      4
Metal/Punk        0           2     0          31        21      2
Pop/Rock          0          11     3          10        64     10
World            20          11     1           2         7     87
Total           320         114    26          45       102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2     1           0         0     17
Electronic        1          91     0           1         4      3
Jazz              0           2    19           0         0      5
Metal/Punk        0           2     1          34        20      8
Pop/Rock          2          13     4           8        71      8
World            21           4     1           2         7     81
Total           320         114    26          45       102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     0           0         0      8
Electronic        2          95     0           2         7      9
Jazz              1           1    20           0         0      0
Metal/Punk        0           0     0          35        10      1
Pop/Rock          1          10     3           7        79     11
World            16           6     3           1         6     93
Total           320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
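By analogy with octave-based spectral contrast, the MSC of a subband can be read as the peak-to-valley spread of the log modulation spectrum within that subband, and the MSV as the valley itself. The sketch below is only our illustrative reading of these quantities, not the thesis's exact implementation; `band_edges` and the log compression are our assumptions:

```python
import numpy as np

def msc_msv(mod_spectrum, band_edges):
    # mod_spectrum: 1-D magnitude modulation spectrum of one feature trajectory.
    # band_edges: (lo, hi) bin-index pairs, one per modulation subband.
    log_spec = np.log1p(mod_spectrum)            # log compression (assumed)
    peaks = np.array([log_spec[lo:hi].max() for lo, hi in band_edges])
    valleys = np.array([log_spec[lo:hi].min() for lo, hi in band_edges])
    return peaks - valleys, valleys              # MSCs, MSVs per subband
```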

Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as feature values

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features has been proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.

References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10(3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7(2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32(1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14(8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, 15(5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, 7(6) (2005) 1028-1035.
[13] J. J. Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," Proc. of the 6th Int. Conf. on Digital Audio Effects, 2003, pp. 8-11.
[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007 (2006) 1-12.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 197-200.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, 32(1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14(5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13(12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, 8(6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, 102(3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, 23(2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, 25(1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, 52(10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, 13(3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, 15(4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65(2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1) (1997) 119-139.

Page 48: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

42

music tracks used for trainingtesting include 320320 tracks of Classical 115114

tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102

tracks of RockPop and 122122 tracks of World music genre

Since the music tracks per class are not equally distributed the overall accuracy

of correctly classified genres is evaluated as follows

(84) 1sumlele

sdot=Cc

cc CAPCA

where Pc is the probability of appearance of the c-th music genre CAc is the

classification accuracy for the c-th music genre

31 Comparison of row-based modulation spectral feature vector

Table 31 shows the average classification accuracy for each row-based

modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1

denote respectively the row-based modulation spectral feature vector derived form

modulation spectral analysis of MFCC OSC and NASE From table 31 we can see

that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1

and the combined feature vector performs the best Table 32 show the corresponding

confusion matrices

Table 31 Averaged classification accuracy (CA ) for row-based modulation

Feature Set CA

SMMFCC1 7750 SMOSC1 7915 SMASE1 7778

SMMFCC1+SMOSC1+SMASE1 8464

43

Table 3.2: Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the upper table gives counts (rows: predicted genre; columns: actual genre) and the lower table gives the same entries as percentages of each column total.

(a) SMMFCC1 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          275           0      2          0        1     19
Electronic         0          91      0          1        7      6
Jazz               6           0     18          0        0      4
MetalPunk          2           3      0         36       20      4
PopRock            4          12      5          8       70     14
World             33           8      1          0        4     75
Total            320         114     26         45      102    122

(a) SMMFCC1 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        85.94        0.00   7.69       0.00     0.98  15.57
Electronic      0.00       79.82   0.00       2.22     6.86   4.92
Jazz            1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk       0.63        2.63   0.00      80.00    19.61   3.28
PopRock         1.25       10.53  19.23      17.78    68.63  11.48
World          10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          292           1      1          0        2     10
Electronic         1          89      1          2       11     11
Jazz               4           0     19          1        1      6
MetalPunk          0           5      0         32       21      3
PopRock            0          13      3         10       61      8
World             23           6      2          0        6     84
Total            320         114     26         45      102    122

(b) SMOSC1 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        91.25        0.88   3.85       0.00     1.96   8.20
Electronic      0.31       78.07   3.85       4.44    10.78   9.02
Jazz            1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk       0.00        4.39   0.00      71.11    20.59   2.46
PopRock         0.00       11.40  11.54      22.22    59.80   6.56
World           7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          286           3      1          0        3     18
Electronic         0          87      1          1        9      5
Jazz               5           4     17          0        0      9
MetalPunk          0           4      1         36       18      4
PopRock            1          10      3          7       68     13
World             28           6      3          1        4     73
Total            320         114     26         45      102    122

(c) SMASE1 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        89.38        2.63   3.85       0.00     2.94  14.75
Electronic      0.00       76.32   3.85       2.22     8.82   4.10
Jazz            1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk       0.00        3.51   3.85      80.00    17.65   3.28
PopRock         0.31        8.77  11.54      15.56    66.67  10.66
World           8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          300           0      1          0        0      9
Electronic         0          96      1          1        9      9
Jazz               2           1     21          0        0      1
MetalPunk          0           1      0         34        8      1
PopRock            1           9      2          9       80     16
World             17           7      1          1        5     86
Total            320         114     26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        0.00   3.85       0.00     0.00   7.38
Electronic      0.00       84.21   3.85       2.22     8.82   7.38
Jazz            0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk       0.00        0.88   0.00      75.56     7.84   0.82
PopRock         0.31        7.89   7.69      20.00    78.43  13.11
World           5.31        6.14   3.85       2.22     4.90  70.49
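Each percentage table above is simply the corresponding count table normalized by its per-genre column totals (the number of test tracks of each genre). A minimal sketch of that normalization in plain Python, using the counts of Table 3.2(a):

```python
# Counts from Table 3.2(a); rows are predicted genres, columns are the actual
# genres in the order Classic, Electronic, Jazz, MetalPunk, PopRock, World.
counts = {
    "Classic":    [275, 0, 2, 0, 1, 19],
    "Electronic": [0, 91, 0, 1, 7, 6],
    "Jazz":       [6, 0, 18, 0, 0, 4],
    "MetalPunk":  [2, 3, 0, 36, 20, 4],
    "PopRock":    [4, 12, 5, 8, 70, 14],
    "World":      [33, 8, 1, 0, 4, 75],
}
# Column totals = number of test tracks per genre (the "Total" row).
totals = [sum(col) for col in zip(*counts.values())]
# Percentage table: each count divided by its column total.
percent = {g: [100.0 * c / t for c, t in zip(row, totals)]
           for g, row in counts.items()}
print(totals)                                          # [320, 114, 26, 45, 102, 122]
print([round(p, 2) for p in percent["Classic"]])       # [85.94, 0.0, 7.69, 0.0, 0.98, 15.57]
```

The printed first row reproduces the "Classic" row of the percentage table in Table 3.2(a), confirming how the two halves of each table relate.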

3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote, respectively, the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again gives the best performance. Table 3.4 shows the corresponding confusion matrices.

Table 3.3: Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                CA (%)
SMMFCC2                     70.64
SMOSC2                      68.59
SMASE2                      71.74
SMMFCC2+SMOSC2+SMASE2       78.60

Table 3.4: Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the upper table gives counts (rows: predicted genre; columns: actual genre) and the lower table gives the same entries as percentages of each column total.

(a) SMMFCC2 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          272           1      1          0        6     22
Electronic         0          84      0          2        8      4
Jazz              13           1     19          1        2     19
MetalPunk          2           7      0         39       30      4
PopRock            0          11      3          3       47     19
World             33          10      3          0        9     54
Total            320         114     26         45      102    122

(a) SMMFCC2 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        85.00        0.88   3.85       0.00     5.88  18.03
Electronic      0.00       73.68   0.00       4.44     7.84   3.28
Jazz            4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk       0.63        6.14   0.00      86.67    29.41   3.28
PopRock         0.00        9.65  11.54       6.67    46.08  15.57
World          10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          262           2      0          0        3     33
Electronic         0          83      0          1        9      6
Jazz              17           1     20          0        6     20
MetalPunk          1           5      0         33       21      2
PopRock            0          17      4         10       51     10
World             40           6      2          1       12     51
Total            320         114     26         45      102    122

(b) SMOSC2 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        81.88        1.75   0.00       0.00     2.94  27.05
Electronic      0.00       72.81   0.00       2.22     8.82   4.92
Jazz            5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk       0.31        4.39   0.00      73.33    20.59   1.64
PopRock         0.00       14.91  15.38      22.22    50.00   8.20
World          12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          277           0      0          0        2     29
Electronic         0          83      0          1        5      2
Jazz               9           3     17          1        2     15
MetalPunk          1           5      1         35       24      7
PopRock            2          13      1          8       57     15
World             31          10      7          0       12     54
Total            320         114     26         45      102    122

(c) SMASE2 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        86.56        0.00   0.00       0.00     1.96  23.77
Electronic      0.00       72.81   0.00       2.22     4.90   1.64
Jazz            2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk       0.31        4.39   3.85      77.78    23.53   5.74
PopRock         0.63       11.40   3.85      17.78    55.88  12.30
World           9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          289           5      0          0        3     18
Electronic         0          89      0          2        4      4
Jazz               2           3     19          0        1     10
MetalPunk          2           2      0         38       21      2
PopRock            0          12      5          4       61     11
World             27           3      2          1       12     77
Total            320         114     26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        90.31        4.39   0.00       0.00     2.94  14.75
Electronic      0.00       78.07   0.00       4.44     3.92   3.28
Jazz            0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk       0.63        1.75   0.00      84.44    20.59   1.64
PopRock         0.00       10.53  19.23       8.89    59.80   9.02
World           8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote, respectively, the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5: Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                CA (%)
SMMFCC3                     80.38
SMOSC3                      81.34
SMASE3                      81.21
SMMFCC3+SMOSC3+SMASE3       85.32

Table 3.6: Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the upper table gives counts (rows: predicted genre; columns: actual genre) and the lower table gives the same entries as percentages of each column total.

(a) SMMFCC3 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          300           2      1          0        3     19
Electronic         0          86      0          1        7      5
Jazz               2           0     18          0        0      3
MetalPunk          1           4      0         35       18      2
PopRock            1          16      4          8       67     13
World             16           6      3          1        7     80
Total            320         114     26         45      102    122

(a) SMMFCC3 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        1.75   3.85       0.00     2.94  15.57
Electronic      0.00       75.44   0.00       2.22     6.86   4.10
Jazz            0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk       0.31        3.51   0.00      77.78    17.65   1.64
PopRock         0.31       14.04  15.38      17.78    65.69  10.66
World           5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          300           0      0          0        1     13
Electronic         0          90      1          2        9      6
Jazz               0           0     21          0        0      4
MetalPunk          0           2      0         31       21      2
PopRock            0          11      3         10       64     10
World             20          11      1          2        7     87
Total            320         114     26         45      102    122

(b) SMOSC3 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        0.00   0.00       0.00     0.98  10.66
Electronic      0.00       78.95   3.85       4.44     8.82   4.92
Jazz            0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk       0.00        1.75   0.00      68.89    20.59   1.64
PopRock         0.00        9.65  11.54      22.22    62.75   8.20
World           6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          296           2      1          0        0     17
Electronic         1          91      0          1        4      3
Jazz               0           2     19          0        0      5
MetalPunk          0           2      1         34       20      8
PopRock            2          13      4          8       71      8
World             21           4      1          2        7     81
Total            320         114     26         45      102    122

(c) SMASE3 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        92.50        1.75   3.85       0.00     0.00  13.93
Electronic      0.31       79.82   0.00       2.22     3.92   2.46
Jazz            0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk       0.00        1.75   3.85      75.56    19.61   6.56
PopRock         0.63       11.40  15.38      17.78    69.61   6.56
World           6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic          300           2      0          0        0      8
Electronic         2          95      0          2        7      9
Jazz               1           1     20          0        0      0
MetalPunk          0           0      0         35       10      1
PopRock            1          10      3          7       79     11
World             16           6      3          1        6     93
Total            320         114     26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (% of column total)
             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        1.75   0.00       0.00     0.00   6.56
Electronic      0.63       83.33   0.00       4.44     6.86   7.38
Jazz            0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk       0.00        0.00   0.00      77.78     9.80   0.82
PopRock         0.31        8.77  11.54      15.56    77.45   9.02
World           5.00        5.26  11.54       2.22     5.88  76.23
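As a sanity check, the reported overall accuracy of 85.32% for SMMFCC3+SMOSC3+SMASE3 can be recomputed from confusion matrix (d) above: the correctly classified tracks lie on the diagonal, and the denominator is the total number of test tracks.

```python
# Counts from Table 3.6(d); rows are predicted genres, columns are actual
# genres (Classic, Electronic, Jazz, MetalPunk, PopRock, World).
matrix = [
    [300,  2,  0,  0,  0,  8],   # Classic
    [  2, 95,  0,  2,  7,  9],   # Electronic
    [  1,  1, 20,  0,  0,  0],   # Jazz
    [  0,  0,  0, 35, 10,  1],   # MetalPunk
    [  1, 10,  3,  7, 79, 11],   # PopRock
    [ 16,  6,  3,  1,  6, 93],   # World
]
correct = sum(matrix[i][i] for i in range(6))  # diagonal: correctly classified tracks
total = sum(sum(row) for row in matrix)        # all test tracks
accuracy = 100.0 * correct / total
print(correct, total, round(accuracy, 2))      # 622 729 85.32
```

This reproduces the 85.32% figure: 622 of the 729 test tracks are classified correctly.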

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 compares the classification results of these two approaches. From Table 3.7 we can see that MSCs and MSVs give better performance than the conventional subband-energy features when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
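A rough sketch of a contrast-style peak/valley computation for one modulation subband is given below. It follows the octave-based spectral contrast idea (contrast = log-peak minus log-valley); the neighborhood fraction `alpha` and the log compression are illustrative assumptions, not the thesis's exact definition.

```python
import math

def msc_msv(subband_mags, alpha=0.2):
    """Sketch of modulation spectral contrast (MSC) and valley (MSV) for the
    modulation-spectrum magnitudes falling in one modulation subband.
    alpha is the assumed fraction of bins averaged around the peak/valley."""
    x = sorted(subband_mags, reverse=True)
    k = max(1, int(round(alpha * len(x))))
    peak = math.log(sum(x[:k]) / k)     # log of the average of the k largest bins
    valley = math.log(sum(x[-k:]) / k)  # log of the average of the k smallest bins
    return peak - valley, valley        # (MSC, MSV)

msc, msv = msc_msv([0.1, 0.5, 2.0, 1.2, 0.05])
print(msc, msv)  # contrast = log(2.0) - log(0.05) = log(40); valley = log(0.05)
```

Collecting these two values over every logarithmically spaced modulation subband, and aggregating them statistically across feature dimensions, yields the MSC/MSV feature vectors compared against the subband-energy baseline in Table 3.7.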


Table 3.7: Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set                MSCs & MSVs     MSE
SMMFCC1                          77.50   72.02
SMMFCC2                          70.64   69.82
SMMFCC3                          80.38   79.15
SMOSC1                           79.15   77.50
SMOSC2                           68.59   70.51
SMOSC3                           81.34   80.11
SMASE1                           77.78   76.41
SMASE2                           71.74   71.06
SMASE3                           81.21   79.15
SMMFCC1+SMOSC1+SMASE1            84.64   85.08
SMMFCC2+SMOSC2+SMASE2            78.60   79.01
SMMFCC3+SMOSC3+SMASE3            85.32   85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
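The first step of the pipeline summarized above — collecting the modulation spectrum of one feature value's trajectory across frames — can be sketched as follows. A direct DFT is used for clarity; the thesis's actual FFT length, windowing, and modulation subband layout are not reproduced here.

```python
import cmath
import math

def modulation_spectrum(trajectory):
    """One column of the modulation spectrogram: the magnitude DFT of a single
    feature value's frame-by-frame trajectory (direct DFT for clarity)."""
    n = len(trajectory)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(trajectory)))
            for k in range(n // 2 + 1)]

# A feature value oscillating 3 times over 16 frames produces a peak at
# modulation bin 3 with magnitude n/2 = 8.
traj = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
ms = modulation_spectrum(traj)
print(max(range(len(ms)), key=lambda k: ms[k]))  # 3
```

Stacking one such column per feature dimension gives the modulation spectrogram from which the log-spaced subbands and the MSC/MSV statistics are then computed.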


References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.

[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of the ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.

[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.

[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech, and Language Processing 15 (5) (2007) 1654-1664.

[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.

[13] J. J. Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 197-200.

[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.

[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.

[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (2) (2005) 275-285.

[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histograms in audio and symbolic music information retrieval, Proc. IRCAM, 2002.

[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.

[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.

[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.

[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.

[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2000.

[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.

[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-668.

[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech, and Language Processing 15 (4) (2007) 1236-1246.

[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.

[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.

Page 49: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

43

Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6

Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14

World 33 8 1 0 4 75 Total 320 114 26 45 102 122

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492

Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148

World 1031 702 385 000 392 6148

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11

Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8

World 23 6 2 0 6 84 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902

Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656

World 719 526 769 000 588 6885

44

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5

Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13

World 28 6 3 1 4 73 Total 320 114 26 45 102 122

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410

Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066

World 875 526 1154 222 392 5984

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9

Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16

World 17 7 1 1 5 86 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738

Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311

World 531 614 385 222 490 7049

45

32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2

denote respectively the column-based modulation spectral feature vector derived form

modulation spectral analysis of MFCC OSC and NASE From table 31 we can see

that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2

which is different from the row-based With the same result the combined feature

vector also get the best performance Table 34 show the corresponding confusion

matrices

Table 33 Averaged classification accuracy (CA ) for row-based modulation

Feature Set CA

SMMFCC2 7064 SMOSC2 6859 SMASE2 7174

SMMFCC2+SMOSC2+SMASE2 7860

Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4

Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19

World 33 10 3 0 9 54 Total 320 114 26 45 102 122

46

(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803

Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557

MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557

World 1031 877 1154 000 882 4426

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6

Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10

World 40 6 2 1 12 51 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492

Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820

World 1250 526 769 222 1176 4180

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2

Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15

World 31 10 7 0 12 54 Total 320 114 26 45 102 122

47

(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377

Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230

MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230

World 969 877 2692 000 1176 4426

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4

Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11

World 27 3 2 1 12 77 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328

Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902

World 844 263 769 222 1176 6311

33 Combination of row-based and column-based modulation

spectral feature vectors

Table 35 shows the average classification accuracy of the combination of

row-based and column-based modulation spectral feature vectors SMMFCC3

SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC

OSC and NASE Comparing this table with Table31 and Table33 we can see that

the combined feature vector will get a better classification performance than each

individual row-based or column-based feature vector Especially the proposed

48

method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of

8532 Table 36 shows the corresponding confusion matrices

Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation

Feature Set CA

SMMFCC3 8038 SMOSC3 8134 SMASE3 8121

SMMFCC3+SMOSC3+SMASE3 8532

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector

(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5

Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13

World 16 6 3 1 7 80 Total 320 114 26 45 102 122

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410

Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066

World 500 526 1154 222 686 6557

49

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6

Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10

World 20 11 1 2 7 87 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492

Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820

World 625 965 385 444 686 7131

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3

Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8

World 21 4 1 2 7 81 Total 320 114 26 45 102 122

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246

Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656

World 656 351 385 444 686 6639

50

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9

Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11

World 16 6 3 1 6 93 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738

Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902

World 500 526 1154 222 588 7623

Conventional methods use the energy of each modulation subband as the

feature value However we use the modulation spectral contrasts (MSCs) and

modulation spectral valleys (MSVs) computed from each modulation subband as

the feature value Table 37 shows the classification results of these two

approaches From Table 37 we can see that the using MSCs and MSVs have

better performance than the conventional method when row-based and

column-based modulation spectral feature vectors are combined In this table

SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based

column-based and combined feature vectors derived from modulation spectral

analysis of MFCC

51

Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value

Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915

SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre

Classification Contest

52

References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of

musical audio signalsrdquo Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664

53

[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical

genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of

IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200

March 2005

[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo

Journal of new musical research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and

Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis

modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139

Page 50: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

44

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5

Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13

World 28 6 3 1 4 73 Total 320 114 26 45 102 122

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410

Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066

World 875 526 1154 222 392 5984

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9

Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16

World 17 7 1 1 5 86 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738

Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311

World 531 614 385 222 490 7049

45

32 Comparison of column-based modulation spectral feature vector

Table 33 shows the average classification accuracy for each column-based

modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2

denote respectively the column-based modulation spectral feature vector derived form

modulation spectral analysis of MFCC OSC and NASE From table 31 we can see

that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2

which is different from the row-based With the same result the combined feature

vector also get the best performance Table 34 show the corresponding confusion

matrices

Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60

Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          272           1     1          0        6     22
Electronic         0          84     0          2        8      4
Jazz              13           1    19          1        2     19
MetalPunk          2           7     0         39       30      4
PopRock            0          11     3          3       47     19
World             33          10     3          0        9     54
Total            320         114    26         45      102    122

(a) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        85.00        0.88   3.85       0.00     5.88  18.03
Electronic      0.00       73.68   0.00       4.44     7.84   3.28
Jazz            4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk       0.63        6.14   0.00      86.67    29.41   3.28
PopRock         0.00        9.65  11.54       6.67    46.08  15.57
World          10.31        8.77  11.54       0.00     8.82  44.26

(b)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          262           2     0          0        3     33
Electronic         0          83     0          1        9      6
Jazz              17           1    20          0        6     20
MetalPunk          1           5     0         33       21      2
PopRock            0          17     4         10       51     10
World             40           6     2          1       12     51
Total            320         114    26         45      102    122

(b) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        81.88        1.75   0.00       0.00     2.94  27.05
Electronic      0.00       72.81   0.00       2.22     8.82   4.92
Jazz            5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk       0.31        4.39   0.00      73.33    20.59   1.64
PopRock         0.00       14.91  15.38      22.22    50.00   8.20
World          12.50        5.26   7.69       2.22    11.76  41.80

(c)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          277           0     0          0        2     29
Electronic         0          83     0          1        5      2
Jazz               9           3    17          1        2     15
MetalPunk          1           5     1         35       24      7
PopRock            2          13     1          8       57     15
World             31          10     7          0       12     54
Total            320         114    26         45      102    122

(c) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        86.56        0.00   0.00       0.00     1.96  23.77
Electronic      0.00       72.81   0.00       2.22     4.90   1.64
Jazz            2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk       0.31        4.39   3.85      77.78    23.53   5.74
PopRock         0.63       11.40   3.85      17.78    55.88  12.30
World           9.69        8.77  26.92       0.00    11.76  44.26

(d)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          289           5     0          0        3     18
Electronic         0          89     0          2        4      4
Jazz               2           3    19          0        1     10
MetalPunk          2           2     0         38       21      2
PopRock            0          12     5          4       61     11
World             27           3     2          1       12     77
Total            320         114    26         45      102    122

(d) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        90.31        4.39   0.00       0.00     2.94  14.75
Electronic      0.00       78.07   0.00       4.44     3.92   3.28
Jazz            0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk       0.63        1.75   0.00      84.44    20.59   1.64
PopRock         0.00       10.53  19.23       8.89    59.80   9.02
World           8.44        2.63   7.69       2.22    11.76  63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote, respectively, the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Tables 3.1 and 3.3, we can see that each combined feature vector yields better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32

Table 3.6 Confusion matrices of the combination of the row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           2     1          0        3     19
Electronic         0          86     0          1        7      5
Jazz               2           0    18          0        0      3
MetalPunk          1           4     0         35       18      2
PopRock            1          16     4          8       67     13
World             16           6     3          1        7     80
Total            320         114    26         45      102    122

(a) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        1.75   3.85       0.00     2.94  15.57
Electronic      0.00       75.44   0.00       2.22     6.86   4.10
Jazz            0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk       0.31        3.51   0.00      77.78    17.65   1.64
PopRock         0.31       14.04  15.38      17.78    65.69  10.66
World           5.00        5.26  11.54       2.22     6.86  65.57

(b)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           0     0          0        1     13
Electronic         0          90     1          2        9      6
Jazz               0           0    21          0        0      4
MetalPunk          0           2     0         31       21      2
PopRock            0          11     3         10       64     10
World             20          11     1          2        7     87
Total            320         114    26         45      102    122

(b) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        0.00   0.00       0.00     0.98  10.66
Electronic      0.00       78.95   3.85       4.44     8.82   4.92
Jazz            0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk       0.00        1.75   0.00      68.89    20.59   1.64
PopRock         0.00        9.65  11.54      22.22    62.75   8.20
World           6.25        9.65   3.85       4.44     6.86  71.31

(c)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          296           2     1          0        0     17
Electronic         1          91     0          1        4      3
Jazz               0           2    19          0        0      5
MetalPunk          0           2     1         34       20      8
PopRock            2          13     4          8       71      8
World             21           4     1          2        7     81
Total            320         114    26         45      102    122

(c) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        92.50        1.75   3.85       0.00     0.00  13.93
Electronic      0.31       79.82   0.00       2.22     3.92   2.46
Jazz            0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk       0.00        1.75   3.85      75.56    19.61   6.56
PopRock         0.63       11.40  15.38      17.78    69.61   6.56
World           6.56        3.51   3.85       4.44     6.86  66.39

(d)          Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           2     0          0        0      8
Electronic         2          95     0          2        7      9
Jazz               1           1    20          0        0      0
MetalPunk          0           0     0         35       10      1
PopRock            1          10     3          7       79     11
World             16           6     3          1        6     93
Total            320         114    26         45      102    122

(d) in %     Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        1.75   0.00       0.00     0.00   6.56
Electronic      0.63       83.33   0.00       4.44     6.86   7.38
Jazz            0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk       0.00        0.00   0.00      77.78     9.80   0.82
PopRock         0.31        8.77  11.54      15.56    77.45   9.02
World           5.00        5.26  11.54       2.22     5.88  76.23
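Since the diagonal of each count matrix holds the correctly classified tracks, the overall accuracy reported for SMMFCC3+SMOSC3+SMASE3 can be recovered directly from count matrix (d) above. A minimal check in Python (the matrix values are copied from the table; the script itself is only illustrative):

```python
# Count matrix (d) of Table 3.6: SMMFCC3+SMOSC3+SMASE3.
counts = [
    [300, 2, 0, 0, 0, 8],
    [2, 95, 0, 2, 7, 9],
    [1, 1, 20, 0, 0, 0],
    [0, 0, 0, 35, 10, 1],
    [1, 10, 3, 7, 79, 11],
    [16, 6, 3, 1, 6, 93],
]
correct = sum(counts[i][i] for i in range(6))   # diagonal: correct tracks
total = sum(sum(row) for row in counts)         # all test tracks
accuracy = 100.0 * correct / total
print(correct, total, round(accuracy, 2))       # 622 729 85.32
```

622 of the 729 test tracks land on the diagonal, which reproduces the 85.32% accuracy quoted in the text and in Table 3.5.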

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of MSCs & MSVs versus modulation subband energy (MSE) for each feature value

Feature Set                    MSCs & MSVs    MSE
SMMFCC1                          77.50        72.02
SMMFCC2                          70.64        69.82
SMMFCC3                          80.38        79.15
SMOSC1                           79.15        77.50
SMOSC2                           68.59        70.51
SMOSC3                           81.34        80.11
SMASE1                           77.78        76.41
SMASE2                           71.74        71.06
SMASE3                           81.21        79.15
SMMFCC1+SMOSC1+SMASE1            84.64        85.08
SMMFCC2+SMOSC2+SMASE2            78.60        79.01
SMMFCC3+SMOSC3+SMASE3            85.32        85.19
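The MSC/MSV extraction compared above can be sketched as follows. The peak-minus-valley definition of the contrast and the example subband edges are assumptions for illustration (the thesis may, for instance, define MSC on log-magnitudes), not its exact implementation:

```python
def msc_msv(mod_spectrum, edges):
    """Per-subband contrast and valley of one modulation-spectrum row.

    mod_spectrum: magnitude values for one feature dimension.
    edges: (lo, hi) bin ranges of the log-spaced modulation subbands
           (hypothetical values below, for illustration only).
    """
    mscs, msvs = [], []
    for lo, hi in edges:
        band = mod_spectrum[lo:hi]
        peak, valley = max(band), min(band)
        mscs.append(peak - valley)   # contrast: peak minus valley (assumed form)
        msvs.append(valley)          # valley: weakest component in the subband
    return mscs, msvs

spectrum = [8.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5, 0.5]
edges = [(0, 2), (2, 4), (4, 8)]     # octave-like widths 2, 2, 4
mscs, msvs = msc_msv(spectrum, edges)
print(mscs)   # [4.0, 1.0, 1.0]
```

Aggregating such MSCs and MSVs over all feature dimensions (e.g. their means and standard deviations) would then yield the compact feature vectors compared in Table 3.7.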

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
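The pipeline summarized above starts from per-frame feature trajectories. A rough sketch of generating one modulation-spectrum row per feature dimension via a plain DFT across frames; the frame count, the lack of windowing, and the direct DFT are illustrative assumptions, not the thesis's exact settings:

```python
import cmath

def modulation_spectrogram(feature_frames):
    """DFT magnitudes of each feature dimension's trajectory across frames.

    feature_frames: list of per-frame feature vectors (e.g. MFCC, OSC, NASE).
    Returns one modulation-spectrum row per feature dimension.
    """
    n_frames = len(feature_frames)
    n_dims = len(feature_frames[0])
    rows = []
    for d in range(n_dims):
        # Time series of one coefficient across all frames.
        traj = [frame[d] for frame in feature_frames]
        mags = []
        for k in range(n_frames // 2 + 1):  # non-negative modulation frequencies
            s = sum(traj[n] * cmath.exp(-2j * cmath.pi * k * n / n_frames)
                    for n in range(n_frames))
            mags.append(abs(s))
        rows.append(mags)
    return rows

# Two feature dimensions over four frames; dimension 0 alternates every frame,
# so its energy concentrates at the highest modulation frequency bin.
spec = modulation_spectrogram([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(spec[0])
```

Stacking these rows gives the modulation spectrogram from which the MSC/MSV subband features are extracted.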

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. Int. Conf. on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Trans. on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, March 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Trans. on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre

Classification Contest

52

References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of

musical audio signalsrdquo Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664

53

[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical

genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of

IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200

March 2005

[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo

Journal of new musical research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and

Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis

modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139

Page 53: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

47

(c) classification rates (%):

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic         86.56        0.00    0.00        0.00      1.96   23.77
Electronic       0.00       72.81    0.00        2.22      4.90    1.64
Jazz             2.81        2.63   65.38        2.22      1.96   12.30
Metal/Punk       0.31        4.39    3.85       77.78     23.53    5.74
Pop/Rock         0.63       11.40    3.85       17.78     55.88   12.30
World            9.69        8.77   26.92        0.00     11.76   44.26

(d) counts:

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic           289           5       0           0         3      18
Electronic          0          89       0           2         4       4
Jazz                2           3      19           0         1      10
Metal/Punk          2           2       0          38        21       2
Pop/Rock            0          12       5           4        61      11
World              27           3       2           1        12      77
Total             320         114      26          45       102     122

(d) classification rates (%):

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic         90.31        4.39    0.00        0.00      2.94   14.75
Electronic       0.00       78.07    0.00        4.44      3.92    3.28
Jazz             0.63        2.63   73.08        0.00      0.98    8.20
Metal/Punk       0.63        1.75    0.00       84.44     20.59    1.64
Pop/Rock         0.00       10.53   19.23        8.89     59.80    9.02
World            8.44        2.63    7.69        2.22     11.76   63.11

3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the averaged classification accuracy of the combination of row-based and column-based modulation spectral feature vectors, where SMMFCC3, SMOSC3, and SMASE3 denote the combined feature vectors of MFCC, OSC, and NASE, respectively. Comparing this table with Tables 3.1 and 3.3, we can see that each combined feature vector achieves better classification performance than its individual row-based or column-based counterpart. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                   CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32

Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature vector, the count matrix is followed by the classification rates (%); rows are predicted genres and columns are actual genres, so each column of counts sums to the per-genre track total.

(a) SMMFCC3 — counts:

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic           300           2       1           0         3      19
Electronic          0          86       0           1         7       5
Jazz                2           0      18           0         0       3
Metal/Punk          1           4       0          35        18       2
Pop/Rock            1          16       4           8        67      13
World              16           6       3           1         7      80
Total             320         114      26          45       102     122

(a) SMMFCC3 — classification rates (%):

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic         93.75        1.75    3.85        0.00      2.94   15.57
Electronic       0.00       75.44    0.00        2.22      6.86    4.10
Jazz             0.63        0.00   69.23        0.00      0.00    2.46
Metal/Punk       0.31        3.51    0.00       77.78     17.65    1.64
Pop/Rock         0.31       14.04   15.38       17.78     65.69   10.66
World            5.00        5.26   11.54        2.22      6.86   65.57

(b) SMOSC3 — counts:

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic           300           0       0           0         1      13
Electronic          0          90       1           2         9       6
Jazz                0           0      21           0         0       4
Metal/Punk          0           2       0          31        21       2
Pop/Rock            0          11       3          10        64      10
World              20          11       1           2         7      87
Total             320         114      26          45       102     122

(b) SMOSC3 — classification rates (%):

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic         93.75        0.00    0.00        0.00      0.98   10.66
Electronic       0.00       78.95    3.85        4.44      8.82    4.92
Jazz             0.00        0.00   80.77        0.00      0.00    3.28
Metal/Punk       0.00        1.75    0.00       68.89     20.59    1.64
Pop/Rock         0.00        9.65   11.54       22.22     62.75    8.20
World            6.25        9.65    3.85        4.44      6.86   71.31

(c) SMASE3 — counts:

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic           296           2       1           0         0      17
Electronic          1          91       0           1         4       3
Jazz                0           2      19           0         0       5
Metal/Punk          0           2       1          34        20       8
Pop/Rock            2          13       4           8        71       8
World              21           4       1           2         7      81
Total             320         114      26          45       102     122

(c) SMASE3 — classification rates (%):

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic         92.50        1.75    3.85        0.00      0.00   13.93
Electronic       0.31       79.82    0.00        2.22      3.92    2.46
Jazz             0.00        1.75   73.08        0.00      0.00    4.10
Metal/Punk       0.00        1.75    3.85       75.56     19.61    6.56
Pop/Rock         0.63       11.40   15.38       17.78     69.61    6.56
World            6.56        3.51    3.85        4.44      6.86   66.39

(d) SMMFCC3+SMOSC3+SMASE3 — counts:

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic           300           2       0           0         0       8
Electronic          2          95       0           2         7       9
Jazz                1           1      20           0         0       0
Metal/Punk          0           0       0          35        10       1
Pop/Rock            1          10       3           7        79      11
World              16           6       3           1         6      93
Total             320         114      26          45       102     122

(d) SMMFCC3+SMOSC3+SMASE3 — classification rates (%):

              Classic  Electronic    Jazz  Metal/Punk  Pop/Rock   World
Classic         93.75        1.75    0.00        0.00      0.00    6.56
Electronic       0.63       83.33    0.00        4.44      6.86    7.38
Jazz             0.31        0.88   76.92        0.00      0.00    0.00
Metal/Punk       0.00        0.00    0.00       77.78      9.80    0.82
Pop/Rock         0.31        8.77   11.54       15.56     77.45    9.02
World            5.00        5.26   11.54        2.22      5.88   76.23
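The percentage matrices are the count matrices normalized column by column, each column divided by the total number of tracks in that actual genre (the Total row), and the averaged classification accuracy is the trace of the count matrix divided by the total number of tracks. A minimal sketch of this bookkeeping, using the counts from panel (d) of Table 3.6:

```python
import numpy as np

# Confusion counts from Table 3.6 (d): rows = predicted genre, columns = actual genre.
genres = ["Classic", "Electronic", "Jazz", "Metal/Punk", "Pop/Rock", "World"]
counts = np.array([
    [300,  2,  0,  0,  0,  8],
    [  2, 95,  0,  2,  7,  9],
    [  1,  1, 20,  0,  0,  0],
    [  0,  0,  0, 35, 10,  1],
    [  1, 10,  3,  7, 79, 11],
    [ 16,  6,  3,  1,  6, 93],
])

totals = counts.sum(axis=0)         # tracks per actual genre: 320, 114, 26, 45, 102, 122
percent = 100.0 * counts / totals   # column-normalized rates, as in the (%) matrices

# Averaged classification accuracy = correctly classified tracks / all tracks.
accuracy = 100.0 * np.trace(counts) / counts.sum()
print(f"{accuracy:.2f}")            # 85.32 for the combined feature set
```

Running this reproduces the 85.32% figure quoted for SMMFCC3+SMOSC3+SMASE3, and `percent` matches the rate matrix of panel (d).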

Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that MSCs and MSVs give better performance than the conventional modulation spectral energy (MSE) when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote the row-based, column-based, and combined feature vectors, respectively, derived from modulation spectral analysis of MFCC.
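To make the two kinds of subband features concrete, the sketch below computes an MSC/MSV pair and the conventional subband energy (MSE) from one modulation subband. It assumes, by analogy with octave-based spectral contrast, that the peak and valley are simply the maximum and minimum log-magnitudes within the subband; the thesis's exact definitions (e.g. averaging over a neighborhood of the peak and valley) may differ.

```python
import numpy as np

def subband_features(mod_mag, eps=1e-10):
    """Illustrative MSC/MSV/MSE for one modulation subband.

    mod_mag: 1-D array of modulation-spectrum magnitudes that fall in
    the subband. Peak/valley are taken as max/min of the log-magnitudes
    (a simplifying assumption, not the thesis's exact formula).
    """
    log_mag = np.log(mod_mag + eps)
    peak, valley = log_mag.max(), log_mag.min()
    msc = peak - valley            # modulation spectral contrast
    msv = valley                   # modulation spectral valley
    mse = np.sum(mod_mag ** 2)     # conventional subband energy
    return msc, msv, mse

# Example: a subband dominated by one strong modulation component.
band = np.array([0.05, 0.8, 0.1, 0.02])
msc, msv, mse = subband_features(band)
```

The point of the contrast pair is that `msc` and `msv` separate the strength of the dominant modulation component from the noise floor, whereas `mse` collapses both into a single energy value.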

Table 3.7 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation spectral energy (MSE) as the feature value

Feature Set                  MSCs & MSVs      MSE
SMMFCC1                            77.50    72.02
SMMFCC2                            70.64    69.82
SMMFCC3                            80.38    79.15
SMOSC1                             79.15    77.50
SMOSC2                             68.59    70.51
SMOSC3                             81.34    80.11
SMASE1                             77.78    76.41
SMASE2                             71.74    71.06
SMASE3                             81.21    79.15
SMMFCC1+SMOSC1+SMASE1              84.64    85.08
SMMFCC2+SMOSC2+SMASE2              78.60    79.01
SMMFCC3+SMOSC3+SMASE3              85.32    85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features has been proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband, and statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
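The pipeline summarized above can be outlined in code: each feature dimension's frame-by-frame trajectory is transformed into a modulation spectrum, the spectra are stacked into a modulation spectrogram, and the modulation-frequency axis is partitioned into logarithmically spaced subbands. This is only a schematic sketch; the DC removal, function names, and subband-edge formula are illustrative assumptions, not the thesis's exact parameters.

```python
import numpy as np

def modulation_spectrogram(features):
    """features: (num_frames, dim) matrix of per-frame feature vectors
    (e.g. MFCC, OSC, or NASE). Returns a (dim, num_frames//2 + 1) array:
    one modulation-spectrum magnitude row per feature dimension."""
    trajectories = features - features.mean(axis=0)   # remove per-dimension DC
    return np.abs(np.fft.rfft(trajectories, axis=0)).T

def log_spaced_edges(n_bins, n_subbands=8):
    """Illustrative logarithmically spaced modulation-subband boundaries
    over modulation-frequency bin indices 1..n_bins."""
    return np.unique(np.geomspace(1, n_bins, n_subbands + 1).astype(int))

# Example: 256 analysis frames of a 20-dimensional feature set.
rng = np.random.default_rng(0)
feats = rng.standard_normal((256, 20))
ms = modulation_spectrogram(feats)      # shape (20, 129)
edges = log_spaced_edges(ms.shape[1])   # subband boundaries for MSC/MSV extraction
```

MSC and MSV would then be computed within each `edges[i]:edges[i+1]` slice of every row of `ms`, and aggregated (e.g. mean and standard deviation) across rows to form the final feature vector.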


References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conference on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Transactions on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.

[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th International Conference on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop on Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.

Page 54: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

48

method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of

8532 Table 36 shows the corresponding confusion matrices

Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation

Feature Set CA

SMMFCC3 8038 SMOSC3 8134 SMASE3 8121

SMMFCC3+SMOSC3+SMASE3 8532

Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector

(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5

Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13

World 16 6 3 1 7 80 Total 320 114 26 45 102 122

(a) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410

Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066

World 500 526 1154 222 686 6557

49

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6

Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10

World 20 11 1 2 7 87 Total 320 114 26 45 102 122

(b) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492

Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820

World 625 965 385 444 686 7131

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3

Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8

World 21 4 1 2 7 81 Total 320 114 26 45 102 122

(c) Classic Electronic Jazz MetalPunk PopRock World

Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246

Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656

World 656 351 385 444 686 6639

50

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9

Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11

World 16 6 3 1 6 93 Total 320 114 26 45 102 122

(d) Classic Electronic Jazz MetalPunk PopRock World

Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738

Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902

World 500 526 1154 222 588 7623

Conventional methods use the energy of each modulation subband as the

feature value However we use the modulation spectral contrasts (MSCs) and

modulation spectral valleys (MSVs) computed from each modulation subband as

the feature value Table 37 shows the classification results of these two

approaches From Table 37 we can see that the using MSCs and MSVs have

better performance than the conventional method when row-based and

column-based modulation spectral feature vectors are combined In this table

SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based

column-based and combined feature vectors derived from modulation spectral

analysis of MFCC

51

Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value

Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915

SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral

(OSC and NASE) and cepstral (MFCC) features is proposed for music genre

classification The long-term modulation spectrum analysis is employed to capture the

time-varying behavior of each feature value For each spectralcepstral feature set a

modulation spectrogram will be generated by collecting the modulation spectrum of

all corresponding feature values Modulation spectral contrast (MSC) and modulation

spectral valley (MSV) are then computed from each logarithmically-spaced

modulation subband Statistical aggregations of all MSCs and MSVs are computed to

generate effective and compact discriminating features The music database employed

in the ISMIR2004 Audio Description Contest where all music tracks are classified

into six classes was used for performance comparison If the modulation spectral

features of MFCC OSC and NASE are combined together the classification

accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre

Classification Contest

52

References

[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE

Trans on Speech and Audio Processing 10 (3) (2002) 293-302

[2] T Li M Ogihara Q Li A Comparative study on content-based music genre

classification Proceedings of ACM Conf on Research and Development in

Information Retrieval 2003 pp 282-289

[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification

by spectral contrast feature Proceedings of the IEEE International Conference

on Multimedia amp Expo vol 1 2002 pp 113-116

[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of

musical audio signalsrdquo Proceedings of International Conference on Music

Information Retrieval 2004

[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals

using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)

308-315

[6] M F McKinney J Breebaart Features for audio and music classification

Proceedings of the 4th International Conference on Music Information Retrieval

2003 pp 151-158

[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal

of New Music Research 32 (1) (2003) 83-93

[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre

similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524

[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for

music genre classification IEEE Trans on Audio Speech and Language

Processing 15 (5) (2007) 1654-1664

53

[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic

transformations for music genre classification Proceedings of the 6th

International Conference on Music Information Retrieval 2005 pp 34-41

[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of

audio signals for music genre classification using different ensemble and feature

selection techniques Proceedings of the 5th ACM SIGMM International

Workshop on Multimedia Information Retrieval 2003 pp102-108

[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre

models for analysis and retrieval of music signals IEEE Transactions on

Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005

[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical

genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp

8-11 September 2003

[14] J G A Barbedo and A Lopes Research article automatic genre classification

of musical signals EURASIP Journal on Advances in Signal Processing Vol

2007 pp1-12 June 2006

[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of

IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200

March 2005

[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo

Journal of new musical research Vol 32 No 1 pp 83-93 2003

[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral

basis representation IEEE Trans On Circuits and Systems for Video Technology

14 (5) (2004) 716-725

[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and

Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis

modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139

Page 55: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

49

(b) Confusion matrix (number of tracks)

             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           0     0          0        1     13
Electronic         0          90     1          2        9      6
Jazz               0           0    21          0        0      4
MetalPunk          0           2     0         31       21      2
PopRock            0          11     3         10       64     10
World             20          11     1          2        7     87
Total            320         114    26         45      102    122

(b) Confusion matrix (%)

             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        0.00   0.00       0.00     0.98  10.66
Electronic      0.00       78.95   3.85       4.44     8.82   4.92
Jazz            0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk       0.00        1.75   0.00      68.89    20.59   1.64
PopRock         0.00        9.65  11.54      22.22    62.75   8.20
World           6.25        9.65   3.85       4.44     6.86  71.31

(c) Confusion matrix (number of tracks)

             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          296           2     1          0        0     17
Electronic         1          91     0          1        4      3
Jazz               0           2    19          0        0      5
MetalPunk          0           2     1         34       20      8
PopRock            2          13     4          8       71      8
World             21           4     1          2        7     81
Total            320         114    26         45      102    122

(c) Confusion matrix (%)

             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        92.50        1.75   3.85       0.00     0.00  13.93
Electronic      0.31       79.82   0.00       2.22     3.92   2.46
Jazz            0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk       0.00        1.75   3.85      75.56    19.61   6.56
PopRock         0.63       11.40  15.38      17.78    69.61   6.56
World           6.56        3.51   3.85       4.44     6.86  66.39


(d) Confusion matrix (number of tracks)

             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           2     0          0        0      8
Electronic         2          95     0          2        7      9
Jazz               1           1    20          0        0      0
MetalPunk          0           0     0         35       10      1
PopRock            1          10     3          7       79     11
World             16           6     3          1        6     93
Total            320         114    26         45      102    122

(d) Confusion matrix (%)

             Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic        93.75        1.75   0.00       0.00     0.00   6.56
Electronic      0.63       83.33   0.00       4.44     6.86   7.38
Jazz            0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk       0.00        0.00   0.00      77.78     9.80   0.82
PopRock         0.31        8.77  11.54      15.56    77.45   9.02
World           5.00        5.26  11.54       2.22     5.88  76.23
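As a quick consistency check on these tables, each percentage entry is the count divided by its column total, and the diagonal of matrix (d) reproduces the reported overall accuracy. A minimal sketch, assuming rows are predicted genres and columns are true genres (the column sums match the Total row):

```python
import numpy as np

# Confusion matrix (d) from the text; rows = predicted genre, columns = true genre.
genres = ["Classic", "Electronic", "Jazz", "MetalPunk", "PopRock", "World"]
cm = np.array([
    [300,  2,  0,  0,  0,  8],
    [  2, 95,  0,  2,  7,  9],
    [  1,  1, 20,  0,  0,  0],
    [  0,  0,  0, 35, 10,  1],
    [  1, 10,  3,  7, 79, 11],
    [ 16,  6,  3,  1,  6, 93],
])

col_totals = cm.sum(axis=0)          # tracks per true genre: 320 114 26 45 102 122
percent = 100.0 * cm / col_totals    # column-normalized, as in the percentage table
per_class_acc = np.diag(percent)     # per-genre accuracy (recall)
overall = 100.0 * np.trace(cm) / cm.sum()

print(per_class_acc.round(2))        # 93.75, 83.33, 76.92, 77.78, 77.45, 76.23
print(round(overall, 2))             # 85.32 -- the reported combined accuracy
```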

Conventional methods use the energy of each modulation subband as the feature value. Instead, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 3.7 compares the classification results of the two approaches. It shows that the MSC and MSV features outperform the conventional energy features when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
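The contrast features described above can be sketched as follows. This is a hypothetical illustration, not the thesis's exact formulation: the number of subbands, the logarithmic band-edge placement, and the peak-minus-valley form of the contrast are all assumptions.

```python
import numpy as np

def msc_msv(feature_traj, n_bands=8, eps=1e-10):
    """MSCs and MSVs of one feature value's trajectory across frames (a sketch)."""
    mod_spec = np.abs(np.fft.rfft(feature_traj))   # modulation spectrum magnitude
    n_bins = mod_spec.shape[0]
    # Logarithmically spaced modulation subband edges (bin indices, DC excluded).
    edges = np.unique(np.logspace(0, np.log10(n_bins), n_bands + 1).astype(int))
    msc, msv = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = mod_spec[lo:hi]
        peak, valley = band.max(), band.min()
        msv.append(np.log(valley + eps))                       # modulation spectral valley
        msc.append(np.log(peak + eps) - np.log(valley + eps))  # modulation spectral contrast
    return np.array(msc), np.array(msv)

# Example: a feature value that oscillates slowly across analysis frames.
traj = np.sin(2 * np.pi * 4 * np.linspace(0, 1, 256))
msc, msv = msc_msv(traj)
```

Unlike a single subband energy, the peak/valley pair preserves how strongly the modulation energy is concentrated within each subband.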


Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set                MSCs & MSVs    MSE
SMMFCC1                          77.50  72.02
SMMFCC2                          70.64  69.82
SMMFCC3                          80.38  79.15
SMOSC1                           79.15  77.50
SMOSC2                           68.59  70.51
SMOSC3                           81.34  80.11
SMASE1                           77.78  76.41
SMASE2                           71.74  71.06
SMASE3                           81.21  79.15
SMMFCC1+SMOSC1+SMASE1            84.64  85.08
SMMFCC2+SMOSC2+SMASE2            78.60  79.01
SMMFCC3+SMOSC3+SMASE3            85.32  85.19

Chapter 4

Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
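The modulation spectrogram construction summarized above can be sketched roughly as below. The frame-level features are random stand-ins, and the mean/std aggregation over feature dimensions is a simplification of the statistical aggregation of MSCs and MSVs actually used.

```python
import numpy as np

# Each feature value's trajectory across frames is transformed along the time
# axis, yielding one modulation spectrum per feature dimension.
rng = np.random.default_rng(0)
frames = rng.standard_normal((13, 512))   # e.g. 13 MFCCs x 512 analysis frames

# Stack the per-dimension modulation spectra into a modulation spectrogram.
mod_spectrogram = np.abs(np.fft.rfft(frames, axis=1))   # shape (13, 257)

# Aggregate into a compact, fixed-length feature vector for classification.
feat = np.concatenate([mod_spectrogram.mean(axis=0), mod_spectrogram.std(axis=0)])
print(mod_spectrogram.shape, feat.shape)
```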

References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.

[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proceedings of the ACM Conference on Research and Development in Information Retrieval, 2003, pp. 282-289.

[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.

[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proceedings of the International Conference on Music Information Retrieval, 2004.

[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Transactions on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.

[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.

[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.

[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.

[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.

[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.

[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "'The way it sounds': timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, 2005.

[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proceedings of the 6th International Conference on Digital Audio Effects, pp. 8-11, September 2003.

[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, 2006.

[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. 197-200.

[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.

[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.

[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 275-285, 2005.

[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.

[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, 2000.

[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, 1997.

[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, 2006.

[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.

[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, 2004.

[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2006, pp. 1085-1088.

[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. Wiley, 2005.

[28] R. Duda, P. Hart, and D. Stork, Pattern Classification. New York: Wiley, 2000.

[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, 2005.

[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2004, pp. V-665-668.

[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, 2007.

[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proceedings of the Workshop on Multimedia Discovery and Mining, 2003.

[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.


708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139

Page 60: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

54

Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005

[19] W A Sethares R D Robin J C Sethares Beat tracking of musical

performance using low-level audio feature IEEE Trans on Speech and Audio

Processing 13 (12) (2005) 275-285

[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and

Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002

[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis

modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp

708-716 November 2000

[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical

Society of America Vol 102 No 3 pp 1811-1820 September 1997

[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music

content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -

141 Mar 2006

[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using

the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132

1998

[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for

content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10

pp3023-3035 October 2004

[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation

spectrum analysis and its application to music emotion classification in 2006

IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088

July 2006

[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139

Page 61: Automatic music genre classification based on modulation ...chur.chu.edu.tw/bitstream/987654321/1027/1/GM095020290.pdf · 本論文提出了利用調變頻譜分析去觀察長時間的特徵變化,進而從中擷取出

55

indexing and retrieval Wiley 2005

[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000

[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and

summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13

No 3 pp 441-450 May 2005

[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification

and retrieval using joint time-frequency analysis in 2004 IEEE International

Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV

- 665-8 May 2004

[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and

classification using local discriminant basesrdquo IEEE Transactions on Audio

Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007

[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature

selection strategies and ensemble techniques for classifying music Proceedings

of Workshop in Multimedia Discovery and Mining 2003

[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and

Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484

[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of

online learning and an application to boostingrsquo Journal of Computer and System

Sciences 55(1) 119ndash139