中華大學 (Chung Hua University)
碩士論文 (Master's Thesis)

題目: 應用頻譜及倒頻譜特徵之調變頻譜分析於音樂風格之自動分類
Title: Automatic Music Genre Classification Based on Modulation Spectral Analysis of Spectral and Cepstral Features

系所別 (Department): 資訊工程學系碩士班 (Master Program, Department of Computer Science and Information Engineering)
學號姓名 (Student ID / Name): M09502029 林懷三
指導教授 (Advisor): 李建興 博士 (Dr. 李建興)
摘要 (Abstract)

This thesis proposes the use of modulation spectral analysis to observe the long-term variation of audio features and to extract contrast features from it. First, a feature vector is extracted from every frame of a music track (in this thesis, the frame-based features are MFCC, OSC, and MPEG-7 NASE). Modulation spectral analysis is then applied to model the variation of each feature value across frames; the modulation spectrum is divided into several modulation subbands, and the energy and contrast features of each modulation subband are extracted. In the classification stage, the required features are extracted from an input test signal, linear normalization and LDA are applied to reduce the feature dimension, and the Euclidean distance between the test signal and the representative vector of each music genre is computed; the genre with the minimum distance is taken as the classification result. The experimental results clearly show that the features extracted by modulation spectral analysis outperform the conventional approach of using the mean and standard deviation vectors of all frames as features, and the best classification accuracy reaches 85.32%.
Abstract

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. Since a typical music database often contains millions of music tracks, it is very difficult to manage such a large music database, and managing a vast amount of music tracks becomes much easier when they are properly categorized. Therefore, a novel feature set derived from modulation spectral analysis of Mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), and normalized audio spectral envelope (NASE) is proposed for music genre classification. The extracted features derived from modulation spectral analysis can capture the time-varying behavior of music signals. The experimental results show that the feature vectors derived from modulation spectral analysis achieve better performance than those obtained by taking the mean and standard deviation operations. In addition, applying statistical analysis to the feature values of the modulation subbands can reduce the feature dimension efficiently. The classification accuracy can be further improved by using linear discriminant analysis (LDA), while the feature dimension is reduced.
致謝 (Acknowledgements)

During my graduate studies I gained some understanding of the field of audio signal processing, and more importantly I learned the attitude and method of research: perseverance, determination, and a truth-seeking spirit. The person who led me to appreciate this deeper meaning is my advisor, Professor 李建興. From him I learned how to carry out the research process and how to look at problems, and I experienced the joy of the moment when a bottleneck is finally broken through. During the writing of this thesis I sincerely thank him for reading and revising it many times so that it could be completed smoothly. I am also deeply grateful to him for helping me revise and proofread my journal submission until late at night; his efforts will always be remembered. I would also like to thank Professors 連振昌, 韓欽銓, 石昭玲, and 周智勳 for their guidance on coursework and reports.

I would also like to give special thanks to 吳翠霞 and Lilian; without their teaching and help I would not be who I am today. I also thank my seniors 忠茂, 炳佑, 清乾, 正達, 昭偉, 建程, 家銘, and 靈逸 for their guidance, and my classmates 銘輝, 岳岷, 佐民, 雅麟, and 佑維 for their mutual support and help. Finally, I thank my juniors 勝斌, 正崙, 偉欣, 明修, 信吉, 琮瑋, 仁政, 蘇峻, 雅婷, 佩蓉, 永坤, 致娟, and 堯文 for their companionship; whether in studies or in daily life, they gave me unforgettable memories and a wonderful research life.

Finally, I want to thank the mentors of my life: my father 林文鈴, who quietly supported me, my mother 韓樹珍, who took care of me attentively, and my younger brother 懷志, who grew up with me. Under the pressure of research, your encouragement and support allowed me to keep going. I dedicate this thesis to you with my deepest gratitude.
CONTENTS

ABSTRACT
CONTENTS
CHAPTER 1  INTRODUCTION
  1.1 Motivation
  1.2 Review of music genre classification systems
    1.2.1 Feature Extraction
      1.2.1.1 Short-term features
        1.2.1.1.1 Timbral features
        1.2.1.1.2 Rhythmic features
        1.2.1.1.3 Pitch features
      1.2.1.2 Long-term features
        1.2.1.2.1 Mean and standard deviation
        1.2.1.2.2 Autoregressive model
        1.2.1.2.3 Modulation spectrum analysis
        1.2.1.2.4 Nonlinear time series analysis
    1.2.2 Linear discriminant analysis (LDA)
    1.2.3 Feature Classifier
  1.3 Outline of Thesis
CHAPTER 2  THE PROPOSED MUSIC GENRE CLASSIFICATION SYSTEM
  2.1 Feature Extraction
    2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)
    2.1.2 Octave-based Spectral Contrast (OSC)
    2.1.3 Normalized Audio Spectral Envelope (NASE)
    2.1.4 Modulation Spectral Analysis
      2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)
      2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)
      2.1.4.3 Modulation Spectral Contrast of NASE (MASE)
    2.1.5 Statistical Aggregation of Modulation Spectral Feature Values
      2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)
      2.1.5.2 Statistical Aggregation of MOSC (SMOSC)
      2.1.5.3 Statistical Aggregation of MASE (SMASE)
    2.1.6 Feature vector normalization
  2.2 Linear discriminant analysis
  2.3 Music Genre Classification phase
CHAPTER 3  EXPERIMENTAL RESULTS
  3.1 Comparison of row-based modulation spectral feature vectors
  3.2 Comparison of column-based modulation spectral feature vectors
  3.3 Combination of row-based and column-based modulation spectral feature vectors
CHAPTER 4  CONCLUSION
REFERENCES
Chapter 1
Introduction
1.1 Motivation

With the development of computer networks, it has become more and more popular to purchase and download digital music from the Internet. However, a general music database often contains millions of music tracks, so it is very difficult to manage such a large digital music database. For this reason, it is helpful to manage a vast amount of music tracks when they are properly categorized in advance. In general, retail or online music stores often organize their collections of music tracks by categories such as genre, artist, and album. Usually, the category information of a music track is manually labeled by experienced managers, but determining the music genre of a music track by experienced managers is laborious and time-consuming work. Therefore, a number of supervised classification techniques have been developed for the automatic classification of unlabeled music tracks [1-11]. Thus, in this study we focus on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.

To classify the music genre of a given music track, some discriminating audio features have to be extracted through content-based analysis of the music signal. In addition, many studies try to examine a set of classifiers to improve the classification performance; however, the resulting improvement is limited. In fact, employing effective feature sets has a much greater effect on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.
1.2 Review of Music Genre Classification Systems

The fundamental problem of a music genre classification system is to determine the structure of the taxonomy into which music pieces will be classified. However, it is hard to clearly define a universally agreed structure. In general, exploiting a hierarchical taxonomy structure for music genre classification has some merits: (1) People often prefer to search for music by browsing hierarchical catalogs. (2) Taxonomy structures identify the relationships or dependence between the music genres; thus, hierarchical taxonomy structures provide a coarse-to-fine classification approach that improves the classification efficiency and accuracy. (3) The classification errors become more acceptable by using a taxonomy than by direct music genre classification, because the coarse-to-fine approach concentrates the classification errors on a given level of the hierarchy.

Burred and Lerch [13] have developed a hierarchical taxonomy for music genre classification, as shown in Fig. 1.1. Rather than making a single decision to classify a given music piece into one of all music genres (the direct approach), the hierarchical approach makes successive decisions at each branch point of the taxonomy hierarchy. Additionally, appropriate and different features can be employed at each branch point of the taxonomy. Therefore, the hierarchical classification approach allows the managers to trace at which level the classification errors occur frequently. Barbedo and Lopes [14] have also defined a hierarchical taxonomy, as shown in Fig. 1.2. Their hierarchical structure was constructed in a bottom-up manner instead of a top-down manner, because it is easy to merge leaf classes into the same parent class in the bottom-up structure, so that the upper layers can be constructed easily. In their experimental results, the classification accuracy of the hierarchical bottom-up approach outperforms the top-down approach by about 3%-5%.

Li and Ogihara [15] investigated the effect of two different taxonomy structures for music genre classification. They also proposed an approach to the automatic generation of music genre taxonomies based on the confusion matrix computed by linear discriminant projection. This approach can reduce the time-consuming and expensive task of manual construction of taxonomies. It also helps to handle music collections for which there are no natural taxonomies [16]. Given a genre taxonomy, many different approaches have been proposed to classify the music genre of raw music tracks. In general, a music genre classification system consists of three major aspects: feature extraction, feature selection, and feature classification. Fig. 1.3 shows the block diagram of a music genre classification system.
Fig. 1.1 A hierarchical audio taxonomy

Fig. 1.2 A hierarchical audio taxonomy

Fig. 1.3 A music genre classification system
1.2.1 Feature Extraction

1.2.1.1 Short-term Features

The most important aspect of music genre classification is to determine which features are relevant and how to extract them. Tzanetakis and Cook [1] employed three feature sets, including timbral texture, rhythmic content, and pitch content, to classify audio collections by their musical genres.

1.2.1.1.1 Timbral features

Timbral features are generally characterized by properties related to instrumentations or sound sources, such as music, speech, or environmental signals. The features used to represent timbral texture are described as follows:

(1) Low-Energy Feature: it is defined as the percentage of analysis windows that have RMS energy less than the average RMS energy across the texture window. The size of the texture window should correspond to the minimum amount of time required to identify a particular music texture.
(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

$$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\,\right|$$

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

$$C_t = \frac{\sum_{n=1}^{N} n \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.
(4) Spectral Bandwidth: the spectral bandwidth measures the spread of the magnitude spectrum around the spectral centroid:

$$SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$
(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated:

$$\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]$$
(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitude spectra of successive frames:

$$SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2$$

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
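All of these timbral features are computed frame by frame from the same magnitude spectrum, so they can be gathered in one routine. The sketch below is an illustrative NumPy version (the helper name and the use of a one-sided FFT are assumptions, not part of the systems reviewed here):

```python
import numpy as np

def timbral_features(frame, prev_mag=None):
    """Compute frame-level timbral features for one audio frame (illustrative sketch)."""
    # Zero-crossing rate: half the number of sign changes within the frame
    signs = (frame > 0).astype(int)                      # sign() -> 1 for positive, 0 otherwise
    zcr = 0.5 * np.sum(np.abs(np.diff(signs)))
    # Magnitude spectrum of the frame (one-sided FFT)
    mag = np.abs(np.fft.rfft(frame))
    bins = np.arange(1, len(mag) + 1)
    centroid = np.sum(bins * mag) / np.sum(mag)          # center of gravity of the spectrum
    bandwidth = np.sum(((bins - centroid) ** 2) * mag) / np.sum(mag)
    # Roll-off: bin below which 85% of the magnitude distribution is concentrated
    cumulative = np.cumsum(mag)
    rolloff = np.searchsorted(cumulative, 0.85 * cumulative[-1])
    # Spectral flux: squared difference between normalized spectra of successive frames
    flux = 0.0
    if prev_mag is not None:
        flux = np.sum((mag / np.sum(mag) - prev_mag / np.sum(prev_mag)) ** 2)
    return zcr, centroid, bandwidth, rolloff, flux, mag  # pass `mag` as prev_mag for the next frame
```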
(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone; the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.

(8) Octave-based Spectral Contrast (OSC): OSC was developed to represent the spectral characteristics of a music piece [3]. This feature describes the strength of spectral peaks and spectral valleys in each subband separately. It can roughly reflect the distribution of harmonic and non-harmonic components.

(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.
1.2.1.1.2 Rhythmic features

The features representing the rhythmic content of a music piece are mainly derived from the beat histogram, including the overall beat strength, the main beat and its strength, the period of the main beat and sub-beats, and the relative strength of sub-beats to the main beat. Many beat-tracking algorithms [18, 19] that provide an estimate of the main beat and the corresponding strength have been proposed.
1.2.1.1.3 Pitch features

Tzanetakis et al. [20] extracted pitch features from the pitch histograms of a music piece. The extracted pitch features contain frequency, pitch strength, and pitch interval. The pitch histogram can be estimated by multiple pitch detection techniques [21, 22]. Melody and harmony have been widely used by musicologists to study musical structures. Scaringella et al. [23] proposed a method to extract melody and harmony features by characterizing the pitch distribution of a short segment, like most melody/harmony analyzers; the main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.
1.2.1.2 Long-term Features

To find a representative feature vector of a whole music piece, the methods employed to integrate the short-term features into a long-term feature include the mean and standard deviation, the autoregressive model [9], modulation spectrum analysis [24, 25, 26], and nonlinear time series analysis.
1.2.1.2.1 Mean and standard deviation

The mean and standard deviation operation is the most widely used method to integrate the short-term features. Let x_i = [x_i[0], x_i[1], ..., x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d < D$$

$$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\bigl(x_i[d] - \mu[d]\bigr)^2\right]^{1/2}, \quad 0 \le d < D$$

where T is the number of frames of the input signal. This statistical method exhibits no information about the relationship between features or about the time-varying behavior of music signals.
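A minimal sketch of this aggregation, assuming the short-term feature vectors of a track are stacked into a T x D matrix:

```python
import numpy as np

def mean_std_aggregation(frame_features):
    """frame_features: T x D matrix of short-term feature vectors."""
    mu = frame_features.mean(axis=0)      # mean over the T frames
    sigma = frame_features.std(axis=0)    # standard deviation over the T frames
    return np.concatenate([mu, sigma])    # 2D-dimensional long-term feature vector
```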
1.2.1.2.2 Autoregressive model (AR model)

Meng et al. [9] used the AR model to analyze the time-varying texture of music signals. They proposed diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analysis to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model; the extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. For a p-order MAR model, the coefficient dimension is p x D x D, where D is the feature dimension of a short-term feature vector.
1.2.1.2.3 Modulation spectrum analysis

The idea of modulation spectrum analysis is to model the frequency variability of signals along the time axis. Kingsbury et al. [24] first employed the modulation spectrogram for speech recognition; it has been shown that the modulation frequency to which human audition is most sensitive is about 4 Hz. Sukittanon et al. [25] used modulation spectrum analysis for music content identification and showed that modulation-scale features along with subband normalization are insensitive to convolutional noise. Shi et al. [26] used modulation spectrum analysis to model the long-term characteristics of music signals in order to extract tempo features for music emotion classification.
1.2.1.2.4 Nonlinear time series analysis

Nonlinear analysis of time series offers an alternative way to describe temporal structure that is complementary to the analysis of linear correlation and spectral properties. Mierswa and Morik [27] used the reconstructed phase space to extract features directly from the audio data; the means and standard deviations of the distances and angles in the phase space, with an embedding dimension of two and unit time lag, were used.
1.2.2 Linear discriminant analysis (LDA)

LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representation of the various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from an n-dimensional feature space to a d-dimensional space (d <= n) is determined; the transformation should enhance the separability among different classes. The optimal transformation matrix can then be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.

In LDA, each class is generally modeled by a single Gaussian distribution; in fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all classes, which does not consider class-wise differences.
1.2.3 Feature Classifier

Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The hierarchical genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet; in Jazz, the sub-genres are Big Band, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.
West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiments, the feature vector with the GMM classifier and the decision tree achieves the best accuracy, 82.79%.
Xu et al. [29] applied SVM to discriminate between pure music and vocal music. The SVM learning algorithm is applied to obtain the classification parameters according to the calculated features. It is demonstrated that SVM achieves better performance than traditional Euclidean distance methods and hidden Markov model (HMM) methods.
Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.
Bagci and Erzin [8] constructed a novel frame-based music genre classification system in which some invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames that cannot be correctly classified, and the GMM model of each music genre is updated with each correctly classified frame; moreover, another GMM model is employed to represent the invalid frames. In their experiments, the feature vector includes 13 MFCC, 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy reaches 88.60% when the frame length is 30 s and each GMM is modeled with 48 Gaussian distributions.
Umapathy et al. [31] used the local discriminant bases (LDB) technique to measure the dissimilarity of the LDB nodes of any two classes and to extract features from the high-dissimilarity LDB nodes. First, they use the wavelet packet tree decomposition to construct a five-level tree for a music signal. Then two novel measures, the energy distribution over frequencies (D1) and the nonstationarity index (D2), are used to measure the dissimilarity of the LDB nodes of any two classes. In their classification system, the feature dimension is 30, including the energies and variances of the basis vector coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile, human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).
Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The WPT is a variant of the DWT obtained by recursively convolving the input signal with a pair of low-pass and high-pass filters; unlike the DWT, which recursively decomposes only the low-pass subband, the DWPT decomposes both subbands at each level.
Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion [34]. It was originally designed for binary classification and was later extended to multiclass classification using several different strategies.
1.3 Outline of Thesis

In Chapter 2, the proposed method for music genre classification is introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, conclusions are given in Chapter 4.
Chapter 2
The Proposed Music Genre Classification Method

The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig. 1.3. A detailed description of each module is given below.
2.1 Feature Extraction

A novel feature set derived from modulation spectral analysis of the spectral (OSC and NASE) and cepstral (MFCC) feature trajectories of music signals is proposed for music genre classification.

2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal; the detailed steps are given below.
Step 1: Pre-emphasis

$$\hat{s}[n] = s[n] - a \times s[n-1] \quad (1)$$

where s[n] is the current sample, s[n-1] is the previous sample, and a typical value for a is 0.95.
Step 2: Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples); each pair of consecutive frames overlaps by M samples.
Step 3: Windowing

Each frame is multiplied by a Hamming window:

$$\tilde{s}_i[n] = \hat{s}_i[n]\,w[n], \quad 0 \le n < N \quad (2)$$

where the Hamming window function w[n] is defined as

$$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n < N \quad (3)$$
Step 4: Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$$X_i[k] = \sum_{n=0}^{N-1}\tilde{s}_i[n]\,e^{-j\frac{2\pi kn}{N}}, \quad 0 \le k < N \quad (4)$$

where k is the frequency index.
Step 5: Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B,\ 0 \le k < N \quad (5)$$

where B is the total number of filters (B = 25 in this study), and I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

$$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N} \quad (6)$$

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.
Step 6: Discrete Cosine Transform (DCT)

MFCC are obtained by applying the DCT to the logarithm of E(b):

$$MFCC_i(l) = \sum_{b=0}^{B-1}\log_{10}\bigl(1 + E_i(b)\bigr)\cos\!\left(\frac{(b+0.5)\,l\,\pi}{B}\right), \quad 0 \le l < L \quad (7)$$

where L is the length of the MFCC feature vector (L = 20 in this study).
Therefore, the MFCC feature vector can be represented as follows:

$$\mathbf{x}^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T \quad (8)$$
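A compact sketch of Steps 1-6 is given below. It is an illustrative NumPy version, not the exact implementation of this thesis: the band energies follow eq. (5) (a plain summation over each band rather than an explicit triangular weighting), and the frame length, hop size, and function names are assumptions.

```python
import numpy as np

def extract_mfcc(signal, fs, band_edges, N=1024, hop=512, a=0.95, L=20):
    """Sketch of the MFCC steps above.
    band_edges: list of (f_low, f_high) pairs, e.g. the 25 bands of Table 2.1."""
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])   # Step 1: pre-emphasis
    window = np.hamming(N)                                            # Step 3: Hamming window
    B = len(band_edges)
    mfccs = []
    for start in range(0, len(emphasized) - N + 1, hop):              # Step 2: framing
        frame = emphasized[start:start + N] * window
        A = np.abs(np.fft.fft(frame)) ** 2                            # Step 4: squared spectrum
        E = np.zeros(B)
        for b, (fl, fh) in enumerate(band_edges):                     # Step 5: band energies (eqs. 5-6)
            il, ih = int(fl / (fs / N)), int(fh / (fs / N))
            E[b] = A[il:ih + 1].sum()
        # Step 6: DCT of the log band energies (eq. 7)
        mfcc = np.array([np.sum(np.log10(1 + E) *
                                np.cos((np.arange(B) + 0.5) * l * np.pi / B))
                         for l in range(L)])
        mfccs.append(mfcc)
    return np.array(mfccs)    # one L-dimensional MFCC vector per frame
```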
Fig. 2.1 The flowchart for computing MFCC (pre-emphasis, framing, windowing, FFT, Mel-scale band-pass filtering, DCT)
Table 2.1 The frequency range of each triangular band-pass filter

Filter number | Frequency interval (Hz)
0  | (0, 200]
1  | (100, 300]
2  | (200, 400]
3  | (300, 500]
4  | (400, 600]
5  | (500, 700]
6  | (600, 800]
7  | (700, 900]
8  | (800, 1000]
9  | (900, 1149]
10 | (1000, 1320]
11 | (1149, 1516]
12 | (1320, 1741]
13 | (1516, 2000]
14 | (1741, 2297]
15 | (2000, 2639]
16 | (2297, 3031]
17 | (2639, 3482]
18 | (3031, 4000]
19 | (3482, 4595]
20 | (4000, 5278]
21 | (4595, 6063]
22 | (5278, 6964]
23 | (6063, 8000]
24 | (6964, 9190]
2.1.2 Octave-based Spectral Contrast (OSC)

OSC was developed to represent the spectral characteristics of a music signal. It considers the spectral peak and valley in each subband independently. In general, spectral peaks correspond to harmonic components and spectral valleys to non-harmonic components or noise in music signals; therefore, the difference between spectral peaks and spectral valleys reflects the spectral contrast distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature, and the detailed steps are described below.
Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and the FFT is then used to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:

$$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B,\ 0 \le k < N \quad (9)$$

where B is the number of subbands, and I_b^l and I_b^h denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

$$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N} \quad (10)$$

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.
Step 3: Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, ..., M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} >= M_{b,2} >= ... >= M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$$Peak(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right) \quad (11)$$

$$Valley(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right) \quad (12)$$

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$$SC(b) = Peak(b) - Valley(b) \quad (13)$$

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$$\mathbf{x}^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T \quad (14)$$
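A minimal sketch of the per-frame OSC computation following eqs. (9)-(14); the band edges are those of Table 2.2, α = 0.2 as stated above, and the small epsilon inside the logarithm is an added assumption to avoid log(0):

```python
import numpy as np

def osc_frame(frame, fs, band_edges, alpha=0.2):
    """Octave-based spectral contrast of one frame (illustrative sketch)."""
    N = len(frame)
    A = np.abs(np.fft.fft(frame)) ** 2                      # squared magnitude spectrum
    valleys, contrasts = [], []
    for fl, fh in band_edges:                               # octave-scale filtering (eqs. 9-10)
        il, ih = int(fl / (fs / N)), int(fh / (fs / N))
        band = np.sort(A[il:ih + 1])[::-1]                  # M_{b,1} >= M_{b,2} >= ... >= M_{b,Nb}
        nb = max(1, int(round(alpha * len(band))))          # neighborhood of alpha*Nb bins
        peak = np.log(band[:nb].mean() + 1e-12)             # eq. (11), epsilon assumed
        valley = np.log(band[-nb:].mean() + 1e-12)          # eq. (12), epsilon assumed
        valleys.append(valley)
        contrasts.append(peak - valley)                     # spectral contrast (eq. 13)
    return np.array(valleys + contrasts)                    # OSC feature vector (eq. 14)
```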
Fig. 2.2 The flowchart for computing OSC (framing, FFT, octave-scale filtering, peak/valley selection, spectral contrast)
Table 2.2 The frequency range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number | Frequency interval (Hz)
0 | [0, 0]
1 | (0, 100]
2 | (100, 200]
3 | (200, 400]
4 | (400, 800]
5 | (800, 1600]
6 | (1600, 3200]
7 | (3200, 6400]
8 | (6400, 12800]
9 | (12800, 22050)
2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE is defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.
Step 1: Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum X(k), 1 <= k <= N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$$P(k) = \begin{cases} \dfrac{1}{E_w N_w}\,|X(k)|^2, & k = 0,\ k = \dfrac{N}{2} \\[1.5ex] \dfrac{2}{E_w N_w}\,|X(k)|^2, & 0 < k < \dfrac{N}{2} \end{cases} \quad (15)$$

where E_w is the energy of the Hamming window function w(n) of size N_w:

$$E_w = \sum_{n=0}^{N_w-1}|w(n)|^2 \quad (16)$$
Step 2: Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE-scale filtering operation can be described as follows (see Table 2.3):

$$ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k), \quad 0 \le b < B,\ 0 \le k < N \quad (17)$$

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

$$r = 2^j \text{ octaves}, \quad -4 \le j \le 3 \quad (18)$$

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N} \quad (19)$$

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.
Step 3: Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$$ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k), \quad 0 \le b \le B+1 \quad (20)$$

Each ASE coefficient is then converted to the decibel scale:

$$ASE_{dB}(b) = 10\log_{10}\bigl(ASE(b)\bigr), \quad 0 \le b \le B+1 \quad (21)$$

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 \quad (22)$$

where the RMS-norm gain value R is defined as

$$R = \sqrt{\sum_{b=0}^{B+1}\bigl(ASE_{dB}(b)\bigr)^2} \quad (23)$$

In MPEG-7, the ASE coefficients consist of one coefficient representing the power between 0 Hz and loEdge, a series of coefficients representing the power in logarithmically spaced bands between loEdge and hiEdge, one coefficient representing the power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame can be represented as follows:

$$\mathbf{x}^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T \quad (24)$$
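A compact sketch of the per-frame NASE computation (eqs. 15-24), assuming the logarithmic band edges of Table 2.3 and an added epsilon inside the logarithm; here the window size N_w equals the FFT size N:

```python
import numpy as np

def nase_frame(frame, fs, band_edges):
    """Normalized audio spectral envelope of one frame (illustrative sketch)."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                                    # window energy (eq. 16)
    X = np.fft.fft(frame * w)
    P = np.abs(X) ** 2 / (Ew * N)                          # normalized power spectrum (eq. 15)
    P[1:N // 2] *= 2.0                                     # double the one-sided bins, except k = 0 and k = N/2
    ase = np.array([P[int(fl / (fs / N)):int(fh / (fs / N)) + 1].sum()
                    for fl, fh in band_edges])             # subband power (eqs. 17, 20)
    ase_db = 10.0 * np.log10(ase + 1e-12)                  # decibel scale (eq. 21), epsilon assumed
    R = np.sqrt(np.sum(ase_db ** 2))                       # RMS-norm gain value (eq. 23)
    return np.concatenate([[R], ase_db / R])               # NASE vector (eqs. 22, 24)
```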
Fig. 2.3 The flowchart for computing NASE (framing, windowing, FFT, subband decomposition, normalized audio spectral envelope)
Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 logarithmic subband coefficients, one coefficient above hiEdge)
Table 2.3 The frequency range of each normalized audio spectral envelope band-pass filter

Filter number | Frequency interval (Hz)
0  | (0, 62]
1  | (62, 88]
2  | (88, 125]
3  | (125, 176]
4  | (176, 250]
5  | (250, 353]
6  | (353, 500]
7  | (500, 707]
8  | (707, 1000]
9  | (1000, 1414]
10 | (1414, 2000]
11 | (2000, 2828]
12 | (2828, 4000]
13 | (4000, 5656]
14 | (5656, 8000]
15 | (8000, 11313]
16 | (11313, 16000]
17 | (16000, 22050]
2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we apply modulation spectral analysis to the MFCC, OSC, and NASE trajectories to observe the variations of the sound.
2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.
Step 2: Modulation Spectrum Analysis

Let MFCC_i[l], 0 <= l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \cdot W + n}[l]\,e^{-j\frac{2\pi mn}{W}}, \quad 0 \le m < W,\ 0 \le l < L \quad (25)$$

where M_t(m, l) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$M^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T}\bigl|M_t(m, l)\bigr|, \quad 0 \le m < W,\ 0 \le l < L \quad (26)$$

where T is the total number of texture windows in the music track.
Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} M^{MFCC}(m, l) \quad (27)$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} M^{MFCC}(m, l) \quad (28)$$

where Φ_j^l and Φ_j^h are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 <= j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \quad (29)$$

As a result, all MSCs (and MSVs) form an L x J matrix that contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2 x 20 x 8 = 320.
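The same three steps apply to the OSC and NASE trajectories in the following two subsections, so the procedure can be sketched once generically. The following illustrative NumPy version assumes W = 512 and the eight modulation subbands of Table 2.4; it is a sketch, not the exact implementation of this thesis:

```python
import numpy as np

# Modulation subband boundaries in modulation-frequency bins (Table 2.4)
SUBBAND_EDGES = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_spectral_contrast(features, W=512):
    """features: T_frames x L matrix of frame-based feature trajectories (e.g. MFCC).
    Returns the averaged J x L MSC and MSV matrices."""
    hop = W // 2                                              # 50% overlap between texture windows
    spectra = []
    for start in range(0, features.shape[0] - W + 1, hop):
        window = features[start:start + W]                    # one texture window
        spectra.append(np.abs(np.fft.fft(window, axis=0)))    # FFT along each time trajectory (eq. 25)
    M = np.mean(spectra, axis=0)                              # time-averaged modulation spectrogram (eq. 26)
    msc, msv = [], []
    for lo, hi in SUBBAND_EDGES:
        band = M[lo:hi]                                       # one modulation subband
        peak, valley = band.max(axis=0), band.min(axis=0)     # MSP / MSV (eqs. 27-28)
        msc.append(peak - valley)                             # modulation spectral contrast (eq. 29)
        msv.append(valley)
    return np.array(msc), np.array(msv)                       # each of shape J x L
```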
Fig. 2.5 The flowchart for extracting MMFCC

2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.

Step 1: Framing and OSC Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let OSC_i[d], 0 <= d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \cdot W + n}[d]\,e^{-j\frac{2\pi mn}{W}}, \quad 0 \le m < W,\ 0 \le d < D \quad (30)$$

where M_t(m, d) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$M^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\bigl|M_t(m, d)\bigr|, \quad 0 \le m < W,\ 0 \le d < D \quad (31)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} M^{OSC}(m, d) \quad (32)$$

$$MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} M^{OSC}(m, d) \quad (33)$$

where Φ_j^l and Φ_j^h are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 <= j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \quad (34)$$

As a result, all MSCs (and MSVs) form a D x J matrix that contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2 x 20 x 8 = 320.
Fig. 2.6 The flowchart for extracting MOSC

2.1.4.3 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction

Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis

Let NASE_i[d], 0 <= d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \cdot W + n}[d]\,e^{-j\frac{2\pi mn}{W}}, \quad 0 \le m < W,\ 0 \le d < D \quad (35)$$

where M_t(m, d) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

$$M^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\bigl|M_t(m, d)\bigr|, \quad 0 \le m < W,\ 0 \le d < D \quad (36)$$

where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination

The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study; see Table 2.4). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

$$MSP^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} M^{NASE}(m, d) \quad (37)$$

$$MSV^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} M^{NASE}(m, d) \quad (38)$$

where Φ_j^l and Φ_j^h are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 <= j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \quad (39)$$

As a result, all MSCs (and MSVs) form a D x J matrix that contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2 x 19 x 8 = 304.
Fig. 2.7 The flowchart for extracting MASE (framing, NASE extraction, DFT along each feature trajectory, averaging of the modulation spectra, contrast/valley determination)
Table 2.4 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2)     | [0, 0.33)
1 | [2, 4)     | [0.33, 0.66)
2 | [4, 8)     | [0.66, 1.32)
3 | [8, 16)    | [1.32, 2.64)
4 | [16, 32)   | [2.64, 5.28)
5 | [32, 64)   | [5.28, 10.56)
6 | [64, 128)  | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]
2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflect the beat intervals of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.
2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 <= l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

$$\mu_{MSC,row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \quad (40)$$

$$\sigma_{MSC,row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{MFCC}(j, l) - \mu_{MSC,row}^{MFCC}(l)\bigr)^2\right)^{1/2} \quad (41)$$

$$\mu_{MSV,row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \quad (42)$$

$$\sigma_{MSV,row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{MFCC}(j, l) - \mu_{MSV,row}^{MFCC}(l)\bigr)^2\right)^{1/2} \quad (43)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

$$\mathbf{f}_{row}^{MFCC} = \bigl[\mu_{MSC,row}^{MFCC}(0), \sigma_{MSC,row}^{MFCC}(0), \mu_{MSV,row}^{MFCC}(0), \sigma_{MSV,row}^{MFCC}(0), \ldots, \mu_{MSC,row}^{MFCC}(L-1), \sigma_{MSC,row}^{MFCC}(L-1), \mu_{MSV,row}^{MFCC}(L-1), \sigma_{MSV,row}^{MFCC}(L-1)\bigr]^T \quad (44)$$

Similarly, the modulation spectral feature values derived from the j-th (0 <= j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC,col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \quad (45)$$

$$\sigma_{MSC,col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSC^{MFCC}(j, l) - \mu_{MSC,col}^{MFCC}(j)\bigr)^2\right)^{1/2} \quad (46)$$

$$\mu_{MSV,col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \quad (47)$$

$$\sigma_{MSV,col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSV^{MFCC}(j, l) - \mu_{MSV,col}^{MFCC}(j)\bigr)^2\right)^{1/2} \quad (48)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{MFCC} = \bigl[\mu_{MSC,col}^{MFCC}(0), \sigma_{MSC,col}^{MFCC}(0), \mu_{MSV,col}^{MFCC}(0), \sigma_{MSV,col}^{MFCC}(0), \ldots, \mu_{MSC,col}^{MFCC}(J-1), \sigma_{MSC,col}^{MFCC}(J-1), \mu_{MSV,col}^{MFCC}(J-1), \sigma_{MSV,col}^{MFCC}(J-1)\bigr]^T \quad (49)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) is obtained:

$$\mathbf{f}^{MFCC} = \bigl[(\mathbf{f}_{row}^{MFCC})^T, (\mathbf{f}_{col}^{MFCC})^T\bigr]^T \quad (50)$$

In summary, the row-based MSCs and MSVs give 4L = 4 x 20 = 80 values and the column-based MSCs and MSVs give 4J = 4 x 8 = 32 values. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J; that is, the overall feature dimension of SMMFCC is 80+32 = 112.
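In code, this row- and column-based aggregation is simply a pair of mean/standard-deviation reductions over the two axes of the MSC and MSV matrices. A sketch using the J x L matrices returned by the earlier modulation spectral contrast sketch:

```python
import numpy as np

def statistical_aggregation(msc, msv):
    """msc, msv: J x L modulation spectral contrast / valley matrices.
    Returns the (4L + 4J)-dimensional SMMFCC-style feature vector (illustrative sketch)."""
    row = np.concatenate([msc.mean(axis=0), msc.std(axis=0),     # per feature value, over subbands (eqs. 40-43)
                          msv.mean(axis=0), msv.std(axis=0)])    # -> 4L values
    col = np.concatenate([msc.mean(axis=1), msc.std(axis=1),     # per modulation subband, over feature values (eqs. 45-48)
                          msv.mean(axis=1), msv.std(axis=1)])    # -> 4J values
    return np.concatenate([row, col])                            # combined vector (eq. 50)
```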
2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 <= d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

$$\mu_{MSC,row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \quad (51)$$

$$\sigma_{MSC,row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j, d) - \mu_{MSC,row}^{OSC}(d)\bigr)^2\right)^{1/2} \quad (52)$$

$$\mu_{MSV,row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \quad (53)$$

$$\sigma_{MSV,row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j, d) - \mu_{MSV,row}^{OSC}(d)\bigr)^2\right)^{1/2} \quad (54)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{OSC} = \bigl[\mu_{MSC,row}^{OSC}(0), \sigma_{MSC,row}^{OSC}(0), \mu_{MSV,row}^{OSC}(0), \sigma_{MSV,row}^{OSC}(0), \ldots, \mu_{MSC,row}^{OSC}(D-1), \sigma_{MSC,row}^{OSC}(D-1), \mu_{MSV,row}^{OSC}(D-1), \sigma_{MSV,row}^{OSC}(D-1)\bigr]^T \quad (55)$$

Similarly, the modulation spectral feature values derived from the j-th (0 <= j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC,col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \quad (56)$$

$$\sigma_{MSC,col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j, d) - \mu_{MSC,col}^{OSC}(j)\bigr)^2\right)^{1/2} \quad (57)$$

$$\mu_{MSV,col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \quad (58)$$

$$\sigma_{MSV,col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j, d) - \mu_{MSV,col}^{OSC}(j)\bigr)^2\right)^{1/2} \quad (59)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = \bigl[\mu_{MSC,col}^{OSC}(0), \sigma_{MSC,col}^{OSC}(0), \mu_{MSV,col}^{OSC}(0), \sigma_{MSV,col}^{OSC}(0), \ldots, \mu_{MSC,col}^{OSC}(J-1), \sigma_{MSC,col}^{OSC}(J-1), \mu_{MSV,col}^{OSC}(J-1), \sigma_{MSV,col}^{OSC}(J-1)\bigr]^T \quad (60)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) is obtained:

$$\mathbf{f}^{OSC} = \bigl[(\mathbf{f}_{row}^{OSC})^T, (\mathbf{f}_{col}^{OSC})^T\bigr]^T \quad (61)$$

In summary, the row-based MSCs and MSVs give 4D = 4 x 20 = 80 values and the column-based MSCs and MSVs give 4J = 4 x 8 = 32 values. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.
2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 <= d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

$$\mu_{MSC,row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \quad (62)$$

$$\sigma_{MSC,row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{NASE}(j, d) - \mu_{MSC,row}^{NASE}(d)\bigr)^2\right)^{1/2} \quad (63)$$

$$\mu_{MSV,row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \quad (64)$$

$$\sigma_{MSV,row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{NASE}(j, d) - \mu_{MSV,row}^{NASE}(d)\bigr)^2\right)^{1/2} \quad (65)$$

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{NASE} = \bigl[\mu_{MSC,row}^{NASE}(0), \sigma_{MSC,row}^{NASE}(0), \mu_{MSV,row}^{NASE}(0), \sigma_{MSV,row}^{NASE}(0), \ldots, \mu_{MSC,row}^{NASE}(D-1), \sigma_{MSC,row}^{NASE}(D-1), \mu_{MSV,row}^{NASE}(D-1), \sigma_{MSV,row}^{NASE}(D-1)\bigr]^T \quad (66)$$

Similarly, the modulation spectral feature values derived from the j-th (0 <= j < J) column of the MSC and MSV matrices can be computed as follows:

$$\mu_{MSC,col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \quad (67)$$

$$\sigma_{MSC,col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{NASE}(j, d) - \mu_{MSC,col}^{NASE}(j)\bigr)^2\right)^{1/2} \quad (68)$$

$$\mu_{MSV,col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \quad (69)$$

$$\sigma_{MSV,col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{NASE}(j, d) - \mu_{MSV,col}^{NASE}(j)\bigr)^2\right)^{1/2} \quad (70)$$

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = \bigl[\mu_{MSC,col}^{NASE}(0), \sigma_{MSC,col}^{NASE}(0), \mu_{MSV,col}^{NASE}(0), \sigma_{MSV,col}^{NASE}(0), \ldots, \mu_{MSC,col}^{NASE}(J-1), \sigma_{MSC,col}^{NASE}(J-1), \mu_{MSV,col}^{NASE}(J-1), \sigma_{MSV,col}^{NASE}(J-1)\bigr]^T \quad (71)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) is obtained:

$$\mathbf{f}^{NASE} = \bigl[(\mathbf{f}_{row}^{NASE})^T, (\mathbf{f}_{col}^{NASE})^T\bigr]^T \quad (72)$$

In summary, the row-based MSCs and MSVs give 4D = 4 x 19 = 76 values and the column-based MSCs and MSVs give 4J = 4 x 8 = 32 values. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
Fig. 2.8 The row-based modulation spectral feature values (mean and standard deviation computed along each row of the MSC/MSV matrix, i.e., across the modulation subbands for each feature dimension)

Fig. 2.9 The column-based modulation spectral feature values (mean and standard deviation computed along each column of the MSC/MSV matrix, i.e., across the feature dimensions for each modulation subband)
2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

$$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n} \quad (73)$$

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector \hat{f}_c:

$$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C \quad (74)$$

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$$f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m) \quad (75)$$

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
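A sketch of this linear (min-max) normalization, assuming the training feature vectors are stacked into an N x M matrix and that a small epsilon is added to guard against constant features:

```python
import numpy as np

def fit_min_max(train_features):
    """train_features: N x M matrix of all training feature vectors (eq. 75)."""
    return train_features.min(axis=0), train_features.max(axis=0)

def normalize(vec, f_min, f_max):
    """Map every feature value into [0, 1] (eq. 74); epsilon assumed to avoid division by zero."""
    return (vec - f_min) / (f_max - f_min + 1e-12)
```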
2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h <= H) has to be found in order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c}(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T \quad (76)$$

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_B = \sum_{c=1}^{C} N_c(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T \quad (77)$$

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

$$J_F(\mathbf{A}) = \mathrm{tr}\Bigl((\mathbf{A}^T\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^T\mathbf{S}_B\mathbf{A})\Bigr) \quad (78)$$

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

$$\mathbf{x}_w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\mathbf{x} \quad (79)$$

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

$$\mathbf{A}_{WLDA} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi} \quad (80)$$

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^T\mathbf{x} \quad (81)$$
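A compact sketch of the whitened LDA transform described above (scatter matrices as in eqs. 76-77, whitening and projection as in eqs. 79-81); this is an illustrative NumPy version with a small regularization term added, not the exact implementation of the thesis:

```python
import numpy as np

def whitened_lda(X, y, n_components):
    """X: N x H training matrix, y: class labels. Returns the H x h matrix A_WLDA."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                        # within-class scatter (eq. 76)
        d = (mc - mean_all)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)                        # between-class scatter (eq. 77)
    evals, Phi = np.linalg.eigh(Sw)                          # eigen-decomposition of S_W
    W = Phi @ np.diag(1.0 / np.sqrt(evals + 1e-12))          # whitening transform Phi Lambda^(-1/2)
    Sb_w = W.T @ Sb @ W                                      # whitened between-class scatter
    evals_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals_b)[::-1][:n_components]         # keep the (C-1) largest eigenvalues
    return W @ Psi[:, order]                                 # A_WLDA (eq. 80)

# Projection of a feature vector x: y = A_WLDA.T @ x (eq. 81)
```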
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre. The distance between two feature vectors is measured by the Euclidean
distance. Thus, the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has the minimum
Euclidean distance to y:
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
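
A compact sketch of this classification step is given below. It assumes that the per-dimension normalization bounds, the matrix A_WLDA and the class centroids of Eq. (82) have already been computed from the training data; the min-max form of the linear normalization and all function and variable names are illustrative assumptions rather than the exact implementation.

import numpy as np

def classify_track(x, feat_min, feat_max, A_wlda, centroids):
    """Nearest centroid classification of one feature vector (Eqs. 81-83).

    x:         raw modulation spectral feature vector of the test track
    feat_min:  per-dimension minima observed on the training set
    feat_max:  per-dimension maxima observed on the training set
    A_wlda:    whitened LDA transformation matrix of shape (H, h)
    centroids: (C, h) array whose rows are the class centroids of Eq. (82)
    Returns the index s of the identified music genre.
    """
    x_norm = (x - feat_min) / (feat_max - feat_min + 1e-12)  # linear normalization
    y = A_wlda.T @ x_norm                                    # Eq. (81)
    distances = np.linalg.norm(centroids - y, axis=1)        # Euclidean distance to each centroid
    return int(np.argmin(distances))                         # Eq. (83)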
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison. The database consists of 1458 music tracks, in
which 729 music tracks are used for training and the other 729 tracks for testing. The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this
study, each MP3 audio file is first converted into raw digital audio before
classification. These music tracks are classified into six classes (that is, C = 6):
Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop and World. In summary, the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop and 122/122 tracks of World.
Since the music tracks per class are not equally distributed, the overall accuracy
of correctly classified genres is evaluated as follows:
CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)
where P_c is the probability of appearance of the c-th music genre and CA_c is the
classification accuracy for the c-th music genre.
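
Given a confusion matrix whose columns correspond to the true genres (as in the tables below), Eq. (84) can be evaluated directly; the following small sketch is illustrative only and its variable names are assumptions.

import numpy as np

def overall_accuracy(confusion):
    """Weighted overall accuracy of Eq. (84).

    confusion: (C, C) array whose entry (i, j) counts the tracks of true
               genre j that are classified as genre i.
    """
    class_counts = confusion.sum(axis=0)               # number of tracks per true genre
    per_class_acc = np.diag(confusion) / class_counts  # CA_c
    weights = class_counts / class_counts.sum()        # P_c
    return float(np.sum(weights * per_class_acc))      # CA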
3.1 Comparison of row-based modulation spectral feature vectors
Table 3.1 shows the average classification accuracy for each row-based
modulation spectral feature vector. In this table, SMMFCC1, SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 3.1 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and that the combined feature vector performs the best. Table 3.2 shows the
corresponding confusion matrices.
Table 3.1 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set                          CA (%)
SMMFCC1                              77.50
SMOSC1                               79.15
SMASE1                               77.78
SMMFCC1+SMOSC1+SMASE1                84.64
Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set, the first matrix gives the number of classified tracks and the second the per-class accuracy in percent; columns correspond to the true genres.

(a) SMMFCC1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19
Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4
MetalPunk 2 3 0 36 20 4
PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75
Total 320 114 26 45 102 122

(a) SMMFCC1 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.94 0.00 7.69 0.00 0.98 15.57
Electronic 0.00 79.82 0.00 2.22 6.86 4.92
Jazz 1.88 0.00 69.23 0.00 0.00 3.28
MetalPunk 0.63 2.63 0.00 80.00 19.61 3.28
PopRock 1.25 10.53 19.23 17.78 68.63 11.48
World 10.31 7.02 3.85 0.00 3.92 61.48

(b) SMOSC1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10
Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6
MetalPunk 0 5 0 32 21 3
PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84
Total 320 114 26 45 102 122

(b) SMOSC1 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 91.25 0.88 3.85 0.00 1.96 8.20
Electronic 0.31 78.07 3.85 4.44 10.78 9.02
Jazz 1.25 0.00 73.08 2.22 0.98 4.92
MetalPunk 0.00 4.39 0.00 71.11 20.59 2.46
PopRock 0.00 11.40 11.54 22.22 59.80 6.56
World 7.19 5.26 7.69 0.00 5.88 68.85

(c) SMASE1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18
Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9
MetalPunk 0 4 1 36 18 4
PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73
Total 320 114 26 45 102 122

(c) SMASE1 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 89.38 2.63 3.85 0.00 2.94 14.75
Electronic 0.00 76.32 3.85 2.22 8.82 4.10
Jazz 1.56 3.51 65.38 0.00 0.00 7.38
MetalPunk 0.00 3.51 3.85 80.00 17.65 3.28
PopRock 0.31 8.77 11.54 15.56 66.67 10.66
World 8.75 5.26 11.54 2.22 3.92 59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9
Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1
MetalPunk 0 1 0 34 8 1
PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86
Total 320 114 26 45 102 122

(d) SMMFCC1+SMOSC1+SMASE1 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 3.85 0.00 0.00 7.38
Electronic 0.00 84.21 3.85 2.22 8.82 7.38
Jazz 0.63 0.88 80.77 0.00 0.00 0.82
MetalPunk 0.00 0.88 0.00 75.56 7.84 0.82
PopRock 0.31 7.89 7.69 20.00 78.43 13.11
World 5.31 6.14 3.85 2.22 4.90 70.49
3.2 Comparison of column-based modulation spectral feature vectors
Table 3.3 shows the average classification accuracy for each column-based
modulation spectral feature vector. In this table, SMMFCC2, SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 3.3 we can see
that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2,
which is different from the row-based case. Consistent with the row-based results, the
combined feature vector again gives the best performance. Table 3.4 shows the
corresponding confusion matrices.
Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                          CA (%)
SMMFCC2                              70.64
SMOSC2                               68.59
SMASE2                               71.74
SMMFCC2+SMOSC2+SMASE2                78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set, the first matrix gives the number of classified tracks and the second the per-class accuracy in percent; columns correspond to the true genres.

(a) SMMFCC2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22
Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19
MetalPunk 2 7 0 39 30 4
PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54
Total 320 114 26 45 102 122

(a) SMMFCC2 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26

(b) SMOSC2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33
Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20
MetalPunk 1 5 0 33 21 2
PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51
Total 320 114 26 45 102 122

(b) SMOSC2 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80

(c) SMASE2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122

(c) SMASE2 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) SMMFCC2+SMOSC2+SMASE2 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11
3.3 Combination of row-based and column-based modulation spectral feature vectors
Table 3.5 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors. SMMFCC3,
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC,
OSC and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that
the combined feature vector achieves better classification performance than each
individual row-based or column-based feature vector. In particular, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 3.6 shows the corresponding confusion matrices.
Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                          CA (%)
SMMFCC3                              80.38
SMOSC3                               81.34
SMASE3                               81.21
SMMFCC3+SMOSC3+SMASE3                85.32
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set, the first matrix gives the number of classified tracks and the second the per-class accuracy in percent; columns correspond to the true genres.

(a) SMMFCC3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) SMMFCC3 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57

(b) SMOSC3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) SMOSC3 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) SMASE3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) SMASE3 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) SMMFCC3+SMOSC3+SMASE3 (accuracy, %)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
Conventional methods use the energy of each modulation subband as the
feature value. In contrast, we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature values. Table 3.7 shows the classification results of these two
approaches. From Table 3.7 we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined. In this table,
SMMFCC1, SMMFCC2 and SMMFCC3 denote respectively the row-based,
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC.
Table 3.7 Comparison of the averaged classification accuracy (%) when the MSCs & MSVs or the modulation subband energy (MSE) is used as the feature value

Feature Set                          MSCs & MSVs    MSE
SMMFCC1                              77.50          72.02
SMMFCC2                              70.64          69.82
SMMFCC3                              80.38          79.15
SMOSC1                               79.15          77.50
SMOSC2                               68.59          70.51
SMOSC3                               81.34          80.11
SMASE1                               77.78          76.41
SMASE2                               71.74          71.06
SMASE3                               81.21          79.15
SMMFCC1+SMOSC1+SMASE1                84.64          85.08
SMMFCC2+SMOSC2+SMASE2                78.60          79.01
SMMFCC3+SMOSC3+SMASE3                85.32          85.19
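
The MSCs and MSVs compared in Table 3.7 are computed per logarithmically spaced modulation subband, as described in Chapter 2 (cf. Eqs. (27)-(29) for the MFCC case). The sketch below illustrates this computation; it assumes a precomputed averaged modulation spectrogram M and a list of modulation subband boundary indices, and all names are illustrative.

import numpy as np

def modulation_contrast_features(M, subband_edges):
    """Per-subband modulation spectral peaks, valleys and contrasts.

    M:             (D, num_mod_bins) averaged modulation spectrogram, one row per
                   spectral/cepstral feature value
    subband_edges: list of (low, high) modulation-frequency bin index pairs, one
                   pair per logarithmically spaced modulation subband
    Returns (MSC, MSV), each of shape (D, J) with J = len(subband_edges).
    """
    D, J = M.shape[0], len(subband_edges)
    MSC = np.zeros((D, J))
    MSV = np.zeros((D, J))
    for j, (lo, hi) in enumerate(subband_edges):
        band = M[:, lo:hi]            # modulation bins of the j-th subband
        msp = band.max(axis=1)        # modulation spectral peak (MSP)
        MSV[:, j] = band.min(axis=1)  # modulation spectral valley (MSV)
        MSC[:, j] = msp - MSV[:, j]   # modulation spectral contrast (MSC)
    return MSC, MSV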
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification. The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value. For each spectral/cepstral feature set, a
modulation spectrogram is generated by collecting the modulation spectra of
all corresponding feature values. Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically spaced
modulation subband. Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features. The music database employed
in the ISMIR2004 Audio Description Contest, where all music tracks are classified
into six classes, was used for performance comparison. When the modulation spectral
features of MFCC, OSC and NASE are combined, the classification accuracy is
85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox “Features and classifiers for the automatic classification of
musical audio signals” Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J J Burred and A Lerch “A hierarchical approach to automatic musical
genre classification” in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara “Music genre classification with taxonomy” in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet “Representing musical genre: a state of the art”
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley “Beat tracking with a two state model” in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook “Pitch Histogram in Audio and
Symbolic Music Information Retrieval” in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen “A computationally efficient multipitch analysis
model” IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard “A unitary model of pitch perception” Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg “Robust speech recognition using
the modulation spectrogram” Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton “Modulation-scale analysis for
content identification” IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New York: Wiley 2000
[29] C Xu N C Maddage and X Shao “Automatic music classification and
summarization” IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao “Audio signal feature extraction and
classification using local discriminant bases” IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire “A decision-theoretic generalization of
on-line learning and an application to boosting” Journal of Computer and System
Sciences 55 (1) (1997) 119-139
1
Chapter 1
Introduction
11 Motivation
With the development of computer networks it becomes more and more popular
to purchase and download digital music from the Internet However a general music
database often contains millions of music tracks Hence it is very difficult to manage
such a large digital music database For this reason it will be helpful to manage a vast
amount of music tracks when they are properly categorized in advance In general the
retail or online music stores often organize their collections of music tracks by
categories such as genre artist and album Usually the category information of a
music track is manually labeled by experienced managers But to determine the
music genre of a music track by experienced managers is a laborious and
time-consuming work Therefore a number of supervised classification techniques
have been developed for automatic classification of unlabeled music tracks
[1-11]Thus in the study we focus on the music genre classification problem which is
defined as genre labeling of music tracks So that an automatic music genre
classification plays an important and preliminary role in music information retrieval
systems A new album or music track can be assigned to a proper genre in order to
place it in the appropriate section of an online music store or music database
To classify the music genre of a given music track some discriminating audio
features have to be extracted through content-based analysis of the music signal In
addition many studies try to examine a set of classifiers to improve the classification
performance However the improvement is limited and ineffective In fact employing
effective feature sets will have much more useful on the classification accuracy than
2
selecting a specific classifier [12] In the study a novel feature set derived from the
row-based and the column-based modulation spectrum analysis will be proposed for
automatic music genre classification
12 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine
the structure of the taxonomy that music pieces will be classified into However it is
hard to clearly define a universally agreed structure In general exploiting
hierarchical taxonomy structure for music genre classification has some merits (1)
People often prefer to search music by browsing the hierarchical catalogs (2)
Taxonomy structures identify the relationships or dependence between the music
genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification
approach to improve the classification efficiency and accuracy (3) The classification
errors become more acceptable by using taxonomy than direct music genre
classification The coarse-to-fine approach can make the classification errors
concentrate on a given level of the hierarchy
Burred and Lerch [13] have developed a hierarchical taxonomy for music genre
classification as shown in Fig 11 Rather than making a single decision to classify a
given music into one of all music genres (direct approach) the hierarchical approach
makes successive decisions at each branch point of the taxonomy hierarchy
Additionally appropriate and variant features can be employed at each branch point
of the taxonomy Therefore the hierarchical classification approach allows the
managers to trace at which level the classification errors occur frequently Barbedo
and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The
hierarchical structure was constructed in the bottom-up structure in stead of the
3
top-down structure This is because that it is easily to merge leaf classes into the same
parent class in the bottom-up structure Therefore the upper layer can be easily
constructed In their experiment result the classification accuracy which used the
hierarchical bottom-up approach outperforms the top-down approach by about 3 -
5
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach can reduce the time-consuming and
expensive task for manual construction of taxonomies It also helps to look for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
4
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
5
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
])1[(])[(21 1
0summinus
=
minusminus=N
ntt nxsignnxsignZCR
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
][
][
1
1
sum
sum
=
=
times= N
nt
N
nt
t
nM
nnMC
6
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
( )
sum
sum
=
=
timesminus= N
nt
N
ntt
t
nM
nMCnSB
1
1
2
][
][
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency Rt below which 85 of the magnitude distribution is concentrated
sum sum=
minus
=
timesletR
k
N
k
kSkS0
1
0
][850][
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
])[][(1
0
21sum
minus
=minusminus=
N
kttt kNkNSF
where Nt[n] and Nt-1[n] are the normalized magnitude spectrum of the t-th frame and
the (t-1)-th frame respectively
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
7
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope(NASE) NASE was referred to the MPEG-7
standard[17] First the audio spectral envelope(ASE) is obtained from the sum of the
log power spectrum in each logarithmic subband Then each ASE coefficient is
normalized with the Root Mean Square(RMS) energy yielding a normalized version
of the ASE called NASE
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melodyharmony analyzers The main difference is that no fundamental frequency
8
chord key or other high-level feature has to determine in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let xi = [xi[0] xi[1] hellip xi[D-1]]T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation is
calculated as follow
summinus
=
=1
0][1][
T
ii dx
Tdμ 10 minuslele Dd
211
0
2 ]])[][(1[][ summinus
=
minus=T
ii ddx
Td μσ 10 minuslele Dd
where T is the number of frames of the input signal This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
9
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p times D times D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
10
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complexity to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
doesnrsquot consider the class-wise differences
122 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
11
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical framed based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance withwithout decision tree
classifier of Gaussian classifier GMM with three components and LDA In their
experiment the feature vector with GMM classifier and decision tree classifier has
best accuracy 8279
Xu et al [29] applied SVM to discriminate between pure music and vocal one
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] use some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 930 for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
12
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can up to 8860 when the frame length is 30s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91 (artificial and natural
sounds) for the second level is 99 (instrumental and automobile human and
nonhuman) and 95 for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
13
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband the WPDT decomposes both
bands at each level
Bergatra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
14
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
]1[ˆ][][ˆ minustimesminus= nsansns (1)
where s[n] is the current sample and s[nminus1] is the previous sample a typical
value for is 095 a
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
][ ][ˆ][~ nwnsns ii = 10 minuslele Nn (2)
where the Hamming window function w[n] is defined as
)1
2cos( 460540][minus
minus=N
nnw π 10 minuslele Nn (3)
15
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
][~][1
0
2
summinus
=
minus=
N
n
nNkj
ii enskXπ
10 minuslele Nk (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (5)
Where B is the total number of filters (B is 25 in the study) Ibl and Ibh
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter Ai[k] is the squared amplitude of Xi[k] that is
|][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (6)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
0 ))50(cos( ))(1(log)(1
010 Llb
BlbElMFCC
B
bi ltle++= sum
minus
=
π (7)
where L is the length of MFCC feature vector (L is 20 in the study)
16
Therefore the MFCC feature vector can be represented as follows
xMFCC = [MFCC(0) MFCC(1) hellip MFCC(L-1)]T (8)
Fig 21 The flowchart for computing MFCC
Pre-emphasis
Input Signal
Framing
Windowing
FFT
Mel-scale band-pass filtering
DCT
MFCC
17
Table 21 The range of each triangular band-pass filter
Filter number Frequency interval (Hz) 0 (0 200] 1 (100 300] 2 (200 400] 3 (300 500] 4 (400 600] 5 (500 700] 6 (600 800] 7 (700 900] 8 (800 1000] 9 (900 1149] 10 (1000 1320] 11 (1149 1516] 12 (1320 1741] 13 (1516 2000] 14 (1741 2297] 15 (2000 2639] 16 (2297 3031] 17 (2639 3482] 18 (3031 4000] 19 (3482 4595] 20 (4000 5278] 20 (4595 6063] 22 (5278 6964] 23 (6063 8000] 24 (6964 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
18
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (9)
where B is the number of subbands Ibl and Ibh denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter
Ai[k] is the squared amplitude of Xi[k] that is |][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
19
)1log()(1
sum=
=bN
iib
b
MN
bPeakα
α (11)
)1log()(1
1sum=
+minus=b
b
N
iiNb
b
MN
bValleyα
α (12)
where α is a neighborhood factor (α is 02 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
)( )()( bValleybPeakbSC minus= (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0) hellip Valley(B-1) SC(0) hellip SC(B-1)]T (14)
Input Signal
Framing
Octave scale filtering
PeakValley Selection
Spectral Contrast
OSC
FFT
Fig 22 The flowchart for computing OSC
20
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 441 kHz)
Filter number Frequency interval (Hz)0 [0 0] 1 (0 100] 2 (100 200] 3 (200 400] 4 (400 800] 5 (800 1600] 6 (1600 3200] 7 (3200 6400] 8 (6400 12800] 9 (12800 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum notated X(k) 1 le k le N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
20|)(|2
2 0|)(|1
)(2
2
⎪⎪⎩
⎪⎪⎨
⎧
ltltsdot
=sdot=
NkkXEN
NkkXENkP
w
w (15)
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a
spectrum of 8 octave interval (see Fig24) The NASE scale filtering
operation can be described as follows(see Table 23)
)()(
sum=
=hb
lb
I
Ikii kPbASE 120 0 minusleleltle NkBb
(17)
where B is the number of logarithmic subbands within the frequency range
[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of
the frequency subbands ranging from 116 of an octave to 8 octaves(B=16
r=12 in the study)
(18) 34 octaves 2 leleminus= jr j
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
22
spectrum coefficients within this subband
(20) 10 )()(
+lele= sum=
BbkPbASEhb
lb
I
Ik
Each ASE coefficient is then converted to the decibel scale
10 ))((log 10)( 10 +lele= BbbASEbASEdB (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
10 )()( +lele= BbR
bASEbNASE dB (22)
where the RMS-norm gain value R is defined as
))((1
0
2sum+
=
=B
bdB bASER (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R NASE(0) NASE(1) hellip NASE(B+1)]T (24)
23
Framing
Input Signal
Windowing
FFT
Normalized Audio Spectral Envelope
NASE
Subband Decomposition
Fig 23 The flowchart for computing NASE
625 125 250 500 1K 2K 4K 8K 16K
884 1768 3536 7071 14142 28284 56569 113137
1 coeff 16 coeffs 1 coeff
loEdge hiEdge
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 12
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \cdot W/2 + n}[d] \, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \quad 0 \le d < D    (30)
where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:
\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, d) \right|, \quad 0 \le m < W, \quad 0 \le d < D    (31)
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In the study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:
MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (33)
where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV will reflect the modulation spectral contrast distribution:
MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)    (34)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig 27 shows the flowchart for extracting MASE, and the detailed steps are described below.
Step 1 Framing and NASE Extraction
Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \cdot W/2 + n}[d] \, e^{-j 2\pi n m / W}, \quad 0 \le m < W, \quad 0 \le d < D    (35)
where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:
\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, d) \right|, \quad 0 \le m < W, \quad 0 \le d < D    (36)
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands. In the study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:
MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (38)
where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV will reflect the modulation spectral contrast distribution:
MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)    (39)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.
Fig 27 the flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along the time trajectory of each feature value → windowing/averaging of the modulation spectra → contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0   [0, 2)      [0, 0.33)
1   [2, 4)      [0.33, 0.66)
2   [4, 8)      [0.66, 1.32)
3   [8, 16)     [1.32, 2.64)
4   [16, 32)    [2.64, 5.28)
5   [32, 64)    [5.28, 10.56)
6   [64, 128)   [10.56, 21.12)
7   [128, 256)  [21.12, 42.24]
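The index ranges of Table 24 can be related to modulation frequencies in Hz through the frame rate of the short-term analysis. The small sketch below assumes roughly 86 analysis frames per second (e.g. a 512-sample hop at 44.1 kHz), which reproduces the intervals of Table 24 only approximately; the exact frame rate is an assumption here:

frame_rate = 86.0                    # analysis frames per second (assumed)
W = 512                              # texture-window length in frames
bin_width = frame_rate / W           # modulation-frequency resolution, ~0.17 Hz per bin

for lo, hi in [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]:
    print(f"bins [{lo:3d}, {hi:3d})  ->  [{lo * bin_width:5.2f}, {hi * bin_width:5.2f}) Hz")
# e.g. bins [0, 2) -> [0.00, 0.34) Hz ... bins [128, 256) -> [21.50, 43.00) Hz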
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband of different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:
\mu_{MSC,row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC,row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC,row}^{MFCC}(l) \right)^2 \right)^{1/2}    (41)

\mu_{MSV,row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV,row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV,row}^{MFCC}(l) \right)^2 \right)^{1/2}    (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
f_{row}^{MFCC} = [\mu_{MSC,row}^{MFCC}(0), \sigma_{MSC,row}^{MFCC}(0), \mu_{MSV,row}^{MFCC}(0), \sigma_{MSV,row}^{MFCC}(0), \ldots, \mu_{MSC,row}^{MFCC}(L-1), \sigma_{MSC,row}^{MFCC}(L-1), \mu_{MSV,row}^{MFCC}(L-1), \sigma_{MSV,row}^{MFCC}(L-1)]^T    (44)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:
\mu_{MSC,col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC,col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC,col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

\mu_{MSV,col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV,col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV,col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{MFCC} = [\mu_{MSC,col}^{MFCC}(0), \sigma_{MSC,col}^{MFCC}(0), \mu_{MSV,col}^{MFCC}(0), \sigma_{MSV,col}^{MFCC}(0), \ldots, \mu_{MSC,col}^{MFCC}(J-1), \sigma_{MSC,col}^{MFCC}(J-1), \mu_{MSV,col}^{MFCC}(J-1), \sigma_{MSV,col}^{MFCC}(J-1)]^T    (49)
If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:
f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)
In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
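A compact numpy sketch of this aggregation step is shown below (illustrative only; it groups the statistics by type instead of interleaving them per coefficient as in eqs. (44) and (49), which only changes the ordering of the entries, and the function name and array layout are assumptions):

import numpy as np

def statistical_aggregation(MSC, MSV):
    # MSC, MSV: (J, L) matrices from the contrast/valley step.
    # Returns a vector of length 4L + 4J (e.g. 4*20 + 4*8 = 112 for SMMFCC).
    parts = []
    for M in (MSC, MSV):
        # "row-based" statistics: per feature value, mean/std across the J subbands
        parts += [M.mean(axis=0), M.std(axis=0)]      # two length-L vectors each
    for M in (MSC, MSV):
        # "column-based" statistics: per modulation subband, mean/std across the L values
        parts += [M.mean(axis=1), M.std(axis=1)]      # two length-J vectors each
    return np.concatenate(parts)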
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC,row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC,row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu_{MSC,row}^{OSC}(d) \right)^2 \right)^{1/2}    (52)

\mu_{MSV,row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV,row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu_{MSV,row}^{OSC}(d) \right)^2 \right)^{1/2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC,row}^{OSC}(0), \sigma_{MSC,row}^{OSC}(0), \mu_{MSV,row}^{OSC}(0), \sigma_{MSV,row}^{OSC}(0), \ldots, \mu_{MSC,row}^{OSC}(D-1), \sigma_{MSC,row}^{OSC}(D-1), \mu_{MSV,row}^{OSC}(D-1), \sigma_{MSV,row}^{OSC}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC,col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu_{MSC,col}^{OSC}(j) \right)^2 \right)^{1/2}    (57)

\mu_{MSV,col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV,col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu_{MSV,col}^{OSC}(j) \right)^2 \right)^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC,col}^{OSC}(0), \sigma_{MSC,col}^{OSC}(0), \mu_{MSV,col}^{OSC}(0), \sigma_{MSV,col}^{OSC}(0), \ldots, \mu_{MSC,col}^{OSC}(J-1), \sigma_{MSC,col}^{OSC}(J-1), \mu_{MSV,col}^{OSC}(J-1), \sigma_{MSV,col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC,row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC,row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu_{MSC,row}^{NASE}(d) \right)^2 \right)^{1/2}    (63)

\mu_{MSV,row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV,row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu_{MSV,row}^{NASE}(d) \right)^2 \right)^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC,row}^{NASE}(0), \sigma_{MSC,row}^{NASE}(0), \mu_{MSV,row}^{NASE}(0), \sigma_{MSV,row}^{NASE}(0), \ldots, \mu_{MSC,row}^{NASE}(D-1), \sigma_{MSC,row}^{NASE}(D-1), \mu_{MSV,row}^{NASE}(D-1), \sigma_{MSV,row}^{NASE}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC,col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu_{MSC,col}^{NASE}(j) \right)^2 \right)^{1/2}    (68)

\mu_{MSV,col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV,col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu_{MSV,col}^{NASE}(j) \right)^2 \right)^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC,col}^{NASE}(0), \sigma_{MSC,col}^{NASE}(0), \mu_{MSV,col}^{NASE}(0), \sigma_{MSV,col}^{NASE}(0), \ldots, \mu_{MSC,col}^{NASE}(J-1), \sigma_{MSC,col}^{NASE}(J-1), \mu_{MSV,col}^{NASE}(J-1), \sigma_{MSV,col}^{NASE}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 the row-based modulation spectral feature values (for each feature dimension, i.e. each row of the MSC/MSV matrices, the mean μ and standard deviation σ are computed across the modulation-frequency subbands)

Fig 29 the column-based modulation spectral feature values (for each modulation subband, i.e. each column of the MSC/MSV matrices, the mean μ and standard deviation σ are computed across the feature dimensions)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}    (73)
where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:
\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C    (74)
where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature values over all training music signals:
f_{\max}(m) = \max_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m)

f_{\min}(m) = \min_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m)    (75)
where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
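For illustration, a minimal sketch of the training-side computation of eqs. (73)-(75) is given below (the array names and layout are assumptions; the same f_min/f_max learned on the training set are reused to normalize test vectors):

import numpy as np

def minmax_stats(train_feats):
    # train_feats: (N, dim) matrix of all training feature vectors.  Eq. (75).
    return train_feats.min(axis=0), train_feats.max(axis=0)

def linear_normalize(f, f_min, f_max):
    # Eq. (74): map each feature value into [0, 1] using the training min/max.
    return (f - f_min) / (f_max - f_min)

def genre_representatives(train_feats, labels, C):
    # Eq. (73) followed by eq. (74): one normalized representative vector per genre.
    f_min, f_max = minmax_stats(train_feats)
    reps = [linear_normalize(train_feats[labels == c].mean(axis=0), f_min, f_max)
            for c in range(C)]
    return np.stack(reps), f_min, f_max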
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)
where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)
where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:
J_F(A) = \mathrm{tr}\left( (A^T S_W A)^{-1} (A^T S_B A) \right)    (78)
From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the corresponding eigenvalues; thus S_W \Phi = \Phi \Lambda. Each training vector x is then whitening transformed by \Phi \Lambda^{-1/2}:
x_w = (\Phi \Lambda^{-1/2})^T x    (79)
It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally the optimal whitened LDA transformation matrix A_{WLDA} is defined as
A_{WLDA} = \Phi \Lambda^{-1/2} \Psi    (80)
A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x    (81)
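The whitened LDA computation of eqs. (76)-(81) can be sketched as follows (illustrative numpy code under the assumption that S_W is nonsingular; this is not the thesis implementation):

import numpy as np

def whitened_lda(X, y, C):
    # X: (N, H) training vectors, y: integer genre labels in 0..C-1.
    # Returns A_WLDA of shape (H, C-1); a vector is reduced by y = A_WLDA.T @ x (eq. 81).
    H = X.shape[1]
    x_bar = X.mean(axis=0)
    S_W = np.zeros((H, H))
    S_B = np.zeros((H, H))
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)                        # eq. (76)
        S_B += len(Xc) * np.outer(mc - x_bar, mc - x_bar)     # eq. (77)
    lam, Phi = np.linalg.eigh(S_W)                            # S_W = Phi Lambda Phi^T
    white = Phi @ np.diag(lam ** -0.5)                        # whitening matrix Phi Lambda^{-1/2}
    S_B_w = white.T @ S_B @ white                             # whitened between-class scatter
    eigval, Psi = np.linalg.eigh(S_B_w)
    keep = np.argsort(eigval)[::-1][:C - 1]                   # the (C-1) largest eigenvalues
    return white @ Psi[:, keep]                               # eq. (80): A_WLDA = Phi Lambda^{-1/2} Psi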
23 Music Genre Classification Phase
In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
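A short sketch of the test phase is given below (illustrative; it assumes the normalization statistics, the A_WLDA matrix, and the genre centroids of eq. (82) have already been computed from the training set):

import numpy as np

def classify_track(f, f_min, f_max, A_wlda, centroids):
    # f: raw feature vector of the input track; centroids: (C, h) matrix of eq. (82).
    # Returns the index s of the identified music genre (eq. 83).
    f_hat = (f - f_min) / (f_max - f_min)          # same linear normalization as training
    y = A_wlda.T @ f_hat                           # whitened LDA projection, eq. (81)
    dists = np.linalg.norm(centroids - y, axis=1)  # Euclidean distance to each genre centroid
    return int(np.argmin(dists))                   # eq. (83)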
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music genre.
Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
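As a worked example of eq. (84): using the test-set class sizes given above as P_c together with the per-genre accuracies on the diagonal of Table 36(d) (reported later in this chapter) reproduces the overall 85.32% figure:

counts = [320, 114, 26, 45, 102, 122]                 # test tracks per genre (729 in total)
ca_c   = [93.75, 83.33, 76.92, 77.78, 77.45, 76.23]   # per-genre accuracy (%), Table 36(d)
overall = sum(n / sum(counts) * a for n, a in zip(counts, ca_c))
print(round(overall, 2))                              # -> 85.32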
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors
Feature Set                     CA
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64
Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       275        0         2       0         1       19
Electronic      0       91         0       1         7        6
Jazz            6        0        18       0         0        4
MetalPunk       2        3         0      36        20        4
PopRock         4       12         5       8        70       14
World          33        8         1       0         4       75
Total         320      114        26      45       102      122

(a) SMMFCC1, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94      0.00      7.69     0.00      0.98    15.57
Electronic    0.00     79.82      0.00     2.22      6.86     4.92
Jazz          1.88      0.00     69.23     0.00      0.00     3.28
MetalPunk     0.63      2.63      0.00    80.00     19.61     3.28
PopRock       1.25     10.53     19.23    17.78     68.63    11.48
World        10.31      7.02      3.85     0.00      3.92    61.48

(b) SMOSC1, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       292        1         1       0         2       10
Electronic      1       89         1       2        11       11
Jazz            4        0        19       1         1        6
MetalPunk       0        5         0      32        21        3
PopRock         0       13         3      10        61        8
World          23        6         2       0         6       84
Total         320      114        26      45       102      122

(b) SMOSC1, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25      0.88      3.85     0.00      1.96     8.20
Electronic    0.31     78.07      3.85     4.44     10.78     9.02
Jazz          1.25      0.00     73.08     2.22      0.98     4.92
MetalPunk     0.00      4.39      0.00    71.11     20.59     2.46
PopRock       0.00     11.40     11.54    22.22     59.80     6.56
World         7.19      5.26      7.69     0.00      5.88    68.85

(c) SMASE1, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       286        3         1       0         3       18
Electronic      0       87         1       1         9        5
Jazz            5        4        17       0         0        9
MetalPunk       0        4         1      36        18        4
PopRock         1       10         3       7        68       13
World          28        6         3       1         4       73
Total         320      114        26      45       102      122

(c) SMASE1, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38      2.63      3.85     0.00      2.94    14.75
Electronic    0.00     76.32      3.85     2.22      8.82     4.10
Jazz          1.56      3.51     65.38     0.00      0.00     7.38
MetalPunk     0.00      3.51      3.85    80.00     17.65     3.28
PopRock       0.31      8.77     11.54    15.56     66.67    10.66
World         8.75      5.26     11.54     2.22      3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        0         1       0         0        9
Electronic      0       96         1       1         9        9
Jazz            2        1        21       0         0        1
MetalPunk       0        1         0      34         8        1
PopRock         1        9         2       9        80       16
World          17        7         1       1         5       86
Total         320      114        26      45       102      122

(d) SMMFCC1+SMOSC1+SMASE1, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      0.00      3.85     0.00      0.00     7.38
Electronic    0.00     84.21      3.85     2.22      8.82     7.38
Jazz          0.63      0.88     80.77     0.00      0.00     0.82
MetalPunk     0.00      0.88      0.00    75.56      7.84     0.82
PopRock       0.31      7.89      7.69    20.00     78.43    13.11
World         5.31      6.14      3.85     2.22      4.90    70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which is different from the row-based case. As in the row-based case, the combined feature vector again achieves the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors
Feature Set                     CA
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60
Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       272        1         1       0         6       22
Electronic      0       84         0       2         8        4
Jazz           13        1        19       1         2       19
MetalPunk       2        7         0      39        30        4
PopRock         0       11         3       3        47       19
World          33       10         3       0         9       54
Total         320      114        26      45       102      122

(a) SMMFCC2, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.00      0.88      3.85     0.00      5.88    18.03
Electronic    0.00     73.68      0.00     4.44      7.84     3.28
Jazz          4.06      0.88     73.08     2.22      1.96    15.57
MetalPunk     0.63      6.14      0.00    86.67     29.41     3.28
PopRock       0.00      9.65     11.54     6.67     46.08    15.57
World        10.31      8.77     11.54     0.00      8.82    44.26

(b) SMOSC2, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       262        2         0       0         3       33
Electronic      0       83         0       1         9        6
Jazz           17        1        20       0         6       20
MetalPunk       1        5         0      33        21        2
PopRock         0       17         4      10        51       10
World          40        6         2       1        12       51
Total         320      114        26      45       102      122

(b) SMOSC2, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      81.88      1.75      0.00     0.00      2.94    27.05
Electronic    0.00     72.81      0.00     2.22      8.82     4.92
Jazz          5.31      0.88     76.92     0.00      5.88    16.39
MetalPunk     0.31      4.39      0.00    73.33     20.59     1.64
PopRock       0.00     14.91     15.38    22.22     50.00     8.20
World        12.50      5.26      7.69     2.22     11.76    41.80

(c) SMASE2, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       277        0         0       0         2       29
Electronic      0       83         0       1         5        2
Jazz            9        3        17       1         2       15
MetalPunk       1        5         1      35        24        7
PopRock         2       13         1       8        57       15
World          31       10         7       0        12       54
Total         320      114        26      45       102      122

(c) SMASE2, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      86.56      0.00      0.00     0.00      1.96    23.77
Electronic    0.00     72.81      0.00     2.22      4.90     1.64
Jazz          2.81      2.63     65.38     2.22      1.96    12.30
MetalPunk     0.31      4.39      3.85    77.78     23.53     5.74
PopRock       0.63     11.40      3.85    17.78     55.88    12.30
World         9.69      8.77     26.92     0.00     11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       289        5         0       0         3       18
Electronic      0       89         0       2         4        4
Jazz            2        3        19       0         1       10
MetalPunk       2        2         0      38        21        2
PopRock         0       12         5       4        61       11
World          27        3         2       1        12       77
Total         320      114        26      45       102      122

(d) SMMFCC2+SMOSC2+SMASE2, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      90.31      4.39      0.00     0.00      2.94    14.75
Electronic    0.00     78.07      0.00     4.44      3.92     3.28
Jazz          0.63      2.63     73.08     0.00      0.98     8.20
MetalPunk     0.63      1.75      0.00    84.44     20.59     1.64
PopRock       0.00     10.53     19.23     8.89     59.80     9.02
World         8.44      2.63      7.69     2.22     11.76    63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set                     CA
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32
Table 36 Confusion matrices of the combination of the row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        2         1       0         3       19
Electronic      0       86         0       1         7        5
Jazz            2        0        18       0         0        3
MetalPunk       1        4         0      35        18        2
PopRock         1       16         4       8        67       13
World          16        6         3       1         7       80
Total         320      114        26      45       102      122

(a) SMMFCC3, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      1.75      3.85     0.00      2.94    15.57
Electronic    0.00     75.44      0.00     2.22      6.86     4.10
Jazz          0.63      0.00     69.23     0.00      0.00     2.46
MetalPunk     0.31      3.51      0.00    77.78     17.65     1.64
PopRock       0.31     14.04     15.38    17.78     65.69    10.66
World         5.00      5.26     11.54     2.22      6.86    65.57

(b) SMOSC3, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        0         0       0         1       13
Electronic      0       90         1       2         9        6
Jazz            0        0        21       0         0        4
MetalPunk       0        2         0      31        21        2
PopRock         0       11         3      10        64       10
World          20       11         1       2         7       87
Total         320      114        26      45       102      122

(b) SMOSC3, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      0.00      0.00     0.00      0.98    10.66
Electronic    0.00     78.95      3.85     4.44      8.82     4.92
Jazz          0.00      0.00     80.77     0.00      0.00     3.28
MetalPunk     0.00      1.75      0.00    68.89     20.59     1.64
PopRock       0.00      9.65     11.54    22.22     62.75     8.20
World         6.25      9.65      3.85     4.44      6.86    71.31

(c) SMASE3, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       296        2         1       0         0       17
Electronic      1       91         0       1         4        3
Jazz            0        2        19       0         0        5
MetalPunk       0        2         1      34        20        8
PopRock         2       13         4       8        71        8
World          21        4         1       2         7       81
Total         320      114        26      45       102      122

(c) SMASE3, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      92.50      1.75      3.85     0.00      0.00    13.93
Electronic    0.31     79.82      0.00     2.22      3.92     2.46
Jazz          0.00      1.75     73.08     0.00      0.00     4.10
MetalPunk     0.00      1.75      3.85    75.56     19.61     6.56
PopRock       0.63     11.40     15.38    17.78     69.61     6.56
World         6.56      3.51      3.85     4.44      6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3, number of tracks:
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        2         0       0         0        8
Electronic      2       95         0       2         7        9
Jazz            1        1        20       0         0        0
MetalPunk       0        0         0      35        10        1
PopRock         1       10         3       7        79       11
World          16        6         3       1         6       93
Total         320      114        26      45       102      122

(d) SMMFCC3+SMOSC3+SMASE3, percentages (%):
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      1.75      0.00     0.00      0.00     6.56
Electronic    0.63     83.33      0.00     4.44      6.86     7.38
Jazz          0.31      0.88     76.92     0.00      0.00     0.00
MetalPunk     0.00      0.00      0.00    77.78      9.80     0.82
PopRock       0.31      8.77     11.54    15.56     77.45     9.02
World         5.00      5.26     11.54     2.22      5.88    76.23
Conventional methods use the energy of each modulation subband as the feature value. However, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) for each feature set
Feature Set                     MSCs & MSVs    MSE
SMMFCC1                         77.50          72.02
SMMFCC2                         70.64          69.82
SMMFCC3                         80.38          79.15
SMOSC1                          79.15          77.50
SMOSC2                          68.59          70.51
SMOSC3                          81.34          80.11
SMASE1                          77.78          76.41
SMASE2                          71.74          71.06
SMASE3                          81.21          79.15
SMMFCC1+SMOSC1+SMASE1           84.64          85.08
SMMFCC2+SMOSC2+SMASE2           78.60          79.01
SMMFCC3+SMOSC3+SMASE3           85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch, "A hierarchical approach to automatic musical genre classification," in Proc of the 6th Int Conf on Digital Audio Effects, pp 8-11, September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara, "Music genre classification with taxonomy," in Proc of IEEE Int Conf on Acoustics, Speech and Signal Processing, Vol 5, pp 197-200, March 2005
[16] J J Aucouturier and F Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol 32, No 1, pp 83-93, 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley, "Beat tracking with a two state model," in Proc Int Conf on Acoustics, Speech and Signal Processing (ICASSP), 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis, A Ermolinskyi and P Cook, "Pitch Histogram in Audio and Symbolic Music Information Retrieval," in Proc IRCAM, 2002
[21] T Tolonen and M Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol 8, No 6, pp 708-716, November 2000
[22] R Meddis and L O'Mard, "A unitary model of pitch perception," Acoustical Society of America, Vol 102, No 3, pp 1811-1820, September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury, N Morgan and S Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun, Vol 25, No 1, pp 117-132, 1998
[25] S Sukittanon, L E Atlas and J W Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol 52, No 10, pp 3023-3035, October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim, N Moreau, T Sikora, MPEG-7 Audio and Beyond: audio content indexing and retrieval, Wiley, 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu, N C Maddage and X Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol 13, No 3, pp 441-450, May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy, S Krishnan and R K Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, Vol 15, Issue 4, pp 1236-1246, May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra, N Casagrande, D Erhan, D Eck, B Kégl, Aggregate features and Adaboost for music classification, Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire, A decision-theoretic generalization of online learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139
1
Chapter 1
Introduction
11 Motivation
With the development of computer networks it becomes more and more popular
to purchase and download digital music from the Internet However a general music
database often contains millions of music tracks Hence it is very difficult to manage
such a large digital music database For this reason it will be helpful to manage a vast
amount of music tracks when they are properly categorized in advance In general the
retail or online music stores often organize their collections of music tracks by
categories such as genre artist and album Usually the category information of a
music track is manually labeled by experienced managers But to determine the
music genre of a music track by experienced managers is a laborious and
time-consuming work Therefore a number of supervised classification techniques
have been developed for automatic classification of unlabeled music tracks
[1-11]Thus in the study we focus on the music genre classification problem which is
defined as genre labeling of music tracks So that an automatic music genre
classification plays an important and preliminary role in music information retrieval
systems A new album or music track can be assigned to a proper genre in order to
place it in the appropriate section of an online music store or music database
To classify the music genre of a given music track some discriminating audio
features have to be extracted through content-based analysis of the music signal In
addition many studies try to examine a set of classifiers to improve the classification
performance However the improvement is limited and ineffective In fact employing
effective feature sets will have much more useful on the classification accuracy than
2
selecting a specific classifier [12] In the study a novel feature set derived from the
row-based and the column-based modulation spectrum analysis will be proposed for
automatic music genre classification
12 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine
the structure of the taxonomy that music pieces will be classified into However it is
hard to clearly define a universally agreed structure In general exploiting
hierarchical taxonomy structure for music genre classification has some merits (1)
People often prefer to search music by browsing the hierarchical catalogs (2)
Taxonomy structures identify the relationships or dependence between the music
genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification
approach to improve the classification efficiency and accuracy (3) The classification
errors become more acceptable by using taxonomy than direct music genre
classification The coarse-to-fine approach can make the classification errors
concentrate on a given level of the hierarchy
Burred and Lerch [13] have developed a hierarchical taxonomy for music genre
classification as shown in Fig 11 Rather than making a single decision to classify a
given music into one of all music genres (direct approach) the hierarchical approach
makes successive decisions at each branch point of the taxonomy hierarchy
Additionally appropriate and variant features can be employed at each branch point
of the taxonomy Therefore the hierarchical classification approach allows the
managers to trace at which level the classification errors occur frequently Barbedo
and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The
hierarchical structure was constructed in the bottom-up structure in stead of the
3
top-down structure This is because that it is easily to merge leaf classes into the same
parent class in the bottom-up structure Therefore the upper layer can be easily
constructed In their experiment result the classification accuracy which used the
hierarchical bottom-up approach outperforms the top-down approach by about 3 -
5
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach can reduce the time-consuming and
expensive task for manual construction of taxonomies It also helps to look for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
4
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
5
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
])1[(])[(21 1
0summinus
=
minusminus=N
ntt nxsignnxsignZCR
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
][
][
1
1
sum
sum
=
=
times= N
nt
N
nt
t
nM
nnMC
6
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
( )
sum
sum
=
=
timesminus= N
nt
N
ntt
t
nM
nMCnSB
1
1
2
][
][
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency Rt below which 85 of the magnitude distribution is concentrated
sum sum=
minus
=
timesletR
k
N
k
kSkS0
1
0
][850][
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
])[][(1
0
21sum
minus
=minusminus=
N
kttt kNkNSF
where Nt[n] and Nt-1[n] are the normalized magnitude spectrum of the t-th frame and
the (t-1)-th frame respectively
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
7
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope(NASE) NASE was referred to the MPEG-7
standard[17] First the audio spectral envelope(ASE) is obtained from the sum of the
log power spectrum in each logarithmic subband Then each ASE coefficient is
normalized with the Root Mean Square(RMS) energy yielding a normalized version
of the ASE called NASE
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melodyharmony analyzers The main difference is that no fundamental frequency
8
chord key or other high-level feature has to determine in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let xi = [xi[0] xi[1] hellip xi[D-1]]T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation is
calculated as follow
summinus
=
=1
0][1][
T
ii dx
Tdμ 10 minuslele Dd
211
0
2 ]])[][(1[][ summinus
=
minus=T
ii ddx
Td μσ 10 minuslele Dd
where T is the number of frames of the input signal This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
9
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p times D times D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
10
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complexity to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
doesnrsquot consider the class-wise differences
122 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
11
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical frame-based music genre
classification system. In their classification system a majority vote is taken to decide
the final classification. The genres adopted in their music classification system are
Rock, Classical, Heavy Metal, Drum & Bass, Reggae and Jungle. They take MFCC
and OSC as features and compare the performance with/without a decision tree
classifier for a single Gaussian classifier, GMM with three components, and LDA. In their
experiment the feature vector with GMM classifier and decision tree classifier has the
best accuracy of 82.79%
Xu et al [29] applied SVM to discriminate between pure music and vocal music
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] used some low-level features (MFCC, entropy, centroid,
bandwidth, etc.) and LDA for music genre classification. In their system the
classification accuracy is 93.0% for the classification of five music genres: Rock,
Classical, Folk, Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can reach up to 88.60% when the frame length is 30 s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91% (artificial and natural
sounds), for the second level is 99% (instrumental and automobile, human and
nonhuman), and 95% for the third level (drums, flute and piano; aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband, WPT decomposes both
bands at each level
Bergstra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method. Finally, conclusions will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
\hat{s}[n] = s[n] - a \cdot s[n-1]    (1)
where s[n] is the current sample and s[n-1] is the previous sample; a typical
value for a is 0.95
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples). Each pair of consecutive frames overlaps by M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
\tilde{s}_i[n] = \hat{s}_i[n] \cdot w[n],  0 \le n \le N-1    (2)
where the Hamming window function w[n] is defined as
w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right),  0 \le n \le N-1    (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j2\pi nk/N},  0 \le k \le N-1    (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k],  0 \le b < B,  0 \le k \le N/2 - 1    (5)
where B is the total number of filters (B is 25 in the study), I_{b_l} and I_{b_h}
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter, and A_i[k] is the squared amplitude of X_i[k], that is,
A_i[k] = |X_i[k]|^2
I_{b_l} and I_{b_h} are given as
I_{b_l} = (f_{b_l} / f_s) \cdot N,   I_{b_h} = (f_{b_h} / f_s) \cdot N    (6)
where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and
high frequency of the b-th band-pass filter, as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}(1 + E(b)) \cos\left(\frac{\pi l (b + 0.5)}{B}\right),  0 \le l < L    (7)
where L is the length of MFCC feature vector (L is 20 in the study)
Therefore the MFCC feature vector can be represented as follows
x_{MFCC} = [MFCC(0), MFCC(1), …, MFCC(L-1)]^T    (8)
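As an illustration, the following Python/NumPy sketch follows the six steps above. It is not the thesis' original implementation; the frame size N, hop size, pre-emphasis coefficient a = 0.95, L = 20 and the list band_edges (the (low, high) pairs of Table 21) are assumed inputs, and the input signal is assumed to be a 1-D NumPy array.

import numpy as np

def mfcc_frames(signal, fs, band_edges, N=1024, hop=512, a=0.95, L=20):
    # Step 1: pre-emphasis, Eq. (1)
    s = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Steps 2-3: framing and Hamming windowing, Eqs. (2)-(3)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    B = len(band_edges)                                    # B = 25 filters in the study
    l_idx = np.arange(L)[:, None]
    b_idx = np.arange(B)[None, :]
    dct_basis = np.cos(np.pi * l_idx * (b_idx + 0.5) / B)  # cosine terms of Eq. (7)
    mfccs = []
    for start in range(0, len(s) - N + 1, hop):
        frame = s[start:start + N] * w
        A = np.abs(np.fft.fft(frame)) ** 2                 # Step 4, Eq. (4): squared amplitude
        E = np.array([A[int(lo / fs * N):int(hi / fs * N) + 1].sum()
                      for lo, hi in band_edges])           # Step 5, Eqs. (5)-(6)
        mfccs.append(dct_basis @ np.log10(1.0 + E))        # Step 6, Eq. (7)
    return np.array(mfccs)                                 # shape (num_frames, L)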
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter
Filter number   Frequency interval (Hz)
0    (0, 200]
1    (100, 300]
2    (200, 400]
3    (300, 500]
4    (400, 600]
5    (500, 700]
6    (600, 800]
7    (700, 900]
8    (800, 1000]
9    (900, 1149]
10   (1000, 1320]
11   (1149, 1516]
12   (1320, 1741]
13   (1516, 2000]
14   (1741, 2297]
15   (2000, 2639]
16   (2297, 3031]
17   (2639, 3482]
18   (3031, 4000]
19   (3482, 4595]
20   (4000, 5278]
21   (4595, 6063]
22   (5278, 6964]
23   (6063, 8000]
24   (6964, 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then applied to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k],  0 \le b < B,  0 \le k \le N/2 - 1    (9)
where B is the number of subbands, I_{b_l} and I_{b_h} denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter, and
A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2
I_{b_l} and I_{b_h} are given as
I_{b_l} = (f_{b_l} / f_s) \cdot N,   I_{b_h} = (f_{b_h} / f_s) \cdot N    (10)
where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th
subband, where N_b is the number of FFT frequency bins in the b-th subband.
Without loss of generality, let the magnitude spectrum be sorted in a
decreasing order, that is, M_{b,1} \ge M_{b,2} \ge … \ge M_{b,N_b}. The spectral peak and
spectral valley in the b-th subband are then estimated as follows
Peak(b) = \log\left(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i}\right)    (11)
Valley(b) = \log\left(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)    (12)
where α is a neighborhood factor (α is 0.2 in this study). The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
SC(b) = Peak(b) - Valley(b)    (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
x_{OSC} = [Valley(0), …, Valley(B-1), SC(0), …, SC(B-1)]^T    (14)
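A minimal sketch (not taken from the thesis) of the peak/valley computation of Eqs. (11)-(14) for a single frame is given below. The FFT magnitude spectrum mag, the FFT size N, the sampling frequency fs and the octave band edges of Table 22 are assumed inputs; the small epsilon inside the logarithms is only a guard against silent bands and is not part of the original formulas.

import numpy as np

def osc_frame(mag, fs, N, octave_edges, alpha=0.2):
    valleys, contrasts = [], []
    for f_lo, f_hi in octave_edges:
        k_lo, k_hi = int(f_lo / fs * N), int(f_hi / fs * N)
        band = np.sort(mag[k_lo:k_hi + 1])[::-1]       # magnitudes sorted in decreasing order
        n = max(1, int(round(alpha * band.size)))      # alpha * Nb neighbouring bins
        peak = np.log(band[:n].mean() + 1e-12)         # Eq. (11)
        valley = np.log(band[-n:].mean() + 1e-12)      # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                # Eq. (13)
    return np.array(valleys + contrasts)               # Eq. (14): [Valley(0..B-1), SC(0..B-1)]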
Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave scale filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)
Filter number   Frequency interval (Hz)
0   [0, 0]
1   (0, 100]
2   (100, 200]
3   (200, 400]
4   (400, 800]
5   (800, 1600]
6   (1600, 3200]
7   (3200, 6400]
8   (6400, 12800]
9   (12800, 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follows
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N,
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
P(k) = \begin{cases} \frac{1}{N \cdot E_w} |X(k)|^2, & k = 0,\ N/2 \\ \frac{2}{N \cdot E_w} |X(k)|^2, & 0 < k < N/2 \end{cases}    (15)
where E_w is the energy of the Hamming window function w(n) of size N_w
E_w = \sum_{n=0}^{N_w-1} |w(n)|^2    (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a
spectrum of 8 octaves (see Fig 24). The NASE scale filtering
operation can be described as follows (see Table 23)
ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k),  0 \le b < B,  0 \le k \le N/2 - 1    (17)
where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of
the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in the study)
r = 2^j octaves,  -4 \le j \le 3    (18)
I_{b_l} and I_{b_h} are the low-frequency index and high-frequency index of the b-th
band-pass filter, given as
I_{b_l} = (f_{b_l} / f_s) \cdot N,   I_{b_h} = (f_{b_h} / f_s) \cdot N    (19)
where f_s is the sampling frequency, and f_{b_l} and f_{b_h} are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband
ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k),  0 \le b \le B+1    (20)
Each ASE coefficient is then converted to the decibel scale
ASE_{dB}(b) = 10 \log_{10}(ASE(b)),  0 \le b \le B+1    (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
NASE(b) = \frac{ASE_{dB}(b)}{R},  0 \le b \le B+1    (22)
where the RMS-norm gain value R is defined as
R = \sqrt{\sum_{b=0}^{B+1} (ASE_{dB}(b))^2}    (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
x_{NASE} = [R, NASE(0), NASE(1), …, NASE(B+1)]^T    (24)
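The following sketch (an assumed implementation, not the author's code) computes the NASE vector of one frame following Eqs. (15)-(24); band_edges is assumed to hold the 18 (low, high) pairs of Table 23, and a small epsilon guards the logarithm for empty subbands.

import numpy as np

def nase_frame(frame, fs, band_edges):
    Nw = len(frame)
    w = np.hamming(Nw)
    Ew = np.sum(w ** 2)                                   # Eq. (16)
    X = np.fft.fft(frame * w)
    N = len(X)
    P = (np.abs(X) ** 2) / (N * Ew)                       # Eq. (15), k = 0 and k = N/2
    P[1:N // 2] *= 2.0                                    # Eq. (15), 0 < k < N/2
    ase = np.array([P[int(lo / fs * N):int(hi / fs * N) + 1].sum()
                    for lo, hi in band_edges])            # Eq. (20)
    ase_db = 10.0 * np.log10(ase + 1e-12)                 # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                      # Eq. (23)
    return np.concatenate(([R], ase_db / R))              # Eqs. (22), (24): [R, NASE(0..B+1)]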
Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 subband coefficients between loEdge and hiEdge, and one coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter
Filter number   Frequency interval (Hz)
0    (0, 62]
1    (62, 88]
2    (88, 125]
3    (125, 176]
4    (176, 250]
5    (250, 353]
6    (353, 500]
7    (500, 707]
8    (707, 1000]
9    (1000, 1414]
10   (1414, 2000]
11   (2000, 2828]
12   (2828, 4000]
13   (4000, 5656]
14   (5656, 8000]
15   (8000, 11313]
16   (11313, 16000]
17   (16000, 22050]
214 Modulation Spectral Analysis
MFCC, OSC and NASE capture only short-term frame-based characteristics of
audio signals. In order to capture the time-varying behavior of the music signals, we
employ modulation spectral analysis on MFCC, OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame.
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{(t \cdot W/2)+n}[l] \, e^{-j2\pi nm/W},  0 \le m < W,  0 \le l < L    (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study, W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|,  0 \le m < W,  0 \le l < L    (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (27)
MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)    (28)
where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 \le j < J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)    (29)
As a result, all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MMFCC is 2×20×8 = 320
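The sketch below illustrates the modulation spectral analysis of Eqs. (25)-(29). It is a simplified assumed implementation: the per-frame features are stored as a (num_frames, L) array, texture windows hold W = 512 frames with 50% overlap as stated above, at least one full texture window is assumed to be available, and the modulation subband index ranges follow Table 24. The same routine applies unchanged to the OSC and NASE trajectories of the next two subsections.

import numpy as np

# modulation frequency index ranges of Table 24
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_msc_msv(features, W=512):
    """features: (num_frames, L) array; returns MSC and MSV matrices of shape (J, L)."""
    mags = []
    for start in range(0, features.shape[0] - W + 1, W // 2):   # 50% overlapped texture windows
        seg = features[start:start + W]
        mags.append(np.abs(np.fft.fft(seg, axis=0)))            # magnitude of Eq. (25), per feature value
    M = np.mean(mags, axis=0)                                   # Eq. (26): time-averaged modulation spectrogram
    msc, msv = [], []
    for lo, hi in SUBBANDS:
        band = M[lo:hi]
        peak = band.max(axis=0)                                 # Eq. (27): MSP
        valley = band.min(axis=0)                               # Eq. (28): MSV
        msc.append(peak - valley)                               # Eq. (29): MSC
        msv.append(valley)
    return np.array(msc), np.array(msv)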
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 \le d < D, be the d-th OSC feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
M_t(m, d) = \sum_{n=0}^{W-1} OSC_{(t \cdot W/2)+n}[d] \, e^{-j2\pi nm/W},  0 \le m < W,  0 \le d < D    (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study, W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W,  0 \le d < D    (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (32)
MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)    (33)
where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 \le j < J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)    (34)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MOSC is 2×20×8 = 320
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 \le d < D, be the d-th NASE feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
M_t(m, d) = \sum_{n=0}^{W-1} NASE_{(t \cdot W/2)+n}[d] \, e^{-j2\pi nm/W},  0 \le m < W,  0 \le d < D    (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study, W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W,  0 \le d < D    (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24).
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (37)
MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)    (38)
where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 \le j < J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)    (39)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MASE is 2×19×8 = 304
Fig 27 The flowchart for extracting MASE (framing → NASE extraction → DFT of each feature trajectory within a texture window → averaged modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0   [0, 2)      [0, 0.33)
1   [2, 4)      [0.33, 0.66)
2   [4, 8)      [0.66, 1.32)
3   [8, 16)     [1.32, 2.64)
4   [16, 32)    [2.64, 5.28)
5   [32, 64)    [5.28, 10.56)
6   [64, 128)   [10.56, 21.12)
7   [128, 256)  [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space, the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 \le l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC-row}^{MFCC}(l) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left(MSC^{MFCC}(j, l) - \mu_{MSC-row}^{MFCC}(l)\right)^2}    (41)

\mu_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV-row}^{MFCC}(l) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left(MSV^{MFCC}(j, l) - \mu_{MSV-row}^{MFCC}(l)\right)^2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as

f_{row}^{MFCC} = [\mu_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), \mu_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), …,
  \mu_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), \mu_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T    (44)

Similarly the modulation spectral feature values derived from the j-th (0 \le j < J)
column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC-col}^{MFCC}(j) = \sqrt{\frac{1}{L} \sum_{l=0}^{L-1} \left(MSC^{MFCC}(j, l) - \mu_{MSC-col}^{MFCC}(j)\right)^2}    (46)

\mu_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV-col}^{MFCC}(j) = \sqrt{\frac{1}{L} \sum_{l=0}^{L-1} \left(MSV^{MFCC}(j, l) - \mu_{MSV-col}^{MFCC}(j)\right)^2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), \mu_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), …,
  \mu_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), \mu_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T    (49)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained

f^{MFCC} = [(f_{row}^{MFCC})^T (f_{col}^{MFCC})^T]^T    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is
80+32 = 112
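A short sketch of this statistical aggregation is given below. It assumes MSC and MSV matrices of shape (J, D) as produced by the earlier modulation-spectrum sketch; the mean and standard-deviation values are the same as in Eqs. (40)-(50), although the elements are grouped rather than interleaved exactly as in Eqs. (44) and (49).

import numpy as np

def aggregate(msc, msv):
    """msc, msv: (J, D) matrices; returns the combined (4D + 4J)-dimensional vector."""
    def stats(mat, axis):
        # np.std uses the population formula, matching Eqs. (41) and (46)
        return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])
    f_row = np.concatenate([stats(msc, axis=0), stats(msv, axis=0)])   # across the J subbands: size 4D
    f_col = np.concatenate([stats(msc, axis=1), stats(msv, axis=1)])   # across the D feature values: size 4J
    return np.concatenate([f_row, f_col])                              # Eq. (50)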
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC-row}^{OSC}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left(MSC^{OSC}(j, d) - \mu_{MSC-row}^{OSC}(d)\right)^2}    (52)

\mu_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV-row}^{OSC}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left(MSV^{OSC}(j, d) - \mu_{MSV-row}^{OSC}(d)\right)^2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_{row}^{OSC} = [\mu_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), \mu_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), …,
  \mu_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), \mu_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T    (55)

Similarly the modulation spectral feature values derived from the j-th (0 \le j < J)
column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC-col}^{OSC}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left(MSC^{OSC}(j, d) - \mu_{MSC-col}^{OSC}(j)\right)^2}    (57)

\mu_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV-col}^{OSC}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left(MSV^{OSC}(j, d) - \mu_{MSV-col}^{OSC}(j)\right)^2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), \mu_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), …,
  \mu_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), \mu_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained

f^{OSC} = [(f_{row}^{OSC})^T (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D+4J. That is, the overall feature dimension of SMOSC is
80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of
the MSC and MSV matrices of MASE can be computed as follows:
\mu_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC-row}^{NASE}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left(MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d)\right)^2}    (63)

\mu_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV-row}^{NASE}(d) = \sqrt{\frac{1}{J} \sum_{j=0}^{J-1} \left(MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d)\right)^2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_{row}^{NASE} = [\mu_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), \mu_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), …,
  \mu_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), \mu_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T    (66)

Similarly the modulation spectral feature values derived from the j-th (0 \le j < J)
column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC-col}^{NASE}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left(MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j)\right)^2}    (68)

\mu_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV-col}^{NASE}(j) = \sqrt{\frac{1}{D} \sum_{d=0}^{D-1} \left(MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j)\right)^2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), \mu_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), …,
  \mu_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), \mu_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained

f^{NASE} = [(f_{row}^{NASE})^T (f_{col}^{NASE})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D+4J. That is, the overall feature dimension of SMASE is
76+32 = 108
Fig 28 The row-based modulation spectral feature values (for each feature dimension, the mean μ_row and standard deviation σ_row are computed across the modulation frequency axis of the MSC/MSV matrix)
Fig 29 The column-based modulation spectral feature values (for each modulation subband, the mean μ_col and standard deviation σ_col are computed across the feature dimension axis of the MSC/MSV matrix)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c
is the number of training music signals belonging to the c-th music genre. Since the
dynamic ranges for variant feature values may be different, a linear normalization is
applied to get the normalized feature vector \hat{f}_c

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},  1 \le c \le C    (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th
representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)
f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [28] aims at improving the classification
accuracy at a lower dimensional feature vector space. LDA deals with the
discrimination between various classes rather than the representation of all classes.
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance. In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h \le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class
c, C is the total number of music classes, and N_c is the number of training vectors
labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)

where \bar{x} is the mean vector of all training vectors. The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion J_F, defined as the ratio of between-class scatter to within-class scatter

J_F(A) = tr((A^T S_W A)^{-1} (A^T S_B A))    (78)

From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23]. First the eigenvalues and corresponding
eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the
orthonormal eigenvectors of S_W and \Lambda the diagonal matrix formed by the
corresponding eigenvalues. Thus S_W \Phi = \Phi \Lambda. Each training vector x is then
whitening transformed by \Phi \Lambda^{-1/2}

w = (\Phi \Lambda^{-1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix
S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}) derived from all the whitened training vectors will
become an identity matrix I. Thus the whitened between-class scatter matrix
S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A
transformation matrix \Psi can be determined by finding the eigenvectors of S_B^w.
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C-1) largest eigenvalues will form the column vectors of the
transformation matrix \Psi. Finally the optimal whitened LDA transformation matrix
A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi    (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

y = A_{WLDA}^T x    (81)
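The following sketch implements the standard whitened-LDA recipe described above (Eqs. (76)-(81)); it is not the thesis' code, and the small eigenvalue floor is an added numerical safeguard.

import numpy as np

def whitened_lda(X, labels):
    """X: (n_samples, H) training matrix; labels: integer class ids; returns A_WLDA of shape (H, C-1)."""
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                        # Eq. (76)
        diff = (mc - mean_all)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)                  # Eq. (77)
    evals, Phi = np.linalg.eigh(Sw)                          # S_W Phi = Phi Lambda
    evals = np.maximum(evals, 1e-10)                         # numerical floor (added assumption)
    Wm = Phi @ np.diag(evals ** -0.5)                        # whitening matrix Phi Lambda^(-1/2)
    Sb_w = Wm.T @ Sb @ Wm                                    # whitened between-class scatter
    evals_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals_b)[::-1][:len(classes) - 1]     # (C-1) largest eigenvalues
    return Wm @ Psi[:, order]                                # Eq. (80): A_WLDA

# Each feature vector x is then reduced by y = A_WLDA.T @ x, as in Eq. (81).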
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA
transformed feature vector. In this study the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
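A sketch of this classification phase (nearest-centroid decision in the transformed space, Eqs. (82)-(83)) is given below; A is the matrix returned by the earlier whitened-LDA sketch and centroids is assumed to hold one mean transformed vector per genre.

import numpy as np

def classify(x_test, A, centroids):
    """centroids: (C, h) matrix of per-genre mean transformed vectors (Eq. (82))."""
    y = A.T @ x_test                                   # Eq. (81)
    d = np.linalg.norm(centroids - y, axis=1)          # Euclidean distances to each centroid
    return int(np.argmin(d))                           # Eq. (83)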
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification. These music tracks are classified into six classes (that is, C = 6):
Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop and World. In summary the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop and 122/122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
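For completeness, a small sketch of Eq. (84) is shown below, assuming per_class_acc[c] and class_counts[c] are known for the C = 6 genres.

import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    p = np.asarray(class_counts, dtype=float)
    p /= p.sum()                                          # P_c: probability of appearance of genre c
    return float(np.sum(p * np.asarray(per_class_acc)))   # Eq. (84)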
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors
Feature Set   CA (%)
SMMFCC1   77.50
SMOSC1   79.15
SMASE1   77.78
SMMFCC1+SMOSC1+SMASE1   84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which is different from the row-based case. As before, the combined feature
vector gets the best performance. Table 34 shows the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors
Feature Set   CA (%)
SMMFCC2   70.64
SMOSC2   68.59
SMASE2   71.74
SMMFCC2+SMOSC2+SMASE2   78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector. In particular, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set   CA (%)
SMMFCC3   80.38
SMOSC3   81.34
SMASE3   81.21
SMMFCC3+SMOSC3+SMASE3   85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) using MSC & MSV features versus the energy of each modulation subband (MSE) as the feature value
Feature Set   MSCs & MSVs   MSE
SMMFCC1   77.50   72.02
SMMFCC2   70.64   69.82
SMMFCC3   80.38   79.15
SMOSC1   79.15   77.50
SMOSC2   68.59   70.51
SMOSC3   81.34   80.11
SMASE1   77.78   76.41
SMASE2   71.74   71.06
SMASE3   81.21   79.15
SMMFCC1+SMOSC1+SMASE1   84.64   85.08
SMMFCC2+SMOSC2+SMASE2   78.60   79.01
SMMFCC3+SMOSC3+SMASE3   85.32   85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
Chapter 1
Introduction
11 Motivation
With the development of computer networks it becomes more and more popular
to purchase and download digital music from the Internet However a general music
database often contains millions of music tracks Hence it is very difficult to manage
such a large digital music database For this reason it will be helpful to manage a vast
amount of music tracks when they are properly categorized in advance In general the
retail or online music stores often organize their collections of music tracks by
categories such as genre artist and album Usually the category information of a
music track is manually labeled by experienced managers But to determine the
music genre of a music track by experienced managers is a laborious and
time-consuming work Therefore a number of supervised classification techniques
have been developed for automatic classification of unlabeled music tracks
[1-11]. Thus, in this study we focus on the music genre classification problem, which is defined as the genre labeling of music tracks. An automatic music genre classification system plays an important and preliminary role in music information retrieval
systems A new album or music track can be assigned to a proper genre in order to
place it in the appropriate section of an online music store or music database
To classify the music genre of a given music track some discriminating audio
features have to be extracted through content-based analysis of the music signal In
addition many studies try to examine a set of classifiers to improve the classification
performance However the improvement is limited and ineffective In fact employing
effective feature sets will have much more useful on the classification accuracy than
2
selecting a specific classifier [12] In the study a novel feature set derived from the
row-based and the column-based modulation spectrum analysis will be proposed for
automatic music genre classification
12 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine
the structure of the taxonomy that music pieces will be classified into However it is
hard to clearly define a universally agreed structure In general exploiting
hierarchical taxonomy structure for music genre classification has some merits (1)
People often prefer to search music by browsing the hierarchical catalogs (2)
Taxonomy structures identify the relationships or dependence between the music
genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification
approach to improve the classification efficiency and accuracy (3) The classification
errors become more acceptable by using taxonomy than direct music genre
classification The coarse-to-fine approach can make the classification errors
concentrate on a given level of the hierarchy
Burred and Lerch [13] have developed a hierarchical taxonomy for music genre
classification as shown in Fig 11 Rather than making a single decision to classify a
given music into one of all music genres (direct approach) the hierarchical approach
makes successive decisions at each branch point of the taxonomy hierarchy
Additionally appropriate and variant features can be employed at each branch point
of the taxonomy Therefore the hierarchical classification approach allows the
managers to trace at which level the classification errors occur frequently Barbedo
and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The
hierarchical structure was constructed in a bottom-up manner instead of a top-down one, because it is easy to merge leaf classes into the same parent class in the bottom-up structure, so the upper layers can be constructed easily. In their experiments, the bottom-up hierarchical approach outperforms the top-down approach by about 3%-5% in classification accuracy.
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach can reduce the time-consuming and
expensive task for manual construction of taxonomies It also helps to look for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
ZCR_t = \frac{1}{2}\sum_{n=0}^{N-1}\left|\operatorname{sign}(x_t[n]) - \operatorname{sign}(x_t[n-1])\right|
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
C_t = \frac{\sum_{n=1}^{N} n \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency Rt below which 85 of the magnitude distribution is concentrated
\sum_{k=0}^{R_t} S[k] = 0.85 \times \sum_{k=0}^{N-1} S[k]
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2
where Nt[n] and Nt-1[n] are the normalized magnitude spectrum of the t-th frame and
the (t-1)-th frame respectively
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the power spectrum within each logarithmic subband, expressed on a logarithmic (decibel) scale. Then each ASE coefficient is normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let xi = [xi[0] xi[1] hellip xi[D-1]]T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation is
calculated as follow
summinus
=
=1
0][1][
T
ii dx
Tdμ 10 minuslele Dd
211
0
2 ]])[][(1[][ summinus
=
minus=T
ii ddx
Td μσ 10 minuslele Dd
where T is the number of frames of the input signal. This statistical method captures neither the relationship between features nor the time-varying behavior of music signals.
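As a small illustration of this integration step, the NumPy sketch below (the function and variable names are ours, purely illustrative) condenses a sequence of short-term feature vectors into a single long-term vector by the mean and standard deviation operation just described.

```python
import numpy as np

def mean_std_integration(frames):
    """frames : (T, D) array of short-term feature vectors x_i.

    Returns the 2*D-dimensional long-term feature [mu, sigma]."""
    mu = frames.mean(axis=0)       # mu[d] over all T frames
    sigma = frames.std(axis=0)     # sigma[d] over all T frames
    return np.concatenate([mu, sigma])
```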
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p × D × D, where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximizing the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same LDA transformation matrix is used for all the classes, which does not take the class-wise differences into account.
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features, rhythmic features, and pitch features with a GMM classifier in their music genre classification system. The genres adopted in their system are Classical, Country, Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet; in Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.
West and Cox [4] constructed a hierarchical frame-based music genre classification system in which a majority vote over the frames decides the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiments, the combination of the GMM classifier and the decision tree achieves the best accuracy of 82.79%.
Xu et al [29] applied SVM to discriminate between pure music and vocal music
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy reaches up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian components
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high-dissimilarity nodes. The experimental results show that when the LDB feature vector is combined with MFCC and LDA analysis is applied, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile, human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband, the DWPT decomposes both
bands at each level
Bergatra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
\hat{s}[n] = s[n] - a \times s[n-1]   (1)

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for a is 0.95.
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
\tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1   (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1   (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1   (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1   (5)

where B is the total number of filters (B is 25 in this study), and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N}   (6)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\left(1 + E_i(b)\right)\cos\left(\frac{\pi\, l\,(b + 0.5)}{B}\right), \quad 0 \le l < L   (7)
where L is the length of MFCC feature vector (L is 20 in the study)
Therefore, the MFCC feature vector can be represented as follows:

x^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T   (8)
Fig 21 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
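To make Steps 1-6 concrete, the following is a minimal NumPy sketch of the per-frame MFCC computation described above. It assumes the caller has already split the signal into frames (Step 2) and passes the band edges of Table 21; the function and variable names are illustrative assumptions rather than part of the thesis.

```python
import numpy as np

def mfcc_frame(frame, band_edges, fs, a=0.95, L=20, n_fft=None):
    """MFCC of a single frame following Steps 1-6.

    band_edges : list of (f_low, f_high) pairs in Hz, e.g. the 25 bands of Table 21.
    """
    N = len(frame)
    n_fft = n_fft or N
    # Step 1: pre-emphasis
    emphasized = np.append(frame[0], frame[1:] - a * frame[:-1])
    # Step 3: Hamming window (framing is assumed done by the caller)
    windowed = emphasized * np.hamming(N)
    # Step 4: spectral analysis and squared amplitude A[k] = |X[k]|^2
    A = np.abs(np.fft.fft(windowed, n_fft)) ** 2
    # Step 5: Mel-scale band-pass filtering, E(b) = sum of A[k] over the b-th band
    E = np.empty(len(band_edges))
    for b, (f_lo, f_hi) in enumerate(band_edges):
        k_lo = int(f_lo / (fs / n_fft))
        k_hi = int(f_hi / (fs / n_fft))
        E[b] = np.sum(A[k_lo:k_hi + 1])
    # Step 6: DCT of log10(1 + E(b)) giving L cepstral coefficients, Eq. (7)
    B = len(band_edges)
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    return np.sum(np.log10(1.0 + E)[None, :] * np.cos(np.pi * l * (b + 0.5) / B), axis=1)
```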
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames, and FFT is then applied to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1   (9)

where B is the number of subbands, and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b,l} and I_{b,h} are given as

I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N}   (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b}. The spectral peak and
spectral valley in the b-th subband are then estimated as follows
Peak(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)   (11)

Valley(b) = \log\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b - i + 1}\right)   (12)
where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)   (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
x^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T   (14)
Fig 22 The flowchart for computing OSC (Input Signal → Framing → FFT → Octave scale filtering → Peak/Valley Selection → Spectral Contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
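The OSC computation of Eqs (9)-(14) can be sketched as follows in NumPy. The band edges are taken from Table 22; the small constant inside the logarithms guards against silent bands and is an implementation assumption, as are the function and variable names.

```python
import numpy as np

# Octave-scale band edges in Hz from Table 22 (sampling rate 44.1 kHz)
OCTAVE_BANDS = [(0, 0), (0, 100), (100, 200), (200, 400), (400, 800),
                (800, 1600), (1600, 3200), (3200, 6400), (6400, 12800),
                (12800, 22050)]

def osc_frame(frame, fs=44100, alpha=0.2):
    """Octave-based spectral contrast (valleys and contrasts) of one frame."""
    n_fft = len(frame)
    mag = np.abs(np.fft.fft(frame))
    valleys, contrasts = [], []
    for f_lo, f_hi in OCTAVE_BANDS:
        k_lo = int(f_lo / (fs / n_fft))
        k_hi = int(f_hi / (fs / n_fft))
        band = np.sort(mag[k_lo:k_hi + 1])[::-1]          # descending magnitudes
        n_alpha = max(1, int(round(alpha * len(band))))    # neighbourhood size alpha*Nb
        peak = np.log(1e-12 + np.mean(band[:n_alpha]))     # average of the largest bins, Eq. (11)
        valley = np.log(1e-12 + np.mean(band[-n_alpha:]))  # average of the smallest bins, Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                     # Eq. (13)
    return np.array(valleys + contrasts)                    # [Valley(0..B-1), SC(0..B-1)], Eq. (14)
```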
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, denoted X(k), 1 \le k \le N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0 \\ \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases}   (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2   (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge"), covering a spectrum of 8 octaves (see Fig 24). The NASE scale filtering operation can be described as follows (see Table 23):
ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B, \; 0 \le k \le N/2 - 1   (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^{j} \text{ octaves}, \quad -4 \le j \le 3   (18)

I_{b,l} and I_{b,h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N}   (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1   (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10 \log_{10}(ASE(b)), \quad 0 \le b \le B+1   (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1   (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1} \left(ASE_{dB}(b)\right)^2}   (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge, and the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
x^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T   (24)
Fig 23 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 in-band coefficients between loEdge and hiEdge = 16 kHz, and one coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
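A rough NumPy sketch of the NASE computation (Eqs 15-24) is given below, assuming a 44.1 kHz signal with loEdge = 62.5 Hz, hiEdge = 16 kHz, and r = 1/2 as in the text; the way the band edges are generated here and the small constant inside the logarithm are implementation assumptions, not part of the thesis.

```python
import numpy as np

def nase_frame(frame, fs=44100, lo_edge=62.5, r=0.5):
    """Normalized audio spectral envelope (NASE) of one audio frame."""
    N = len(frame)
    w = np.hamming(N)
    X = np.fft.fft(frame * w)
    E_w = np.sum(w ** 2)                        # window energy, Eq. (16)
    P = (np.abs(X) ** 2) / (N * E_w)            # normalized power spectrum, Eq. (15)
    P[1:N // 2] *= 2.0                          # double the one-sided bins (0 < k < N/2)
    freqs = np.arange(N // 2) * fs / N

    B = int(8 / r)                               # number of in-band coefficients (16 here)
    edges = lo_edge * 2.0 ** (r * np.arange(B + 1))    # logarithmic band edges up to 16 kHz
    edges = np.concatenate(([0.0], edges, [fs / 2]))   # add the bands below loEdge / above hiEdge
    ase = np.array([np.sum(P[(freqs >= edges[b]) & (freqs < edges[b + 1])])
                    for b in range(B + 2)])      # B+2 ASE coefficients, Eq. (20)
    ase_db = 10.0 * np.log10(ase + 1e-12)        # decibel scale, Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))             # RMS-norm gain value, Eq. (23)
    return np.concatenate(([R], ase_db / R))     # [R, NASE(0..B+1)], Eqs. (22) and (24)
```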
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound over time.
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W + n}[l] \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le l < L   (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \; 0 \le l < L   (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{MFCC}(m, l)   (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{MFCC}(m, l)   (28)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)   (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
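The modulation spectral analysis of Eqs (25)-(29) applies in the same way to MFCC, OSC, and NASE trajectories, so a single sketch covers all three cases. The following NumPy function (names are illustrative; it assumes the track is long enough to contain at least one full texture window) returns the MSC and MSV matrices for a given frame-level feature trajectory.

```python
import numpy as np

# Modulation-subband boundaries as FFT-bin indices (Table 24, W = 512)
SUBBAND_EDGES = [0, 2, 4, 8, 16, 32, 64, 128, 256]

def modulation_contrast(features, W=512):
    """Modulation spectral contrast/valley matrices from a feature trajectory.

    features : (num_frames, D) array of frame-level features (MFCC, OSC or NASE).
    Returns (MSC, MSV), each of shape (J, D) with J = len(SUBBAND_EDGES) - 1.
    """
    num_frames, D = features.shape
    hop = W // 2                                    # 50% overlap between texture windows
    specs = []
    for start in range(0, num_frames - W + 1, hop):
        window = features[start:start + W]          # one texture window, Eq. (25)
        specs.append(np.abs(np.fft.fft(window, axis=0)))
    M = np.mean(specs, axis=0)                      # time-averaged modulation spectrogram, Eq. (26)

    J = len(SUBBAND_EDGES) - 1
    MSC = np.empty((J, D))
    MSV = np.empty((J, D))
    for j in range(J):
        lo, hi = SUBBAND_EDGES[j], SUBBAND_EDGES[j + 1]
        band = M[lo:hi]                             # modulation bins of the j-th subband
        MSP = band.max(axis=0)                      # Eq. (27)
        MSV[j] = band.min(axis=0)                   # Eq. (28)
        MSC[j] = MSP - MSV[j]                       # Eq. (29)
    return MSC, MSV
```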
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 \le d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W + n}[d] \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D   (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D   (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)   (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)   (33)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)   (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 \le d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W + n}[d] \, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D   (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D   (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24). In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)   (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)   (38)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV will reflect the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)   (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT of each feature trajectory → windowing/averaging of the modulation spectrum → contrast/valley determination → MASE)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 \le l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC,row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l)   (40)

\sigma_{MSC,row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC,row}^{MFCC}(l)\right)^2\right)^{1/2}   (41)

\mu_{MSV,row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l)   (42)

\sigma_{MSV,row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV,row}^{MFCC}(l)\right)^2\right)^{1/2}   (43)

Thus, the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC,row}^{MFCC}(0), \sigma_{MSC,row}^{MFCC}(0), \mu_{MSV,row}^{MFCC}(0), \sigma_{MSV,row}^{MFCC}(0), \ldots, \mu_{MSC,row}^{MFCC}(L-1), \sigma_{MSC,row}^{MFCC}(L-1), \mu_{MSV,row}^{MFCC}(L-1), \sigma_{MSV,row}^{MFCC}(L-1)]^T   (44)
Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l)   (45)

\sigma_{MSC,col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC,col}^{MFCC}(j)\right)^2\right)^{1/2}   (46)

\mu_{MSV,col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l)   (47)

\sigma_{MSV,col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV,col}^{MFCC}(j)\right)^2\right)^{1/2}   (48)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC,col}^{MFCC}(0), \sigma_{MSC,col}^{MFCC}(0), \mu_{MSV,col}^{MFCC}(0), \sigma_{MSV,col}^{MFCC}(0), \ldots, \mu_{MSC,col}^{MFCC}(J-1), \sigma_{MSC,col}^{MFCC}(J-1), \mu_{MSV,col}^{MFCC}(J-1), \sigma_{MSV,col}^{MFCC}(J-1)]^T   (49)
If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T   (50)

In summary, the row-based feature values (MSC and MSV statistics) are of size 4L = 4×20 = 80 and the column-based feature values are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
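A compact NumPy sketch of the row- and column-based aggregation is shown below, using the (J × D) MSC/MSV matrices produced by the earlier modulation-contrast sketch. Note that it groups the means and standard deviations rather than interleaving them per index as in Eqs (44) and (49); the resulting vector contains the same 4D + 4J values, just in a different order (an implementation choice, not the thesis ordering).

```python
import numpy as np

def aggregate(MSC, MSV):
    """Row- and column-based statistics of the MSC/MSV matrices (Eqs. 40-50).

    MSC, MSV : (J, D) matrices from the modulation spectral analysis.
    Returns the concatenated feature vector of length 4*D + 4*J.
    """
    # Row-based statistics: mean/std over the J modulation subbands
    # for each feature dimension d (one pair per row of Fig. 28).
    f_row = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                            MSV.mean(axis=0), MSV.std(axis=0)])   # length 4*D
    # Column-based statistics: mean/std over the D feature dimensions
    # for each modulation subband j (one pair per column of Fig. 29).
    f_col = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                            MSV.mean(axis=1), MSV.std(axis=1)])   # length 4*J
    return np.concatenate([f_row, f_col])
```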
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC,row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)

\sigma_{MSC,row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j, d) - \mu_{MSC,row}^{OSC}(d)\right)^2\right)^{1/2}   (52)

\mu_{MSV,row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)

\sigma_{MSV,row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j, d) - \mu_{MSV,row}^{OSC}(d)\right)^2\right)^{1/2}   (54)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC,row}^{OSC}(0), \sigma_{MSC,row}^{OSC}(0), \mu_{MSV,row}^{OSC}(0), \sigma_{MSV,row}^{OSC}(0), \ldots, \mu_{MSC,row}^{OSC}(D-1), \sigma_{MSC,row}^{OSC}(D-1), \mu_{MSV,row}^{OSC}(D-1), \sigma_{MSV,row}^{OSC}(D-1)]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)

\sigma_{MSC,col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j, d) - \mu_{MSC,col}^{OSC}(j)\right)^2\right)^{1/2}   (57)

\mu_{MSV,col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)

\sigma_{MSV,col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j, d) - \mu_{MSV,col}^{OSC}(j)\right)^2\right)^{1/2}   (59)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC,col}^{OSC}(0), \sigma_{MSC,col}^{OSC}(0), \mu_{MSV,col}^{OSC}(0), \sigma_{MSV,col}^{OSC}(0), \ldots, \mu_{MSC,col}^{OSC}(J-1), \sigma_{MSC,col}^{OSC}(J-1), \mu_{MSV,col}^{OSC}(J-1), \sigma_{MSV,col}^{OSC}(J-1)]^T   (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T   (61)

In summary, the row-based feature values are of size 4D = 4×20 = 80 and the column-based feature values are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:
\mu_{MSC,row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC,row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j, d) - \mu_{MSC,row}^{NASE}(d)\right)^2\right)^{1/2}   (63)

\mu_{MSV,row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV,row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j, d) - \mu_{MSV,row}^{NASE}(d)\right)^2\right)^{1/2}   (65)

Thus, the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC,row}^{NASE}(0), \sigma_{MSC,row}^{NASE}(0), \mu_{MSV,row}^{NASE}(0), \sigma_{MSV,row}^{NASE}(0), \ldots, \mu_{MSC,row}^{NASE}(D-1), \sigma_{MSC,row}^{NASE}(D-1), \mu_{MSV,row}^{NASE}(D-1), \sigma_{MSV,row}^{NASE}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC,col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j, d) - \mu_{MSC,col}^{NASE}(j)\right)^2\right)^{1/2}   (68)

\mu_{MSV,col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV,col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j, d) - \mu_{MSV,col}^{NASE}(j)\right)^2\right)^{1/2}   (70)

Thus, the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC,col}^{NASE}(0), \sigma_{MSC,col}^{NASE}(0), \mu_{MSV,col}^{NASE}(0), \sigma_{MSV,col}^{NASE}(0), \ldots, \mu_{MSC,col}^{NASE}(J-1), \sigma_{MSC,col}^{NASE}(J-1), \mu_{MSV,col}^{NASE}(J-1), \sigma_{MSV,col}^{NASE}(J-1)]^T   (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T   (72)

In summary, the row-based feature values are of size 4D = 4×19 = 76 and the column-based feature values are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based modulation spectral feature values: for each feature dimension d, the mean and standard deviation are computed over the MSC(j, d) and MSV(j, d) values of all modulation subbands j

Fig 29 The column-based modulation spectral feature values: for each modulation subband j, the mean and standard deviation are computed over the MSC(j, d) and MSV(j, d) values of all feature dimensions d
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
f_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C   (74)

where C is the number of classes, f_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m), \quad f_{min}(m) = \min_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre
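The training-phase normalization of Eqs (73)-(75) can be sketched as follows in NumPy; the function names and the small constant in the denominator (which guards against constant features) are implementation assumptions.

```python
import numpy as np

def fit_normalizer(training_vectors):
    """Per-dimension min/max over all training vectors, Eq. (75)."""
    return training_vectors.min(axis=0), training_vectors.max(axis=0)

def normalize(f, f_min, f_max):
    """Linear normalization of a feature vector to [0, 1], Eq. (74)."""
    return (f - f_min) / (f_max - f_min + 1e-12)

def class_representatives(training_vectors, labels, num_classes):
    """Average the training vectors of each genre, Eq. (73)."""
    return np.stack([training_vectors[labels == c].mean(axis=0)
                     for c in range(num_classes)])
```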
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
J_F(A) = \mathrm{tr}\left((A^T S_W A)^{-1}(A^T S_B A)\right)   (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues. Thus, S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

x_w = (\Phi\Lambda^{-1/2})^T x   (79)
It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus, the whitened between-class scatter matrix S_B^w = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi\Lambda^{-1/2}\Psi   (80)
A_{WLDA} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x   (81)
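A possible NumPy realization of the whitened LDA transformation (Eqs 76-81) is sketched below. The regularizing constant added to the eigenvalues before inversion is an implementation assumption, and numpy.linalg.eigh is used since both scatter matrices are symmetric.

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """Whitened LDA transformation matrix A_WLDA (Eqs. 76-80).

    X : (n_samples, H) training matrix; y : integer class labels in [0, C).
    Returns A_WLDA of shape (H, C-1).
    """
    overall_mean = X.mean(axis=0)
    H = X.shape[1]
    S_W = np.zeros((H, H))
    S_B = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        diff = Xc - mean_c
        S_W += diff.T @ diff                                   # Eq. (76)
        m = (mean_c - overall_mean)[:, None]
        S_B += len(Xc) * (m @ m.T)                             # Eq. (77)

    # Whitening: S_W Phi = Phi Lambda, whitening matrix Phi Lambda^{-1/2}
    eigvals, Phi = np.linalg.eigh(S_W)
    W = Phi @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, 1e-12)))

    # Whitened between-class scatter and its leading C-1 eigenvectors
    S_Bw = W.T @ S_B @ W
    vals, vecs = np.linalg.eigh(S_Bw)
    Psi = vecs[:, np.argsort(vals)[::-1][:num_classes - 1]]
    return W @ Psi                                             # Eq. (80)

# A feature vector x is then reduced by y = A_WLDA.T @ x      (Eq. 81)
```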
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
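The classification phase therefore reduces to Eqs (82)-(83); a minimal NumPy sketch (function names are illustrative) is given below.

```python
import numpy as np

def genre_centroids(Y_train, labels, num_classes):
    """Per-genre centroids of the transformed training vectors, Eq. (82)."""
    return np.stack([Y_train[labels == c].mean(axis=0) for c in range(num_classes)])

def classify(y, centroids):
    """Nearest-centroid genre decision, Eq. (83)."""
    distances = np.linalg.norm(centroids - y, axis=1)   # Euclidean distance to each genre
    return int(np.argmin(distances))
```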
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
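Eq (84) is simply a class-prior-weighted average; a small sketch (the numbers in the commented example are illustrative placeholders, not results from this thesis):

```python
import numpy as np

def overall_accuracy(per_class_accuracy, class_counts):
    """Class-prior-weighted overall accuracy, Eq. (84).

    per_class_accuracy : CA_c for each genre (fractions in [0, 1]).
    class_counts       : number of test tracks per genre, used to form P_c.
    """
    priors = np.asarray(class_counts) / np.sum(class_counts)   # P_c
    return float(np.sum(priors * np.asarray(per_class_accuracy)))

# Example call with the six test-set sizes quoted above:
# overall_accuracy([0.9, 0.8, 0.8, 0.7, 0.7, 0.6], [320, 114, 26, 45, 102, 122])
```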
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64
Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each sub-table, the first matrix gives track counts (columns: actual genre) and the second gives the corresponding percentages.

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       275        0         2       0         1       19
Electronic      0       91         0       1         7        6
Jazz            6        0        18       0         0        4
MetalPunk       2        3         0      36        20        4
PopRock         4       12         5       8        70       14
World          33        8         1       0         4       75
Total         320      114        26      45       102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.94      0.00     7.69     0.00      0.98    15.57
Electronic    0.00     79.82     0.00     2.22      6.86     4.92
Jazz          1.88      0.00    69.23     0.00      0.00     3.28
MetalPunk     0.63      2.63     0.00    80.00     19.61     3.28
PopRock       1.25     10.53    19.23    17.78     68.63    11.48
World        10.31      7.02     3.85     0.00      3.92    61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       292        1         1       0         2       10
Electronic      1       89         1       2        11       11
Jazz            4        0        19       1         1        6
MetalPunk       0        5         0      32        21        3
PopRock         0       13         3      10        61        8
World          23        6         2       0         6       84
Total         320      114        26      45       102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      91.25      0.88     3.85     0.00      1.96     8.20
Electronic    0.31     78.07     3.85     4.44     10.78     9.02
Jazz          1.25      0.00    73.08     2.22      0.98     4.92
MetalPunk     0.00      4.39     0.00    71.11     20.59     2.46
PopRock       0.00     11.40    11.54    22.22     59.80     6.56
World         7.19      5.26     7.69     0.00      5.88    68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       286        3         1       0         3       18
Electronic      0       87         1       1         9        5
Jazz            5        4        17       0         0        9
MetalPunk       0        4         1      36        18        4
PopRock         1       10         3       7        68       13
World          28        6         3       1         4       73
Total         320      114        26      45       102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      89.38      2.63     3.85     0.00      2.94    14.75
Electronic    0.00     76.32     3.85     2.22      8.82     4.10
Jazz          1.56      3.51    65.38     0.00      0.00     7.38
MetalPunk     0.00      3.51     3.85    80.00     17.65     3.28
PopRock       0.31      8.77    11.54    15.56     66.67    10.66
World         8.75      5.26    11.54     2.22      3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        0         1       0         0        9
Electronic      0       96         1       1         9        9
Jazz            2        1        21       0         0        1
MetalPunk       0        1         0      34         8        1
PopRock         1        9         2       9        80       16
World          17        7         1       1         5       86
Total         320      114        26      45       102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     3.85     0.00      0.00     7.38
Electronic    0.00     84.21     3.85     2.22      8.82     7.38
Jazz          0.63      0.88    80.77     0.00      0.00     0.82
MetalPunk     0.00      0.88     0.00    75.56      7.84     0.82
PopRock       0.31      7.89     7.69    20.00     78.43    13.11
World         5.31      6.14     3.85     2.22      4.90    70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which is different from the row-based case. As with the row-based features, the combined feature vector achieves the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each sub-table, the first matrix gives track counts (columns: actual genre) and the second gives the corresponding percentages.

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       272        1         1       0         6       22
Electronic      0       84         0       2         8        4
Jazz           13        1        19       1         2       19
MetalPunk       2        7         0      39        30        4
PopRock         0       11         3       3        47       19
World          33       10         3       0         9       54
Total         320      114        26      45       102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.00      0.88     3.85     0.00      5.88    18.03
Electronic    0.00     73.68     0.00     4.44      7.84     3.28
Jazz          4.06      0.88    73.08     2.22      1.96    15.57
MetalPunk     0.63      6.14     0.00    86.67     29.41     3.28
PopRock       0.00      9.65    11.54     6.67     46.08    15.57
World        10.31      8.77    11.54     0.00      8.82    44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       262        2         0       0         3       33
Electronic      0       83         0       1         9        6
Jazz           17        1        20       0         6       20
MetalPunk       1        5         0      33        21        2
PopRock         0       17         4      10        51       10
World          40        6         2       1        12       51
Total         320      114        26      45       102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      81.88      1.75     0.00     0.00      2.94    27.05
Electronic    0.00     72.81     0.00     2.22      8.82     4.92
Jazz          5.31      0.88    76.92     0.00      5.88    16.39
MetalPunk     0.31      4.39     0.00    73.33     20.59     1.64
PopRock       0.00     14.91    15.38    22.22     50.00     8.20
World        12.50      5.26     7.69     2.22     11.76    41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       277        0         0       0         2       29
Electronic      0       83         0       1         5        2
Jazz            9        3        17       1         2       15
MetalPunk       1        5         1      35        24        7
PopRock         2       13         1       8        57       15
World          31       10         7       0        12       54
Total         320      114        26      45       102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      86.56      0.00     0.00     0.00      1.96    23.77
Electronic    0.00     72.81     0.00     2.22      4.90     1.64
Jazz          2.81      2.63    65.38     2.22      1.96    12.30
MetalPunk     0.31      4.39     3.85    77.78     23.53     5.74
PopRock       0.63     11.40     3.85    17.78     55.88    12.30
World         9.69      8.77    26.92     0.00     11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       289        5         0       0         3       18
Electronic      0       89         0       2         4        4
Jazz            2        3        19       0         1       10
MetalPunk       2        2         0      38        21        2
PopRock         0       12         5       4        61       11
World          27        3         2       1        12       77
Total         320      114        26      45       102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      90.31      4.39     0.00     0.00      2.94    14.75
Electronic    0.00     78.07     0.00     4.44      3.92     3.28
Jazz          0.63      2.63    73.08     0.00      0.98     8.20
MetalPunk     0.63      1.75     0.00    84.44     20.59     1.64
PopRock       0.00     10.53    19.23     8.89     59.80     9.02
World         8.44      2.63     7.69     2.22     11.76    63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3 (counts above, column percentages below; rows: classified genre; columns: actual genre).

(a) SMMFCC3 (counts)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           2     1          0        3     19
Electronic         0          86     0          1        7      5
Jazz               2           0    18          0        0      3
MetalPunk          1           4     0         35       18      2
PopRock            1          16     4          8       67     13
World             16           6     3          1        7     80
Total            320         114    26         45      102    122

(a) SMMFCC3 (column percentages, %)
Classic        93.75        1.75  3.85       0.00     2.94  15.57
Electronic      0.00       75.44  0.00       2.22     6.86   4.10
Jazz            0.63        0.00 69.23       0.00     0.00   2.46
MetalPunk       0.31        3.51  0.00      77.78    17.65   1.64
PopRock         0.31       14.04 15.38      17.78    65.69  10.66
World           5.00        5.26 11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           0     0          0        1     13
Electronic         0          90     1          2        9      6
Jazz               0           0    21          0        0      4
MetalPunk          0           2     0         31       21      2
PopRock            0          11     3         10       64     10
World             20          11     1          2        7     87
Total            320         114    26         45      102    122

(b) SMOSC3 (column percentages, %)
Classic        93.75        0.00  0.00       0.00     0.98  10.66
Electronic      0.00       78.95  3.85       4.44     8.82   4.92
Jazz            0.00        0.00 80.77       0.00     0.00   3.28
MetalPunk       0.00        1.75  0.00      68.89    20.59   1.64
PopRock         0.00        9.65 11.54      22.22    62.75   8.20
World           6.25        9.65  3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          296           2     1          0        0     17
Electronic         1          91     0          1        4      3
Jazz               0           2    19          0        0      5
MetalPunk          0           2     1         34       20      8
PopRock            2          13     4          8       71      8
World             21           4     1          2        7     81
Total            320         114    26         45      102    122

(c) SMASE3 (column percentages, %)
Classic        92.50        1.75  3.85       0.00     0.00  13.93
Electronic      0.31       79.82  0.00       2.22     3.92   2.46
Jazz            0.00        1.75 73.08       0.00     0.00   4.10
MetalPunk       0.00        1.75  3.85      75.56    19.61   6.56
PopRock         0.63       11.40 15.38      17.78    69.61   6.56
World           6.56        3.51  3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           2     0          0        0      8
Electronic         2          95     0          2        7      9
Jazz               1           1    20          0        0      0
MetalPunk          0           0     0         35       10      1
PopRock            1          10     3          7       79     11
World             16           6     3          1        6     93
Total            320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (column percentages, %)
Classic        93.75        1.75  0.00       0.00     0.00   6.56
Electronic      0.63       83.33  0.00       4.44     6.86   7.38
Jazz            0.31        0.88 76.92       0.00     0.00   0.00
MetalPunk       0.00        0.00  0.00      77.78     9.80   0.82
PopRock         0.31        8.77 11.54      15.56    77.45   9.02
World           5.00        5.26 11.54       2.22     5.88  76.23
Conventional methods use the energy of each modulation subband as the feature value. In this work, however, the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband are used as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) obtained with the MSC & MSV features and with the modulation subband energy (MSE) features

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19
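To make the comparison in Table 37 concrete, the fragment below takes one averaged modulation spectrum (one feature value over modulation frequency) and derives, for each logarithmically spaced modulation subband of Table 24, both the conventional subband energy and the MSP/MSV/MSC values used in this work. This is only an illustrative sketch assuming numpy; the variable names and the synthetic input are not taken from the thesis.

    import numpy as np

    # Modulation frequency bin ranges of the 8 modulation subbands (Table 24).
    SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

    def subband_features(mod_spectrum):
        """mod_spectrum: averaged modulation spectrum of one feature value (length >= 256)."""
        energy, msc, msv = [], [], []
        for lo, hi in SUBBANDS:
            band = mod_spectrum[lo:hi]
            energy.append(float(np.sum(band)))            # conventional feature: subband energy
            peak, valley = float(np.max(band)), float(np.min(band))
            msv.append(valley)                             # modulation spectral valley (MSV)
            msc.append(peak - valley)                      # modulation spectral contrast (MSC)
        return np.array(energy), np.array(msc), np.array(msv)

    # Toy example: a synthetic modulation spectrum with a rhythmic peak near bin 12.
    toy = np.exp(-0.02 * np.arange(256)) + 0.5 * np.exp(-0.5 * (np.arange(256) - 12.0) ** 2)
    e, c, v = subband_features(toy)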
Chapter 4
Conclusion
A novel feature set derived from the modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features has been proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.
References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, March 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
Chapter 1
Introduction
11 Motivation
With the development of computer networks it becomes more and more popular
to purchase and download digital music from the Internet However a general music
database often contains millions of music tracks Hence it is very difficult to manage
such a large digital music database For this reason it will be helpful to manage a vast
amount of music tracks when they are properly categorized in advance In general the
retail or online music stores often organize their collections of music tracks by
categories such as genre artist and album Usually the category information of a
music track is manually labeled by experienced managers. However, determining the music genre of a music track manually is laborious and time-consuming work. Therefore, a number of supervised classification techniques have been developed for the automatic classification of unlabeled music tracks [1-11]. In this study we focus on the music genre classification problem, which is defined as genre labeling of music tracks. Automatic music genre classification plays an important and preliminary role in music information retrieval systems: a new album or music track can be assigned to a proper genre in order to place it in the appropriate section of an online music store or music database.
To classify the music genre of a given music track some discriminating audio
features have to be extracted through content-based analysis of the music signal In
addition, many studies examine different classifiers to improve the classification performance; however, the resulting improvement is often limited. In fact, employing an effective feature set has a much greater effect on the classification accuracy than selecting a specific classifier [12]. In this study, a novel feature set derived from row-based and column-based modulation spectrum analysis is proposed for automatic music genre classification.
12 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine
the structure of the taxonomy that music pieces will be classified into However it is
hard to clearly define a universally agreed structure In general exploiting
hierarchical taxonomy structure for music genre classification has some merits (1)
People often prefer to search music by browsing the hierarchical catalogs (2)
Taxonomy structures identify the relationships or dependence between the music
genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification
approach to improve the classification efficiency and accuracy (3) The classification
errors become more acceptable by using taxonomy than direct music genre
classification The coarse-to-fine approach can make the classification errors
concentrate on a given level of the hierarchy
Burred and Lerch [13] have developed a hierarchical taxonomy for music genre
classification as shown in Fig 11 Rather than making a single decision to classify a
given music into one of all music genres (direct approach) the hierarchical approach
makes successive decisions at each branch point of the taxonomy hierarchy
Additionally appropriate and variant features can be employed at each branch point
of the taxonomy Therefore the hierarchical classification approach allows the
managers to trace at which level the classification errors occur frequently Barbedo
and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The
hierarchical structure was constructed in a bottom-up fashion instead of a top-down one, because it is easy to merge leaf classes into the same parent class in the bottom-up structure, so the upper layers can be easily constructed. In their experiments, the classification accuracy of the hierarchical bottom-up approach outperforms the top-down approach by about 3%-5%.
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach can reduce the time-consuming and
expensive task for manual construction of taxonomies It also helps to look for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\,\right|$

where the sign function returns 1 for positive input and 0 for negative input, and $x_t[n]$ is the time-domain signal of frame t.
(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum:

$C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$

where N is the length of the short-time Fourier transform (STFT) and $M_t[n]$ is the magnitude of the n-th frequency bin of the t-th frame.
(4) Spectral Bandwidth: the spectral bandwidth determines the frequency spread of the signal around the spectral centroid:

$SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$
(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency $R_t$ below which 85% of the magnitude distribution is concentrated:

$\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]$
(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitude spectra of successive frames:

$SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2$

where $N_t[k]$ and $N_{t-1}[k]$ are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
(7) Mel-Frequency Cepstral Coefficients (MFCC): MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In the human auditory system, the perceived pitch is not linear with respect to the physical frequency of the corresponding tone; the mapping between the physical frequency scale (Hz) and the perceived frequency scale (mel) is approximately linear below 1 kHz and logarithmic at higher frequencies. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals.
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained by summing the power spectrum within each logarithmic subband and converting it to a decibel scale. Each ASE coefficient is then normalized with the Root Mean Square (RMS) energy, yielding a normalized version of the ASE called NASE.
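As an illustration of the frame-based timbral features defined above, the following Python sketch computes ZCR, spectral centroid, spectral roll-off and spectral flux for each frame of a mono signal. It assumes numpy only; the frame size, hop size, and the 440 Hz test tone are illustrative choices, not values taken from the thesis.

    import numpy as np

    def timbral_features(x, frame_size=1024, hop=512, rolloff=0.85):
        """Frame-based ZCR, spectral centroid, roll-off and flux of a mono signal x."""
        feats = []
        prev_norm = None
        for start in range(0, len(x) - frame_size + 1, hop):
            frame = x[start:start + frame_size] * np.hamming(frame_size)
            mag = np.abs(np.fft.rfft(frame))
            # Zero-crossing rate with sign(.) in {0, 1}, as in the definition above.
            s = (frame > 0).astype(float)
            zcr = 0.5 * np.sum(np.abs(np.diff(s)))
            # Spectral centroid: magnitude-weighted mean frequency bin.
            bins = np.arange(1, len(mag) + 1)
            centroid = np.sum(bins * mag) / (np.sum(mag) + 1e-12)
            # Spectral roll-off: smallest bin below which 85% of the magnitude lies.
            cum = np.cumsum(mag)
            roll = np.searchsorted(cum, rolloff * cum[-1])
            # Spectral flux: squared difference of successive normalized spectra.
            norm = mag / (np.sum(mag) + 1e-12)
            flux = 0.0 if prev_norm is None else np.sum((norm - prev_norm) ** 2)
            prev_norm = norm
            feats.append([zcr, centroid, roll, flux])
        return np.array(feats)

    # Example: one second of a 440 Hz tone sampled at 22050 Hz.
    sr = 22050
    t = np.arange(sr) / sr
    features = timbral_features(np.sin(2 * np.pi * 440 * t))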
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most commonly used method to integrate the short-term features. Let $\mathbf{x}_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T$ denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1$

$\sigma[d] = \left(\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d]-\mu[d]\right)^2\right)^{1/2}, \quad 0 \le d \le D-1$

where T is the number of frames of the input signal. This statistical method captures no information about the relationship between features or about the time-varying behavior of music signals.
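A minimal sketch of this integration step, assuming the short-term features of a track are stacked in a T-by-D numpy array (one row per frame):

    import numpy as np

    def mean_std_integration(frame_features):
        """frame_features: T x D array of short-term feature vectors (one row per frame).
        Returns a single 2D-dimensional long-term vector [mean, std]."""
        mu = frame_features.mean(axis=0)       # mu[d]
        sigma = frame_features.std(axis=0)     # sigma[d]
        return np.concatenate([mu, sigma])

    # Example with random frame features (200 frames, 20 dimensions).
    long_term = mean_std_integration(np.random.rand(200, 20))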
12122 Autoregressive model (AR model)
Meng et al. [9] used AR models to analyze the time-varying texture of music signals. They proposed the diagonal autoregressive (DAR) and multivariate autoregressive (MAR) analyses to integrate the short-term features. In DAR, each short-term feature is independently modeled by an AR model; the extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR model and the AR model is that MAR considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
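To make the DAR idea concrete, the fragment below fits an independent AR model of order p to each feature trajectory by ordinary least squares and collects the AR coefficients together with the per-dimension mean and variance. This is only a schematic reading of the DAR integration in [9], assuming numpy, and not the authors' exact implementation.

    import numpy as np

    def dar_features(frame_features, p=3):
        """Diagonal AR integration: fit an order-p AR model to each feature dimension."""
        T, D = frame_features.shape
        out = []
        for d in range(D):
            x = frame_features[:, d]
            # Least-squares system x[t] ~ a1*x[t-1] + ... + ap*x[t-p].
            X = np.column_stack([x[p - k - 1:T - k - 1] for k in range(p)])
            y = x[p:]
            coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
            out.extend(coeffs)
            out.extend([x.mean(), x.var()])
        return np.array(out)          # dimension D * (p + 2)

    dar_vec = dar_features(np.random.rand(300, 10), p=3)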
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with discrimination between classes rather than representation of the individual classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix from the n-dimensional feature space to a d-dimensional space (d ≤ n) is determined; the transformation should enhance the separability among different classes, and it can be exploited to map each n-dimensional feature vector into a d-dimensional vector. The detailed steps will be described in Chapter 2.
In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled well by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all classes, which does not take class-wise differences into account.
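The following is a generic numerical sketch of the LDA projection described above (within-class and between-class scatter matrices followed by a generalized eigenproblem). The small ridge added to the within-class scatter and the random data are illustrative assumptions only; the derivation used in this thesis is given in Chapter 2 and may differ in details.

    import numpy as np

    def lda_transform(X, y, d):
        """X: N x n feature matrix, y: class labels, d: target dimension (d <= n).
        Returns the n x d projection maximizing between-class over within-class scatter."""
        n = X.shape[1]
        overall_mean = X.mean(axis=0)
        S_w = np.zeros((n, n))
        S_b = np.zeros((n, n))
        for c in np.unique(y):
            Xc = X[y == c]
            mc = Xc.mean(axis=0)
            S_w += (Xc - mc).T @ (Xc - mc)
            diff = (mc - overall_mean)[:, None]
            S_b += len(Xc) * (diff @ diff.T)
        # Solve S_w^-1 S_b w = lambda w and keep the d leading eigenvectors.
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(n), S_b))
        order = np.argsort(eigvals.real)[::-1]
        return eigvecs[:, order[:d]].real

    # Example: project 20-dimensional vectors of three classes down to 2 dimensions.
    X = np.random.rand(90, 20)
    y = np.repeat([0, 1, 2], 30)
    W = lda_transform(X, y, 2)
    Z = X @ W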
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco, Hip-Hop, Jazz, Rock, Blues, Reggae, Pop, and Metal. In Classical, the sub-genres are Choir, Orchestra, Piano, and String Quartet; in Jazz, the sub-genres are BigBand, Cool, Fusion, Piano, Quartet, and Swing. Their experimental results show that a GMM with three components achieves the best classification accuracy.
West and Cox [4] constructed a hierarchical frame-based music genre classification system, in which a majority vote over the frames is taken to decide the final classification. The genres adopted in their system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance, with and without a decision tree, of a single Gaussian classifier, a GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and the decision tree achieves the best accuracy of 82.79%.
Xu et al. [29] applied SVM to discriminate between pure music and vocal music.
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.
Bagci and Erzin [8] constructed a novel frame-based music genre classification system. In their system, invalid frames are first detected and discarded for classification purposes. To determine whether a frame is valid or not, a GMM model is constructed for each music genre. These GMM models are then used to sift out the frames which cannot be correctly classified, and the GMM model of a music genre is updated for each correctly classified frame; moreover, an additional GMM model is employed to represent the invalid frames. In their experiment, the feature vector includes 13 MFCC and 4 spectral shape features (spectral centroid, spectral roll-off, spectral flux, and zero-crossing rate), as well as the first- and second-order derivatives of these timbral features. Their musical genre dataset includes ten genre types: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. The classification accuracy can reach 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high-dissimilarity nodes. The experimental results show that, when the LDB feature vector is combined with MFCC and LDA analysis is used, the average classification accuracy is 91% for the first level (artificial and natural sounds), 99% for the second level (instrumental and automobile, human and nonhuman), and 95% for the third level (drums, flute and piano; aircraft and helicopter; male and female speech; animals, birds and insects).
Grimaldi et al. [11, 32] used a set of features based on the discrete wavelet packet transform (DWPT) to represent a music track. The discrete wavelet transform (DWT) is a well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. The wavelet packet transform (WPT) is a variant of the DWT obtained by recursively convolving the input signal with a pair of low-pass and high-pass filters; unlike the DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both subbands at each level.
Bergstra et al. [33] used AdaBoost for music classification. AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced. In Chapter 3, experiments are presented to show the effectiveness of the proposed method. Finally, a conclusion is given in Chapter 4.
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases: the training phase and the classification phase. The training phase is composed of two main modules: feature extraction and linear discriminant analysis (LDA). The classification phase consists of three modules: feature extraction, LDA transformation, and classification. The block diagram of the proposed music genre classification system is the same as that shown in Fig 13. A detailed description of each module is given below.
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis

$\hat{s}[n] = s[n] - a \cdot s[n-1]$  (1)

where s[n] is the current sample, s[n-1] is the previous sample, and a typical value for a is 0.95.

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames overlaps by M samples.

Step 3 Windowing

Each frame is multiplied by a Hamming window:

$\tilde{s}_i[n] = \hat{s}_i[n] \cdot w[n], \quad 0 \le n \le N-1$  (2)

where the Hamming window function w[n] is defined as

$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$  (3)

Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1$  (4)

where k is the frequency index.

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:

$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1$  (5)

where B is the total number of filters (B is 25 in the study), and $I_{b,l}$ and $I_{b,h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b,l}$ and $I_{b,h}$ are given as

$I_{b,l} = \frac{f_{b,l}}{f_s}\,N, \quad I_{b,h} = \frac{f_{b,h}}{f_s}\,N$  (6)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 21.

Step 6 Discrete Cosine Transform (DCT)

MFCC can be obtained by applying the DCT on the logarithm of E(b):

$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\big(1 + E_i(b)\big)\cos\!\left(\frac{\pi\, l\,(b+0.5)}{B}\right), \quad 0 \le l < L$  (7)

where L is the length of the MFCC feature vector (L is 20 in the study). Therefore, the MFCC feature vector can be represented as follows:

$\mathbf{x}^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T$  (8)
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
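A compact Python sketch of Steps 1-6, using the band edges of Table 21 as plain band sums, as in Eq (5). The sampling rate, frame size, and overlap used below are illustrative assumptions (the thesis gives N, M, and a symbolically); numpy is assumed.

    import numpy as np

    BAND_EDGES_HZ = [  # (low, high) of the 25 band-pass filters of Table 21
        (0, 200), (100, 300), (200, 400), (300, 500), (400, 600), (500, 700),
        (600, 800), (700, 900), (800, 1000), (900, 1149), (1000, 1320), (1149, 1516),
        (1320, 1741), (1516, 2000), (1741, 2297), (2000, 2639), (2297, 3031),
        (2639, 3482), (3031, 4000), (3482, 4595), (4000, 5278), (4595, 6063),
        (5278, 6964), (6063, 8000), (6964, 9190)]

    def mfcc(signal, fs=44100, N=1024, M=512, a=0.95, L=20):
        """Frame-based MFCC following Eqs (1)-(8); returns a (num_frames x L) array."""
        s = np.append(signal[0], signal[1:] - a * signal[:-1])              # Step 1: pre-emphasis
        window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))   # Step 3: Hamming window
        B = len(BAND_EDGES_HZ)
        feats = []
        for start in range(0, len(s) - N + 1, N - M):                       # Step 2: framing (overlap M)
            frame = s[start:start + N] * window
            A = np.abs(np.fft.fft(frame)) ** 2                              # Step 4: |X[k]|^2
            E = np.zeros(B)
            for b, (fl, fh) in enumerate(BAND_EDGES_HZ):                    # Step 5: band energies
                Il, Ih = int(fl / fs * N), int(fh / fs * N)
                E[b] = A[Il:Ih + 1].sum()
            l = np.arange(L)[:, None]
            bb = np.arange(B)[None, :]
            dct = np.cos(np.pi * l * (bb + 0.5) / B)                        # Step 6: DCT of log energies
            feats.append(dct @ np.log10(1.0 + E))
        return np.array(feats)

    coeffs = mfcc(np.random.randn(44100))   # one second of noise as a toy input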
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and FFT is then applied to obtain the corresponding spectrum of each frame.
Step 2 Octave-Scale Filtering

The spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 22. The octave-scale filtering operation can be described as follows:

$E_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \; 0 \le k \le N/2 - 1$  (9)

where B is the number of subbands, and $I_{b,l}$ and $I_{b,h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b,l}$ and $I_{b,h}$ are given as

$I_{b,l} = \frac{f_{b,l}}{f_s}\,N, \quad I_{b,h} = \frac{f_{b,h}}{f_s}\,N$  (10)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

$Peak(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)$  (11)

$Valley(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)$  (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

$SC(b) = Peak(b) - Valley(b)$  (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

$\mathbf{x}^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T$  (14)
Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
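A sketch of the peak/valley selection of Eqs (11)-(13) for one frame spectrum, using the octave bands of Table 22 and the neighborhood factor α = 0.2 given in the text. The FFT size, sampling rate, and the random test spectrum are illustrative assumptions; numpy is assumed.

    import numpy as np

    OCTAVE_EDGES_HZ = [(0, 0), (0, 100), (100, 200), (200, 400), (400, 800), (800, 1600),
                       (1600, 3200), (3200, 6400), (6400, 12800), (12800, 22050)]

    def osc(frame_spectrum, fs=44100, N=1024, alpha=0.2):
        """Octave-based spectral contrast of one frame (Eqs (9)-(14))."""
        valleys, contrasts = [], []
        for fl, fh in OCTAVE_EDGES_HZ:
            Il, Ih = int(fl / fs * N), int(fh / fs * N)
            band = np.sort(frame_spectrum[Il:Ih + 1])[::-1]     # magnitudes, descending
            nb = len(band)
            k = max(1, int(round(alpha * nb)))                   # alpha * Nb strongest / weakest bins
            peak = np.log(1e-12 + band[:k].mean())
            valley = np.log(1e-12 + band[-k:].mean())
            valleys.append(valley)
            contrasts.append(peak - valley)
        return np.array(valleys + contrasts)                     # [Valley(0..B-1), SC(0..B-1)]

    x_osc = osc(np.abs(np.random.randn(513)), fs=44100, N=1024)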
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

$P(k) = \begin{cases} \dfrac{1}{N \cdot E_w}\,|X(k)|^2, & k = 0 \\ \dfrac{2}{N \cdot E_w}\,|X(k)|^2, & 0 < k < \dfrac{N}{2} \end{cases}$  (15)

where $E_w$ is the energy of the Hamming window function w(n) of size $N_w$:

$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2$  (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a range of 8 octaves (see Fig 24). The NASE-scale filtering operation can be described as follows (see Table 23):

$ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B, \; 0 \le k \le N/2 - 1$  (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in the study):

$r = 2^j \text{ octaves}, \quad -4 \le j \le 3$  (18)

$I_{b,l}$ and $I_{b,h}$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

$I_{b,l} = \frac{f_{b,l}}{f_s}\,N, \quad I_{b,h} = \frac{f_{b,h}}{f_s}\,N$  (19)

where $f_s$ is the sampling frequency, and $f_{b,l}$ and $f_{b,h}$ are the low frequency and high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

$ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1$  (20)

Each ASE coefficient is then converted to the decibel scale:

$ASE_{dB}(b) = 10\log_{10}\big(ASE(b)\big), \quad 0 \le b \le B+1$  (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1$  (22)

where the RMS-norm gain value R is defined as

$R = \sqrt{\sum_{b=0}^{B+1} \big(ASE_{dB}(b)\big)^2}$  (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, one coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame will be represented as follows:

$\mathbf{x}^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T$  (24)
Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 coefficients in logarithmically spaced bands between 62.5 Hz and hiEdge = 16 kHz, and one coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
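A sketch of the NASE computation of Eqs (15)-(24) for a single frame, using the subband edges of Table 23. The FFT size, sampling rate, and random test frame are illustrative assumptions; numpy is assumed.

    import numpy as np

    NASE_EDGES_HZ = [(0, 62), (62, 88), (88, 125), (125, 176), (176, 250), (250, 353),
                     (353, 500), (500, 707), (707, 1000), (1000, 1414), (1414, 2000),
                     (2000, 2828), (2828, 4000), (4000, 5656), (5656, 8000),
                     (8000, 11313), (11313, 16000), (16000, 22050)]

    def nase(frame, fs=44100):
        """Normalized audio spectral envelope of one frame (Eqs (15)-(24))."""
        N = len(frame)
        w = np.hamming(N)
        Ew = np.sum(w ** 2)                                    # Eq (16): window energy
        X = np.fft.fft(frame * w)
        P = np.abs(X[:N // 2]) ** 2 / (N * Ew)                 # Eq (15), k = 0 term
        P[1:] *= 2.0                                           # factor 2 for 0 < k < N/2
        ase = np.array([P[int(fl / fs * N):int(fh / fs * N) + 1].sum()
                        for fl, fh in NASE_EDGES_HZ])          # Eq (20): subband power
        ase_db = 10.0 * np.log10(ase + 1e-12)                  # Eq (21): decibel scale
        R = np.sqrt(np.sum(ase_db ** 2))                       # Eq (23): RMS-norm gain value
        return np.concatenate([[R], ase_db / R])               # Eq (24): [R, NASE(0..B+1)]

    vec = nase(np.random.randn(1024))    # 19-dimensional NASE vector (B + 3 with B = 16)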
214 Modulation Spectral Analysis
MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on the MFCC, OSC, and NASE trajectories to observe the variations of the sound.
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis

Let $MFCC_i(l)$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \cdot W/2 + n}(l)\, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le l < L$  (25)

where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W, \; 0 \le l < L$  (26)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands (J = 8 in the study). The frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as

$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)$  (27)

$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{MFCC}(m, l)$  (28)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)$  (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
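A sketch of the modulation spectral analysis of Steps 2-3, applicable to any of the MFCC/OSC/NASE feature trajectories: FFT along the time trajectory of each feature value inside 50%-overlapped texture windows of W = 512 frames, time averaging of the magnitude modulation spectrograms, and MSC/MSV extraction over the modulation subbands of Table 24. Numpy and the random toy trajectory are assumptions for the example.

    import numpy as np

    SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

    def modulation_msc_msv(feature_trajectories, W=512):
        """feature_trajectories: T_frames x D array of short-term feature values.
        Returns (MSC, MSV), each D x J (rows = feature values, columns = modulation subbands)."""
        T, D = feature_trajectories.shape
        hop = W // 2                                             # 50% overlap between texture windows
        spectra = []
        for start in range(0, T - W + 1, hop):
            window = feature_trajectories[start:start + W, :]    # W x D texture window
            spectra.append(np.abs(np.fft.fft(window, axis=0)))   # magnitude modulation spectrogram
        mean_spec = np.mean(spectra, axis=0)                     # time-averaged modulation spectrum
        J = len(SUBBANDS)
        MSC = np.zeros((D, J))
        MSV = np.zeros((D, J))
        for j, (lo, hi) in enumerate(SUBBANDS):
            band = mean_spec[lo:hi, :]
            MSV[:, j] = band.min(axis=0)                         # modulation spectral valley
            MSC[:, j] = band.max(axis=0) - MSV[:, j]             # modulation spectral contrast
        return MSC, MSV

    # Example: 2000 frames of a 20-dimensional feature (e.g. MFCC) trajectory.
    msc, msv = modulation_msc_msv(np.random.rand(2000, 20))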
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let $OSC_i(d)$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \cdot W/2 + n}(d)\, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D$  (30)

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D$  (31)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands (J = 8 in the study; see Table 24). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as

$MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)$  (32)

$MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{OSC}(m, d)$  (33)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)$  (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let $NASE_i(d)$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:

$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \cdot W/2 + n}(d)\, e^{-j 2\pi m n / W}, \quad 0 \le m < W, \; 0 \le d < D$  (35)

where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In the study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time averaging the magnitude modulation spectrograms of all texture windows:

$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W, \; 0 \le d < D$  (36)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands (J = 8 in the study; see Table 24). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as

$MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)$  (37)

$MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{NASE}(m, d)$  (38)

where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)$  (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (framing → NASE extraction → DFT of each feature trajectory within texture windows → averaged modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T   (50)

In summary, the row-based feature vector is of size 4L = 4×20 = 80 and the
column-based feature vector is of size 4J = 4×8 = 32 Combining the row-based and
column-based modulation spectral feature vectors results in a feature vector of
length 4L+4J That is, the overall feature dimension of SMMFCC is 80+32 = 112
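To make the aggregation concrete, the following sketch is an illustrative NumPy implementation of the row-based and column-based statistics (it is not taken from the thesis; the array names msc and msv are hypothetical, and the element ordering of Eqs. (44) and (49) is not reproduced exactly, which does not affect the classifier).

import numpy as np

def aggregate_modulation_features(msc, msv):
    # msc, msv: arrays of shape (L, J); rows = cepstral feature index l,
    # columns = modulation subband j, as in the L x J MSC/MSV matrices.
    row_stats, col_stats = [], []
    for mat in (msc, msv):
        row_stats += [mat.mean(axis=1), mat.std(axis=1)]   # 2L values per matrix
        col_stats += [mat.mean(axis=0), mat.std(axis=0)]   # 2J values per matrix
    f_row = np.concatenate(row_stats)                      # 4L values (row-based part)
    f_col = np.concatenate(col_stats)                      # 4J values (column-based part)
    return np.concatenate([f_row, f_col])                  # 4L + 4J values

# Example: L = 20 MFCC trajectories and J = 8 modulation subbands give 112 values
msc = np.random.rand(20, 8)
msv = np.random.rand(20, 8)
print(aggregate_modulation_features(msc, msv).shape)       # (112,)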
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows

u_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)

\sigma_{MSC-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - u_{MSC-row}^{OSC}(d) \right)^2 \right)^{1/2}   (52)

u_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)

\sigma_{MSV-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - u_{MSV-row}^{OSC}(d) \right)^2 \right)^{1/2}   (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_{row}^{OSC} = [u_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), u_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), \ldots, u_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), u_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T   (55)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows

u_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)

\sigma_{MSC-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - u_{MSC-col}^{OSC}(j) \right)^2 \right)^{1/2}   (57)

u_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)

\sigma_{MSV-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - u_{MSV-col}^{OSC}(j) \right)^2 \right)^{1/2}   (59)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{OSC} = [u_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots, u_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), u_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T   (60)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T   (61)

In summary, the row-based feature vector is of size 4D = 4×20 = 80 and the
column-based feature vector is of size 4J = 4×8 = 32 Combining the row-based and
column-based modulation spectral feature vectors results in a feature vector of
length 4D+4J That is, the overall feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows

u_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - u_{MSC-row}^{NASE}(d) \right)^2 \right)^{1/2}   (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - u_{MSV-row}^{NASE}(d) \right)^2 \right)^{1/2}   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_{row}^{NASE} = [u_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, u_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), u_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T   (66)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows

u_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - u_{MSC-col}^{NASE}(j) \right)^2 \right)^{1/2}   (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - u_{MSV-col}^{NASE}(j) \right)^2 \right)^{1/2}   (70)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{NASE} = [u_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, u_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), u_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T   (71)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T   (72)

In summary, the row-based feature vector is of size 4D = 4×19 = 76 and the
column-based feature vector is of size 4J = 4×8 = 32 Combining the row-based and
column-based modulation spectral feature vectors results in a feature vector of
length 4D+4J That is, the overall feature dimension of SMASE is 76+32 = 108
Fig 28 The row-based modulation spectral aggregation (mean u_row and standard deviation σ_row computed along each row of the MSC/MSV matrices, i.e. across the modulation frequency axis)

Fig 29 The column-based modulation spectral aggregation (mean u_col and standard deviation σ_col computed along each column of the MSC/MSV matrices, i.e. across the feature dimension axis)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges of different feature values may be different, a linear normalization is
applied to get the normalized feature vector \hat{f}_c

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C   (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th
representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals

f_{max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m), \quad f_{min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy in a lower-dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)
where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class
c, C is the total number of music classes, and N_c is the number of training vectors
labeled as class c The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)
where \bar{x} is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right)   (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the
corresponding eigenvalues, so that S_W Φ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ^{-1/2}

w = (ΦΛ^{-1/2})^T x   (79)
It can be shown that the whitened within-class scatter matrix
S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}) derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C-1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A_{WLDA} is defined as

A_{WLDA} = ΦΛ^{-1/2} Ψ   (80)
A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower
h-dimensional vector Let x denote the H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

y = A_{WLDA}^T x   (81)
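The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows (an illustrative NumPy implementation under the assumption that S_W is nonsingular; function and variable names are hypothetical, not the thesis code):

import numpy as np

def whitened_lda(X, labels, num_classes):
    # X: (N, H) training vectors; labels: (N,) integer class labels in [0, num_classes).
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        d = Xc - mean_c
        Sw += d.T @ d                                   # Eq. (76): within-class scatter
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)                 # Eq. (77): between-class scatter
    evals, Phi = np.linalg.eigh(Sw)                     # Sw = Phi Lambda Phi^T
    evals = np.maximum(evals, 1e-10)                    # guard against near-zero eigenvalues
    W = Phi @ np.diag(1.0 / np.sqrt(evals))             # whitening transform Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                                 # whitened between-class scatter
    evals_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals_b)[::-1][:num_classes - 1] # top C-1 eigenvectors
    return W @ Psi[:, order]                            # Eq. (80): A_WLDA

# Projection of Eq. (81): Y = X @ A maps H-dimensional vectors to C-1 dimensions.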
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_{WLDA} Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 ≤ c ≤ C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
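The classification rule of Eqs. (82)-(83) amounts to a nearest-centroid decision in the whitened LDA space; a minimal illustrative sketch follows (hypothetical names, not the thesis code):

import numpy as np

def train_centroids(Y, labels, num_classes):
    # Mean of the whitened-LDA-transformed vectors per genre (Eq. 82).
    return np.stack([Y[labels == c].mean(axis=0) for c in range(num_classes)])

def classify(y, centroids):
    # Return the genre index whose centroid is closest in Euclidean distance (Eq. 83).
    distances = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(distances))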
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical 115/114
tracks of Electronic 26/26 tracks of JazzBlue 45/45 tracks of MetalPunk 101/102
tracks of RockPop and 122/122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)
where P_c is the probability of appearance of the c-th music genre and CA_c is the
classification accuracy for the c-th music genre
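Equation (84) is simply a class-prior-weighted average of the per-genre accuracies; as an illustrative computation (the per-genre accuracies are taken, for the sake of example, from the diagonal of Table 36(d) later in this chapter), a short Python check is:

counts = [320, 114, 26, 45, 102, 122]                     # test tracks per genre
CA = [0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623]     # per-genre accuracies (Table 36(d))
P = [c / sum(counts) for c in counts]                     # appearance probabilities P_c
overall = sum(p * ca for p, ca in zip(P, CA))             # about 0.853, i.e. 85.3%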
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA%) for row-based modulation spectral feature vectors
Feature Set  CA (%)
SMMFCC1  77.50
SMOSC1  79.15
SMASE1  77.78
SMMFCC1+SMOSC1+SMASE1  84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From Table 33 we can see
that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2
which is different from the row-based case As in the row-based case the combined
feature vector again achieves the best performance Table 34 shows the corresponding
confusion matrices
Table 33 Averaged classification accuracy (CA%) for column-based modulation spectral feature vectors
Feature Set  CA (%)
SMMFCC2  70.64
SMOSC2  68.59
SMASE2  71.74
SMMFCC2+SMOSC2+SMASE2  78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table 31 and Table 33 we can see that
the combined feature vector achieves better classification performance than each
individual row-based or column-based feature vector In particular the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32% Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA%) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set  CA (%)
SMMFCC3  80.38
SMOSC3  81.34
SMASE3  81.21
SMMFCC3+SMOSC3+SMASE3  85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the energy of each modulation subband as the feature value
Feature Set  MSCs & MSVs  MSE
SMMFCC1  77.50  72.02
SMMFCC2  70.64  69.82
SMMFCC3  80.38  79.15
SMOSC1  79.15  77.50
SMOSC2  68.59  70.51
SMOSC3  81.34  80.11
SMASE1  77.78  76.41
SMASE2  71.74  71.06
SMASE3  81.21  79.15
SMMFCC1+SMOSC1+SMASE1  84.64  85.08
SMMFCC2+SMOSC2+SMASE2  78.60  79.01
SMMFCC3+SMOSC3+SMASE3  85.32  85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32% which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119-139
Chapter 1
Introduction
11 Motivation
With the development of computer networks it becomes more and more popular
to purchase and download digital music from the Internet However a general music
database often contains millions of music tracks Hence it is very difficult to manage
such a large digital music database For this reason it will be helpful to manage a vast
amount of music tracks when they are properly categorized in advance In general the
retail or online music stores often organize their collections of music tracks by
categories such as genre artist and album Usually the category information of a
music track is manually labeled by experienced managers But to determine the
music genre of a music track by experienced managers is a laborious and
time-consuming work Therefore a number of supervised classification techniques
have been developed for automatic classification of unlabeled music tracks
[1-11] Thus in this study we focus on the music genre classification problem which is
defined as genre labeling of music tracks An automatic music genre classification
system therefore plays an important and preliminary role in music information retrieval
systems A new album or music track can be assigned to a proper genre in order to
place it in the appropriate section of an online music store or music database
To classify the music genre of a given music track some discriminating audio
features have to be extracted through content-based analysis of the music signal In
addition many studies try to examine a set of classifiers to improve the classification
performance However the improvement is often limited In fact employing
effective feature sets has a much greater effect on the classification accuracy than
selecting a specific classifier [12] In this study a novel feature set derived from
row-based and column-based modulation spectrum analysis is proposed for
automatic music genre classification
12 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine
the structure of the taxonomy that music pieces will be classified into However it is
hard to clearly define a universally agreed structure In general exploiting
hierarchical taxonomy structure for music genre classification has some merits (1)
People often prefer to search music by browsing the hierarchical catalogs (2)
Taxonomy structures identify the relationships or dependence between the music
genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification
approach to improve the classification efficiency and accuracy (3) The classification
errors become more acceptable by using taxonomy than direct music genre
classification The coarse-to-fine approach can make the classification errors
concentrate on a given level of the hierarchy
Burred and Lerch [13] have developed a hierarchical taxonomy for music genre
classification as shown in Fig 11 Rather than making a single decision to classify a
given music into one of all music genres (direct approach) the hierarchical approach
makes successive decisions at each branch point of the taxonomy hierarchy
Additionally appropriate and variant features can be employed at each branch point
of the taxonomy Therefore the hierarchical classification approach allows the
managers to trace at which level the classification errors occur frequently Barbedo
and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The
hierarchical structure was constructed in a bottom-up manner instead of a
top-down manner This is because it is easier to merge leaf classes into the same
parent class in the bottom-up structure so the upper layers can be easily
constructed In their experimental results the classification accuracy of the
hierarchical bottom-up approach outperforms the top-down approach by about 3 - 5%
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach can reduce the time-consuming and
expensive task for manual construction of taxonomies It also helps to look for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
ZCR_t = \frac{1}{2} \sum_{n=1}^{N-1} \left| sign(x_t[n]) - sign(x_t[n-1]) \right|
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
C_t = \frac{\sum_{n=1}^{N} n \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency R_t below which 85% of the magnitude distribution is concentrated

\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2
where Nt[n] and Nt-1[n] are the normalized magnitude spectrum of the t-th frame and
the (t-1)-th frame respectively
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope (NASE) NASE is defined in the MPEG-7
standard [17] First the audio spectral envelope (ASE) is obtained from the sum of the
log power spectrum in each logarithmic subband Then each ASE coefficient is
normalized with the root-mean-square (RMS) energy yielding a normalized version
of the ASE called NASE (a short illustrative sketch of several of these short-term descriptors is given below)
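To illustrate a few of these short-term descriptors, the following sketch computes ZCR, spectral centroid, roll-off and flux for one audio frame; it is illustrative NumPy code under simplifying assumptions (hypothetical function and variable names), not the exact implementation used in the thesis.

import numpy as np

def timbral_features(frame, prev_mag=None, rolloff=0.85):
    # Zero-crossing rate: sign() returns 1 for positive and 0 for negative samples.
    sign = (frame >= 0).astype(int)
    zcr = 0.5 * np.abs(np.diff(sign)).sum()

    # Magnitude spectrum of the frame.
    mag = np.abs(np.fft.rfft(frame))
    bins = np.arange(1, len(mag) + 1)
    centroid = (bins * mag).sum() / (mag.sum() + 1e-12)        # spectral centroid

    # Roll-off: first bin below which 85% of the magnitude is concentrated.
    cumulative = np.cumsum(mag)
    rolloff_bin = int(np.searchsorted(cumulative, rolloff * cumulative[-1]))

    # Spectral flux: squared difference of normalized magnitudes of successive frames.
    flux = 0.0
    if prev_mag is not None:
        n1 = mag / (np.linalg.norm(mag) + 1e-12)
        n0 = prev_mag / (np.linalg.norm(prev_mag) + 1e-12)
        flux = ((n1 - n0) ** 2).sum()
    return zcr, centroid, rolloff_bin, flux, mag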
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers The main difference is that no fundamental frequency
chord key or other high-level feature has to be determined in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let x_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation are
calculated as follows

\mu[d] = \frac{1}{T} \sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1

\sigma[d] = \left( \frac{1}{T} \sum_{i=0}^{T-1} (x_i[d] - \mu[d])^2 \right)^{1/2}, \quad 0 \le d \le D-1
where T is the number of frames of the input signal This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p times D times D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complex to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
doesn't consider the class-wise differences
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical framed based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance with/without a decision tree
classifier for a Gaussian classifier, GMM with three components, and LDA In their
experiment the feature vector with GMM classifier and decision tree classifier has the
best accuracy of 82.79%
Xu et al [29] applied SVM to discriminate between pure music and vocal one
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] use some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 93.0% for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can reach up to 88.60% when the frame length is 30 s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91% (artificial and natural
sounds) for the second level is 99% (instrumental and automobile human and
nonhuman) and 95% for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband the WPT decomposes both
bands at each level
Bergstra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
\hat{s}[n] = s[n] - a \times s[n-1]   (1)
where s[n] is the current sample and s[n-1] is the previous sample; a typical
value for a is 0.95
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
\tilde{s}_i[n] = \hat{s}_i[n] \, w[n], \quad 0 \le n \le N-1   (2)
where the Hamming window function w[n] is defined as
w[n] = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N-1   (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1   (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \ 0 \le k \le N-1   (5)

where B is the total number of filters (B is 25 in the study), I_{b,l} and I_{b,h}
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter, and A_i[k] is the squared amplitude of X_i[k], that is,
A_i[k] = |X_i[k]|^2
I_{b,l} and I_{b,h} are given as

I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N}   (6)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\left( 1 + E(b) \right) \cos\left( \frac{\pi l (b + 0.5)}{B} \right), \quad 0 \le l < L   (7)
where L is the length of MFCC feature vector (L is 20 in the study)
Therefore the MFCC feature vector can be represented as follows

x_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T   (8)
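A compact sketch of steps 1-6 is given below as illustrative NumPy code (not the thesis implementation); band_edges is a hypothetical list of the Mel filter boundaries of Table 21, and the default parameters are assumptions.

import numpy as np

def mfcc_frame(frame, band_edges, fs=44100, n_coeff=20, a=0.95):
    # band_edges: list of (f_low, f_high) pairs in Hz, one per band-pass filter (Table 21).
    N = len(frame)
    pre = np.append(frame[0], frame[1:] - a * frame[:-1])       # Eq. (1): pre-emphasis
    windowed = pre * np.hamming(N)                               # Eqs. (2)-(3): Hamming window
    A = np.abs(np.fft.fft(windowed)) ** 2                        # Eq. (4): squared magnitude
    E = []
    for f_lo, f_hi in band_edges:                                # Eq. (5): subband energies
        k_lo, k_hi = int(f_lo / (fs / N)), int(f_hi / (fs / N))
        E.append(A[k_lo:k_hi + 1].sum())
    E = np.log10(1.0 + np.array(E))
    B = len(E)
    b = np.arange(B)
    mfcc = np.array([(E * np.cos(np.pi * l * (b + 0.5) / B)).sum()   # Eq. (7): DCT
                     for l in range(n_coeff)])
    return mfcc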
Fig 21 The flowchart for computing MFCC
Table 21 The range of each triangular band-pass filter
Filter number  Frequency interval (Hz)
0  (0, 200]
1  (100, 300]
2  (200, 400]
3  (300, 500]
4  (400, 600]
5  (500, 700]
6  (600, 800]
7  (700, 900]
8  (800, 1000]
9  (900, 1149]
10  (1000, 1320]
11  (1149, 1516]
12  (1320, 1741]
13  (1516, 2000]
14  (1741, 2297]
15  (2000, 2639]
16  (2297, 3031]
17  (2639, 3482]
18  (3031, 4000]
19  (3482, 4595]
20  (4000, 5278]
21  (4595, 6063]
22  (5278, 6964]
23  (6063, 8000]
24  (6964, 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E(b) = \sum_{k=I_{b,l}}^{I_{b,h}} A_i[k], \quad 0 \le b < B, \ 0 \le k \le N-1   (9)

where B is the number of subbands, I_{b,l} and I_{b,h} denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter, and
A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2
I_{b,l} and I_{b,h} are given as

I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N}   (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_b}) denote the magnitude spectrum within the b-th
subband, where N_b is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b} The spectral peak and
spectral valley in the b-th subband are then estimated as follows

Peak(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i} \right)   (11)

Valley(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1} \right)   (12)
where α is a neighborhood factor (α is 0.2 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
SC(b) = Peak(b) - Valley(b)   (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
x_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T   (14)
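A sketch of the peak/valley estimation within one octave subband is shown below (illustrative NumPy code with hypothetical names; alpha and the subband magnitudes follow the definitions above):

import numpy as np

def subband_peak_valley(subband_mag, alpha=0.2):
    # Spectral peak, valley and contrast of one octave subband (Eqs. 11-13).
    m = np.sort(subband_mag)[::-1]            # descending magnitudes M_b,1 >= ... >= M_b,Nb
    n_b = len(m)
    k = max(1, int(round(alpha * n_b)))       # number of neighbouring bins averaged
    peak = np.log(m[:k].mean() + 1e-12)       # Eq. (11)
    valley = np.log(m[-k:].mean() + 1e-12)    # Eq. (12)
    return peak, valley, peak - valley        # Eq. (13): spectral contrast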
Fig 22 The flowchart for computing OSC
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)
Filter number  Frequency interval (Hz)
0  [0, 0]
1  (0, 100]
2  (100, 200]
3  (200, 400]
4  (400, 800]
5  (800, 1600]
6  (1600, 3200]
7  (3200, 6400]
8  (6400, 12800]
9  (12800, 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N,
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
P(k) = \begin{cases} \dfrac{1}{N \cdot E_w} |X(k)|^2, & k = 0,\ N/2 \\ \dfrac{2}{N \cdot E_w} |X(k)|^2, & 0 < k < N/2 \end{cases}   (15)
where E_w is the energy of the Hamming window function w(n) of size N_w

E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2   (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a
spectrum of 8 octaves (see Fig 24) The NASE scale filtering
operation can be described as follows (see Table 23)
ASE_i(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P_i(k), \quad 0 \le b < B, \ 0 \le k \le N-1   (17)
where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of
the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and
r = 1/2 in the study)

r = 2^j octaves, \quad -4 \le j \le 3   (18)
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
I_{b,l} = \frac{f_{b,l}}{f_s / N}, \quad I_{b,h} = \frac{f_{b,h}}{f_s / N}   (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband

ASE(b) = \sum_{k=I_{b,l}}^{I_{b,h}} P(k), \quad 0 \le b \le B+1   (20)
Each ASE coefficient is then converted to the decibel scale
ASE_{dB}(b) = 10 \log_{10}\left( ASE(b) \right), \quad 0 \le b \le B+1   (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1   (22)
where the RMS-norm gain value R is defined as
R = \left( \sum_{b=0}^{B+1} \left( ASE_{dB}(b) \right)^2 \right)^{1/2}   (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
x_{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T   (24)
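A sketch of the NASE computation for one frame follows, as illustrative NumPy code (not the thesis implementation); band_edges stands for the logarithmic subband boundaries of Table 23, including the bands below loEdge and above hiEdge, and is a hypothetical input.

import numpy as np

def nase_frame(frame, band_edges, fs=44100):
    # Returns [R, NASE(0), ..., NASE(B+1)] for one frame (Eqs. 15-24).
    N = len(frame)
    w = np.hamming(N)
    Ew = (w ** 2).sum()                                    # Eq. (16): window energy
    X = np.fft.fft(frame * w)
    P = (2.0 / (N * Ew)) * np.abs(X[:N // 2 + 1]) ** 2     # Eq. (15), inner bins
    P[0] /= 2.0                                            # k = 0 uses the 1/(N*Ew) scaling
    P[-1] /= 2.0                                           # k = N/2 uses the 1/(N*Ew) scaling
    ase = []
    for f_lo, f_hi in band_edges:                          # Eq. (20): per-subband power
        k_lo, k_hi = int(f_lo / (fs / N)), int(f_hi / (fs / N))
        ase.append(P[k_lo:k_hi + 1].sum())
    ase_db = 10.0 * np.log10(np.array(ase) + 1e-12)        # Eq. (21): decibel scale
    R = np.sqrt((ase_db ** 2).sum())                       # Eq. (23): RMS-norm gain
    return np.concatenate([[R], ase_db / R])               # Eqs. (22), (24)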
Fig 23 The flowchart for computing NASE
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (1 coefficient below loEdge = 62.5 Hz, 16 coefficients in the logarithmic bands up to hiEdge = 16 kHz, and 1 coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter
Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
214 Modulation Spectral Analysis
MFCC, OSC, and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals, we
employ modulation spectral analysis on MFCC, OSC, and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W:
M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times W + n}[l]\; e^{-j\frac{2\pi n m}{W}}, \quad 0 \le m < W,\ 0 \le l < L   (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} \left|M_t(m, l)\right|, \quad 0 \le m < W,\ 0 \le l < L   (26)
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{MFCC}(j, l) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{MFCC}(m, l)   (27)
MSV^{MFCC}(j, l) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{MFCC}(m, l)   (28)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)   (29)
As a result, all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2×20×8 = 320
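The following is a minimal sketch of Steps 2-3 for an arbitrary frame-level feature trajectory (MFCC here, but the same routine applies to OSC and NASE in the next subsections), assuming the frame-level features have already been extracted; the function name, the default texture-window length, and the hard-coded subband table are illustrative.

```python
import numpy as np

# Modulation subband index ranges of Table 2.4
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_contrast(feat, W=512):
    """feat: (num_frames, L) trajectory of feature values, one row per frame.
    Returns the L x J MSC and MSV matrices of Eqs. (27)-(29)."""
    hop = W // 2                                            # 50% overlap between texture windows
    mags = [np.abs(np.fft.fft(feat[s:s + W], axis=0))       # Eq. (25): FFT along each trajectory
            for s in range(0, feat.shape[0] - W + 1, hop)]
    M = np.mean(mags, axis=0)                               # Eq. (26): average over texture windows
    msc, msv = [], []
    for lo, hi in SUBBANDS:
        band = M[lo:hi]
        peak, valley = band.max(axis=0), band.min(axis=0)   # Eqs. (27)-(28)
        msc.append(peak - valley)                           # Eq. (29)
        msv.append(valley)
    return np.array(msc).T, np.array(msv).T                 # L x J matrices
```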
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:
M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times W + n}[d]\; e^{-j\frac{2\pi n m}{W}}, \quad 0 \le m < W,\ 0 \le d < D   (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} \left|M_t(m, d)\right|, \quad 0 \le m < W,\ 0 \le d < D   (31)
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{OSC}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{OSC}(m, d)   (32)
MSV^{OSC}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{OSC}(m, d)   (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)   (34)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MOSC is 2×20×8 = 320
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:
M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times W + n}[d]\; e^{-j\frac{2\pi n m}{W}}, \quad 0 \le m < W,\ 0 \le d < D   (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} \left|M_t(m, d)\right|, \quad 0 \le m < W,\ 0 \le d < D   (36)
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24)
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
MSP^{NASE}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{NASE}(m, d)   (37)
MSV^{NASE}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{NASE}(m, d)   (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)   (39)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MASE is 2×19×8 = 304
Fig 27 The flowchart for extracting MASE (framing, NASE extraction, DFT along each feature trajectory, windowing/averaging of the modulation spectrum, and contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectral/cepstral feature values (see Fig 29)
To reduce the dimension of the feature space, the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
\mu_{MSC-row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l)   (40)
\sigma_{MSC-row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC-row}^{MFCC}(l)\right)^2\right)^{1/2}   (41)
\mu_{MSV-row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l)   (42)
\sigma_{MSV-row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV-row}^{MFCC}(l)\right)^2\right)^{1/2}   (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
f_{row}^{MFCC} = \left[\mu_{MSC-row}^{MFCC}(0),\ \sigma_{MSC-row}^{MFCC}(0),\ \mu_{MSV-row}^{MFCC}(0),\ \sigma_{MSV-row}^{MFCC}(0),\ \ldots,\ \mu_{MSC-row}^{MFCC}(L-1),\ \sigma_{MSC-row}^{MFCC}(L-1),\ \mu_{MSV-row}^{MFCC}(L-1),\ \sigma_{MSV-row}^{MFCC}(L-1)\right]^T   (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
\mu_{MSC-col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l)   (45)
\sigma_{MSC-col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC-col}^{MFCC}(j)\right)^2\right)^{1/2}   (46)
\mu_{MSV-col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l)   (47)
\sigma_{MSV-col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV-col}^{MFCC}(j)\right)^2\right)^{1/2}   (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{MFCC} = \left[\mu_{MSC-col}^{MFCC}(0),\ \sigma_{MSC-col}^{MFCC}(0),\ \mu_{MSV-col}^{MFCC}(0),\ \sigma_{MSV-col}^{MFCC}(0),\ \ldots,\ \mu_{MSC-col}^{MFCC}(J-1),\ \sigma_{MSC-col}^{MFCC}(J-1),\ \mu_{MSV-col}^{MFCC}(J-1),\ \sigma_{MSV-col}^{MFCC}(J-1)\right]^T   (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained
f^{MFCC} = \left[\left(f_{row}^{MFCC}\right)^T, \left(f_{col}^{MFCC}\right)^T\right]^T   (50)
In summary the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
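The aggregation itself reduces to taking means and standard deviations along the two axes of the MSC and MSV matrices. Below is a minimal sketch under that reading; note that, for compactness, the statistics are grouped rather than interleaved per coefficient as in Eqs. (44) and (49), which does not change the information carried by the vector. The function name is illustrative.

```python
import numpy as np

def aggregate_msc_msv(msc, msv):
    """msc, msv: L x J (or D x J) matrices from the modulation spectral analysis.
    Returns the (4L + 4J)-dimensional statistical aggregation."""
    row = np.concatenate([msc.mean(axis=1), msc.std(axis=1),      # Eqs. (40)-(41)
                          msv.mean(axis=1), msv.std(axis=1)])     # Eqs. (42)-(43)
    col = np.concatenate([msc.mean(axis=0), msc.std(axis=0),      # Eqs. (45)-(46)
                          msv.mean(axis=0), msv.std(axis=0)])     # Eqs. (47)-(48)
    return np.concatenate([row, col])                             # Eq. (50): [f_row, f_col]
```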
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
\mu_{MSC-row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)
\sigma_{MSC-row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j, d) - \mu_{MSC-row}^{OSC}(d)\right)^2\right)^{1/2}   (52)
\mu_{MSV-row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)
\sigma_{MSV-row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j, d) - \mu_{MSV-row}^{OSC}(d)\right)^2\right)^{1/2}   (54)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
f_{row}^{OSC} = \left[\mu_{MSC-row}^{OSC}(0),\ \sigma_{MSC-row}^{OSC}(0),\ \mu_{MSV-row}^{OSC}(0),\ \sigma_{MSV-row}^{OSC}(0),\ \ldots,\ \mu_{MSC-row}^{OSC}(D-1),\ \sigma_{MSC-row}^{OSC}(D-1),\ \mu_{MSV-row}^{OSC}(D-1),\ \sigma_{MSV-row}^{OSC}(D-1)\right]^T   (55)
Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
\mu_{MSC-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)
\sigma_{MSC-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j, d) - \mu_{MSC-col}^{OSC}(j)\right)^2\right)^{1/2}   (57)
\mu_{MSV-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)
\sigma_{MSV-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j, d) - \mu_{MSV-col}^{OSC}(j)\right)^2\right)^{1/2}   (59)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{OSC} = \left[\mu_{MSC-col}^{OSC}(0),\ \sigma_{MSC-col}^{OSC}(0),\ \mu_{MSV-col}^{OSC}(0),\ \sigma_{MSV-col}^{OSC}(0),\ \ldots,\ \mu_{MSC-col}^{OSC}(J-1),\ \sigma_{MSC-col}^{OSC}(J-1),\ \mu_{MSV-col}^{OSC}(J-1),\ \sigma_{MSV-col}^{OSC}(J-1)\right]^T   (60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained
f^{OSC} = \left[\left(f_{row}^{OSC}\right)^T, \left(f_{col}^{OSC}\right)^T\right]^T   (61)
In summary the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4D+4J That is the overall
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows
\mu_{MSC-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)
\sigma_{MSC-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d)\right)^2\right)^{1/2}   (63)
\mu_{MSV-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)
\sigma_{MSV-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d)\right)^2\right)^{1/2}   (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
f_{row}^{NASE} = \left[\mu_{MSC-row}^{NASE}(0),\ \sigma_{MSC-row}^{NASE}(0),\ \mu_{MSV-row}^{NASE}(0),\ \sigma_{MSV-row}^{NASE}(0),\ \ldots,\ \mu_{MSC-row}^{NASE}(D-1),\ \sigma_{MSC-row}^{NASE}(D-1),\ \mu_{MSV-row}^{NASE}(D-1),\ \sigma_{MSV-row}^{NASE}(D-1)\right]^T   (66)
Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
\mu_{MSC-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)
\sigma_{MSC-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j)\right)^2\right)^{1/2}   (68)
\mu_{MSV-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)
\sigma_{MSV-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j)\right)^2\right)^{1/2}   (70)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{NASE} = \left[\mu_{MSC-col}^{NASE}(0),\ \sigma_{MSC-col}^{NASE}(0),\ \mu_{MSV-col}^{NASE}(0),\ \sigma_{MSV-col}^{NASE}(0),\ \ldots,\ \mu_{MSC-col}^{NASE}(J-1),\ \sigma_{MSC-col}^{NASE}(J-1),\ \mu_{MSV-col}^{NASE}(J-1),\ \sigma_{MSV-col}^{NASE}(J-1)\right]^T   (71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained
f^{NASE} = \left[\left(f_{row}^{NASE}\right)^T, \left(f_{col}^{NASE}\right)^T\right]^T   (72)
In summary the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4D+4J That is the overall
feature dimension of SMASE is 76+32 = 108
Fig 28 The row-based modulation spectral feature values (the mean and standard deviation are computed along each row of the MSC/MSV matrices, i.e. across the modulation frequency axis)
Fig 29 The column-based modulation spectral feature values (the mean and standard deviation are computed along each column of the MSC/MSV matrices, i.e. across the feature dimension)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n}   (73)
where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector \hat{f}_c
\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C   (74)
where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th
representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals
f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)   (75)
where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
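A minimal sketch of this linear normalization is given below, assuming the feature vectors of all training music signals have been stacked into one matrix; the function name and the guard against constant features are illustrative additions, not part of the thesis.

```python
import numpy as np

def fit_minmax(train_vectors):
    """train_vectors: (num_tracks, M) matrix of feature vectors of all training signals.
    Returns a function implementing the linear normalization of Eq. (74)."""
    f_min = train_vectors.min(axis=0)                      # Eq. (75)
    f_max = train_vectors.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)     # avoid division by zero
    return lambda f: (f - f_min) / span                    # Eq. (74)

# usage: normalize = fit_minmax(X_train); f_hat = normalize(f)
```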
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [28] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)
where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)
where \bar{x} is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
J_F(A) = tr\left((A^T S_W A)^{-1} (A^T S_B A)\right)   (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
x_w = (\Phi \Lambda^{-1/2})^T x   (79)
It can be shown that the whitened within-class scatter matrix
S_W^w = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2}) derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
S_B^w = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2}) contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C−1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A_WLDA is defined as
A_{WLDA} = \Phi \Lambda^{-1/2} \Psi   (80)
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
y = A_{WLDA}^T x   (81)
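The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows; this is a minimal illustration of the steps described above (function name and the small eigenvalue floor are illustrative), not the exact implementation used in the study.

```python
import numpy as np

def whitened_lda(X, y, n_components=None):
    """X: (num_samples, H) training matrix; y: integer class labels.
    Returns the H x h whitened LDA transformation matrix A_WLDA of Eq. (80)."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw, Sb = np.zeros((H, H)), np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                              # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)     # Eq. (77)
    evals, Phi = np.linalg.eigh(Sw)                                # Sw = Phi Lambda Phi^T
    white = Phi @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) # Phi Lambda^{-1/2}
    Sb_w = white.T @ Sb @ white                                    # whitened between-class scatter
    evals_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals_b)[::-1]                              # largest eigenvalues first
    h = n_components or (len(classes) - 1)
    return white @ Psi[:, order[:h]]                               # Eq. (80): A_WLDA

# transformation of a feature vector x (Eq. 81): y_vec = A_WLDA.T @ x
```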
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_WLDA Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n}   (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
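The classification phase of Eqs. (82)-(83) amounts to a nearest-centroid rule, sketched below under the assumption that the training tracks have already been transformed by A_WLDA; the helper names are illustrative.

```python
import numpy as np

def genre_centroids(Y, labels, C):
    """Y: (num_tracks, h) whitened-LDA transformed training vectors.
    Returns the C representative vectors of Eq. (82)."""
    return np.array([Y[labels == c].mean(axis=0) for c in range(C)])

def classify(y_vec, centroids):
    """Eq. (83): return the genre whose centroid has minimum Euclidean distance."""
    return int(np.argmin(np.linalg.norm(centroids - y_vec, axis=1)))
```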
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)
where Pc is the probability of appearance of the c-th music genre and CAc is the
classification accuracy for the c-th music genre
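As a small sketch of Eq. (84), the overall accuracy can be computed directly from a confusion matrix arranged with the true genres along the columns, as in the tables below; the function name and this column convention are assumptions for illustration.

```python
import numpy as np

def overall_accuracy(confusion):
    """confusion: C x C count matrix, true genres along columns, predictions along rows."""
    per_genre_total = confusion.sum(axis=0)          # number of tracks per true genre
    ca = np.diag(confusion) / per_genre_total        # CA_c: per-genre classification accuracy
    p = per_genre_total / per_genre_total.sum()      # P_c: probability of appearance
    return float(np.sum(p * ca))                     # Eq. (84)
```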
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector
Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2, SMOSC2, and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which is different from the row-based case As with the row-based features, the
combined feature vector again gets the best performance Table 34 shows the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA) for each column-based modulation spectral feature vector
Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table 31 and Table 33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32% Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set                     CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy of the MSCs & MSVs and the modulation spectral energy (MSE) for each feature set
Feature Set                     MSCs & MSVs   MSE
SMMFCC1                         77.50         72.02
SMMFCC2                         70.64         69.82
SMMFCC3                         80.38         79.15
SMOSC1                          79.15         77.50
SMOSC2                          68.59         70.51
SMOSC3                          81.34         80.11
SMASE1                          77.78         76.41
SMASE2                          71.74         71.06
SMASE3                          81.21         79.15
SMMFCC1+SMOSC1+SMASE1           84.64         85.08
SMMFCC2+SMOSC2+SMASE2           78.60         79.01
SMMFCC3+SMOSC3+SMASE3           85.32         85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
Chapter 1
Introduction
11 Motivation
With the development of computer networks it becomes more and more popular
to purchase and download digital music from the Internet However a general music
database often contains millions of music tracks Hence it is very difficult to manage
such a large digital music database For this reason it will be helpful to manage a vast
amount of music tracks when they are properly categorized in advance In general the
retail or online music stores often organize their collections of music tracks by
categories such as genre artist and album Usually the category information of a
music track is manually labeled by experienced managers But to determine the
music genre of a music track by experienced managers is a laborious and
time-consuming work Therefore a number of supervised classification techniques
have been developed for automatic classification of unlabeled music tracks
[1-11] Thus in this study we focus on the music genre classification problem which is
defined as genre labeling of music tracks Thus an automatic music genre
classification system plays an important and preliminary role in music information retrieval
systems A new album or music track can be assigned to a proper genre in order to
place it in the appropriate section of an online music store or music database
To classify the music genre of a given music track some discriminating audio
features have to be extracted through content-based analysis of the music signal In
addition many studies try to examine a set of classifiers to improve the classification
performance However the improvement is limited and ineffective In fact employing
effective feature sets is much more useful for improving the classification accuracy than
selecting a specific classifier [12] In the study a novel feature set derived from the
row-based and the column-based modulation spectrum analysis will be proposed for
automatic music genre classification
12 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine
the structure of the taxonomy that music pieces will be classified into However it is
hard to clearly define a universally agreed structure In general exploiting
hierarchical taxonomy structure for music genre classification has some merits (1)
People often prefer to search music by browsing the hierarchical catalogs (2)
Taxonomy structures identify the relationships or dependence between the music
genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification
approach to improve the classification efficiency and accuracy (3) The classification
errors become more acceptable by using taxonomy than direct music genre
classification The coarse-to-fine approach can make the classification errors
concentrate on a given level of the hierarchy
Burred and Lerch [13] have developed a hierarchical taxonomy for music genre
classification as shown in Fig 11 Rather than making a single decision to classify a
given music into one of all music genres (direct approach) the hierarchical approach
makes successive decisions at each branch point of the taxonomy hierarchy
Additionally appropriate and variant features can be employed at each branch point
of the taxonomy Therefore the hierarchical classification approach allows the
managers to trace at which level the classification errors occur frequently Barbedo
and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The
hierarchical structure was constructed in a bottom-up manner instead of the
top-down manner This is because it is easy to merge leaf classes into the same
parent class in the bottom-up structure Therefore the upper layer can be easily
constructed In their experimental results the classification accuracy of the
hierarchical bottom-up approach outperforms the top-down approach by about 3% to
5%
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach can reduce the time-consuming and
expensive task for manual construction of taxonomies It also helps to look for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
ZCR_t = \frac{1}{2}\sum_{n=0}^{N-1} \left| sign(x_t[n]) - sign(x_t[n-1]) \right|
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency Rt below which 85% of the magnitude distribution is concentrated
\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2
where Nt[n] and Nt-1[n] are the normalized magnitude spectrum of the t-th frame and
the (t-1)-th frame respectively
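The following is a minimal numpy sketch of the frame-level timbral features listed above, assuming the frames have already been cut from the signal; the function name, the use of rfft, and the normalization details are illustrative choices rather than the exact implementation of the cited works.

```python
import numpy as np

def frame_timbral_features(x_prev, x_cur):
    """x_prev, x_cur: previous and current time-domain frames of length N.
    Returns (ZCR, centroid, bandwidth, roll-off bin, flux) for the current frame."""
    s = (x_cur > 0).astype(float)                        # sign() as above: 1 for positive, 0 otherwise
    zcr = 0.5 * np.sum(np.abs(np.diff(s)))               # zero-crossing rate
    M = np.abs(np.fft.rfft(x_cur))                       # magnitude spectrum M_t[n]
    M_prev = np.abs(np.fft.rfft(x_prev))
    n = np.arange(1, M.size + 1)
    centroid = np.sum(n * M) / np.sum(M)                 # spectral centroid C_t
    bandwidth = np.sum(((n - centroid) ** 2) * M) / np.sum(M)   # spectral bandwidth SB_t
    cum = np.cumsum(M)
    rolloff = int(np.searchsorted(cum, 0.85 * cum[-1]))  # spectral roll-off R_t
    Nt, Nt_prev = M / np.sum(M), M_prev / np.sum(M_prev)
    flux = np.sum((Nt - Nt_prev) ** 2)                   # spectral flux SF_t
    return zcr, centroid, bandwidth, rolloff, flux
```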
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope (NASE) NASE is defined in the MPEG-7
standard [17] First the audio spectral envelope (ASE) is obtained from the sum of the
power spectrum in each logarithmic subband, converted to the log scale Then each ASE coefficient is
normalized with the root-mean-square (RMS) energy, yielding a normalized version
of the ASE called NASE
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melodyharmony analyzers The main difference is that no fundamental frequency
chord, key or other high-level feature has to be determined in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let xi = [xi[0], xi[1], …, xi[D-1]]T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation are
calculated as follows
\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1
\sigma[d] = \left(\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d] - \mu[d]\right)^2\right)^{1/2}, \quad 0 \le d \le D-1
where T is the number of frames of the input signal This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p × D × D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complex to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
doesn't consider the class-wise differences
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical frame-based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance with/without a decision tree
classifier for a Gaussian classifier, GMM with three components, and LDA In their
experiment the feature vector with GMM classifier and decision tree classifier has the
best accuracy of 82.79%
Xu et al [29] applied SVM to discriminate between pure music and vocal one
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] use some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 93.0% for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can be up to 88.60% when the frame length is 30 s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91% (artificial and natural
sounds), for the second level is 99% (instrumental and automobile, human and
nonhuman), and 95% for the third level (drums, flute and piano, aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low-pass and high-pass filters Unlike DWT,
which recursively decomposes only the low-pass subband, WPT decomposes both
bands at each level
Bergstra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 Each module is described in detail
below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
$\hat{s}[n] = s[n] - a \times s[n-1]$   (1)
where s[n] is the current sample and s[n-1] is the previous sample; a typical
value for a is 0.95
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped by M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
$\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n \le N-1$   (2)
where the Hamming window function w[n] is defined as
$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$   (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N-1$   (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N-1$   (5)
where B is the total number of filters (B is 25 in the study), $I_{b_l}$ and $I_{b_h}$
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter, and $A_i[k]$ is the squared amplitude of $X_i[k]$, that is
$A_i[k] = |X_i[k]|^2$
$I_{b_l}$ and $I_{b_h}$ are given as
$I_{b_l} = \frac{f_{b_l}}{f_s} N, \quad I_{b_h} = \frac{f_{b_h}}{f_s} N$   (6)
where $f_s$ is the sampling frequency, $f_{b_l}$ and $f_{b_h}$ are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\big(E_i(b)\big)\cos\!\left(\frac{\pi}{B}\,l\,(b+0.5)\right), \quad 0 \le l < L$   (7)
where L is the length of the MFCC feature vector (L is 20 in the study)
Therefore the MFCC feature vector can be represented as follows
xMFCC = [MFCC(0), MFCC(1), …, MFCC(L-1)]^T   (8)
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing →
windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter
Filter number   Frequency interval (Hz)
0    (0, 200]
1    (100, 300]
2    (200, 400]
3    (300, 500]
4    (400, 600]
5    (500, 700]
6    (600, 800]
7    (700, 900]
8    (800, 1000]
9    (900, 1149]
10   (1000, 1320]
11   (1149, 1516]
12   (1320, 1741]
13   (1516, 2000]
14   (1741, 2297]
15   (2000, 2639]
16   (2297, 3031]
17   (2639, 3482]
18   (3031, 4000]
19   (3482, 4595]
20   (4000, 5278]
21   (4595, 6063]
22   (5278, 6964]
23   (6063, 8000]
24   (6964, 9190]
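As a concrete illustration of Steps 1-6, the following is a minimal Python sketch of frame-level MFCC extraction. It is not the exact implementation used in this thesis: the Mel bands are approximated by rectangular sums over evenly spaced placeholder edges rather than the triangular filters of Table 21, and the function name and default parameters are assumptions made only for this sketch.

```python
import numpy as np

def mfcc_frames(signal, fs, frame_size=1024, hop=512, n_filters=25, n_mfcc=20, a=0.95):
    """Sketch of Steps 1-6: pre-emphasis, framing, Hamming windowing, FFT,
    subband-energy computation and DCT of the log energies."""
    # Step 1: pre-emphasis  s^[n] = s[n] - a * s[n-1]
    s = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Steps 2-3: framing and Hamming windowing
    window = np.hamming(frame_size)
    frames = [s[i:i + frame_size] * window
              for i in range(0, len(s) - frame_size + 1, hop)]
    # Placeholder band edges (Hz); Table 21 lists the actual triangular bands
    edges_hz = np.linspace(0, 8000, n_filters + 1)
    idx = (edges_hz / fs * frame_size).astype(int)
    mfccs = []
    for frame in frames:
        A = np.abs(np.fft.rfft(frame)) ** 2                 # Step 4: squared spectrum A[k]
        E = np.array([A[idx[b]:idx[b + 1] + 1].sum() + 1e-10
                      for b in range(n_filters)])           # Step 5: subband energies E(b)
        b = np.arange(n_filters)
        dct = np.cos(np.pi / n_filters * np.outer(np.arange(n_mfcc), b + 0.5))
        mfccs.append(dct @ np.log10(E))                     # Step 6: DCT of log energies
    return np.array(mfccs)                                  # shape (num_frames, n_mfcc)
```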
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then applied to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N-1$   (9)
where B is the number of subbands, $I_{b_l}$ and $I_{b_h}$ denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter, and
$A_i[k]$ is the squared amplitude of $X_i[k]$, that is $A_i[k] = |X_i[k]|^2$
$I_{b_l}$ and $I_{b_h}$ are given as
$I_{b_l} = \frac{f_{b_l}}{f_s} N, \quad I_{b_h} = \frac{f_{b_h}}{f_s} N$   (10)
where $f_s$ is the sampling frequency, $f_{b_l}$ and $f_{b_h}$ are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
$Peak(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right)$   (11)
$Valley(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right)$   (12)
where α is a neighborhood factor (α is 0.2 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
$SC(b) = Peak(b) - Valley(b)$   (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0), …, Valley(B-1), SC(0), …, SC(B-1)]^T   (14)
Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave-scale
filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)
Filter number   Frequency interval (Hz)
0    [0, 0]
1    (0, 100]
2    (100, 200]
3    (200, 400]
4    (400, 800]
5    (800, 1600]
6    (1600, 3200]
7    (3200, 6400]
8    (6400, 12800]
9    (12800, 22050)
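The peak/valley computation of Eqs. (11)-(14) can be sketched as follows. The band edges follow Table 22; the helper name, the small constant used to avoid log(0), and the assumption that the caller supplies the magnitude spectrum of one frame are illustrative choices, not part of the original system.

```python
import numpy as np

# Octave-scale band edges in Hz following Table 22 (fs = 44.1 kHz)
OCTAVE_EDGES = [0, 0, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 22050]

def osc_frame(magnitude, fs, n_fft, alpha=0.2):
    """OSC of one frame; `magnitude` is |X[k]| for k = 0 .. n_fft/2."""
    valleys, contrasts = [], []
    for lo, hi in zip(OCTAVE_EDGES[:-1], OCTAVE_EDGES[1:]):
        k_lo, k_hi = int(lo / fs * n_fft), int(hi / fs * n_fft)
        band = np.sort(magnitude[k_lo:k_hi + 1])[::-1]     # M_b,1 >= M_b,2 >= ...
        n = max(1, int(round(alpha * len(band))))          # neighborhood size alpha * N_b
        peak = np.log(band[:n].mean() + 1e-10)             # Eq. (11)
        valley = np.log(band[-n:].mean() + 1e-10)          # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                    # Eq. (13)
    return np.array(valleys + contrasts)                   # Eq. (14): [Valley(0..B-1), SC(0..B-1)]
```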
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N,
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
$P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0,\ k = N/2 \\[4pt] \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases}$   (15)
where $E_w$ is the energy of the Hamming window function w(n) of size $N_w$
$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2$   (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over an
8-octave interval (see Fig 24) The NASE scale filtering
operation can be described as follows (see Table 23)
$ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b < B,\; 0 \le k \le N-1$   (17)
where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of
the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in this study)
$r = 2^{j}\ \text{octaves}, \quad -4 \le j \le 3$   (18)
$I_{b_l}$ and $I_{b_h}$ are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
$I_{b_l} = \frac{f_{b_l}}{f_s} N, \quad I_{b_h} = \frac{f_{b_h}}{f_s} N$   (19)
where $f_s$ is the sampling frequency, $f_{b_l}$ and $f_{b_h}$ are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband
$ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1$   (20)
Each ASE coefficient is then converted to the decibel scale
$ASE_{dB}(b) = 10\log_{10}\big(ASE(b)\big), \quad 0 \le b \le B+1$   (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1$   (22)
where the RMS-norm gain value R is defined as
$R = \sqrt{\sum_{b=0}^{B+1}\big(ASE_{dB}(b)\big)^2}$   (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge, a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge, a coefficient representing
power above hiEdge, and the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T   (24)
Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT →
subband decomposition → normalized audio spectral envelope → NASE)
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2
(16 coefficients for the logarithmic bands between loEdge = 62.5 Hz and hiEdge = 16 kHz,
plus one coefficient below loEdge and one above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter
Filter number   Frequency interval (Hz)
0    (0, 62]
1    (62, 88]
2    (88, 125]
3    (125, 176]
4    (176, 250]
5    (250, 353]
6    (353, 500]
7    (500, 707]
8    (707, 1000]
9    (1000, 1414]
10   (1414, 2000]
11   (2000, 2828]
12   (2828, 4000]
13   (4000, 5656]
14   (5656, 8000]
15   (8000, 11313]
16   (11313, 16000]
17   (16000, 22050]
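A minimal sketch of the NASE computation of Eqs. (15)-(24) for a single frame is given below. The list of subband edges is supplied by the caller (for example the band limits of Table 23); the small constant added before the logarithm and the function signature are assumptions made only for this sketch.

```python
import numpy as np

def nase_frame(X, fs, n_fft, w_energy, band_edges_hz):
    """NASE of one frame; X is the FFT of the Hamming-windowed frame and
    w_energy is the window energy Ew of Eq. (16)."""
    # Eq. (15): normalized power spectrum of the frame
    P = np.abs(X[:n_fft // 2 + 1]) ** 2 / (n_fft * w_energy)
    P[1:n_fft // 2] *= 2.0
    # Eq. (20): ASE = sum of power-spectrum coefficients within each subband
    idx = [int(f / fs * n_fft) for f in band_edges_hz]
    ase = np.array([P[lo:hi + 1].sum() + 1e-12
                    for lo, hi in zip(idx[:-1], idx[1:])])
    # Eqs. (21)-(23): decibel scale and RMS-norm gain normalization
    ase_db = 10.0 * np.log10(ase)
    R = np.sqrt(np.sum(ase_db ** 2))
    return np.concatenate(([R], ase_db / R))               # Eq. (24): [R, NASE(0..B+1)]
```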
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of music signals, we
employ modulation spectral analysis on MFCC, OSC, and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let $MFCC_i[l]$, $0 \le l < L$, be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t\times W + n}[l]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\; 0 \le l < L$   (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} \big|M_t(m, l)\big|, \quad 0 \le m < W,\; 0 \le l < L$   (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
$MSP^{MFCC}(j, l) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{MFCC}(m, l)$   (27)
$MSV^{MFCC}(j, l) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{MFCC}(m, l)$   (28)
where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, $0 \le j < J$
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)$   (29)
As a result all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2×20×8 = 320
Fig 25 The flowchart for extracting MMFCC
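The modulation spectral analysis of Eqs. (25)-(29) can be sketched as follows for any frame-level feature trajectory (MFCC here, and OSC or NASE in the next two subsections). The subband edges are the modulation frequency indices of Table 24; the handling of tracks shorter than one texture window and the function name are illustrative assumptions.

```python
import numpy as np

def modulation_contrast(trajectory, W=512, J=8):
    """`trajectory` is a (num_frames, L) array of frame-level feature values.
    Returns the L x J MSC and MSV matrices of Eqs. (27)-(29)."""
    num_frames, L = trajectory.shape
    hop = W // 2                                     # 50% overlap between texture windows
    # Eqs. (25)-(26): magnitude modulation spectrogram averaged over texture windows
    spectra = [np.abs(np.fft.fft(trajectory[s:s + W, :], axis=0))
               for s in range(0, num_frames - W + 1, hop)]
    M = np.mean(spectra, axis=0)                     # shape (W, L)
    # Modulation frequency index edges of the J log-spaced subbands (Table 24)
    edges = [0] + [2 ** j for j in range(1, J)] + [W // 2]
    MSC, MSV = np.zeros((L, J)), np.zeros((L, J))
    for j in range(J):
        band = M[edges[j]:edges[j + 1], :]           # modulation bins of subband j
        MSP = band.max(axis=0)                       # Eq. (27)
        MSV[:, j] = band.min(axis=0)                 # Eq. (28)
        MSC[:, j] = MSP - MSV[:, j]                  # Eq. (29)
    return MSC, MSV
```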
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let $OSC_i[d]$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t\times W + n}[d]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\; 0 \le d < D$   (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} \big|M_t(m, d)\big|, \quad 0 \le m < W,\; 0 \le d < D$   (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
$MSP^{OSC}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{OSC}(m, d)$   (32)
$MSV^{OSC}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{OSC}(m, d)$   (33)
where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, $0 \le j < J$
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)$   (34)
As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MOSC is 2×20×8 = 320
Fig 26 The flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let $NASE_i[d]$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t\times W + n}[d]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\; 0 \le d < D$   (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} \big|M_t(m, d)\big|, \quad 0 \le m < W,\; 0 \le d < D$   (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24)
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
$MSP^{NASE}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{NASE}(m, d)$   (37)
$MSV^{NASE}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \bar{M}^{NASE}(m, d)$   (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)$   (39)
As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MASE is 2×19×8 = 304
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction →
DFT of each feature trajectory over texture windows → averaged modulation spectrum →
contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0    [0, 2)      [0, 0.33)
1    [2, 4)      [0.33, 0.66)
2    [4, 8)      [0.66, 1.32)
3    [8, 16)     [1.32, 2.64)
4    [16, 32)    [2.64, 5.28)
5    [32, 64)    [5.28, 10.56)
6    [64, 128)   [10.56, 21.12)
7    [128, 256)  [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectral/cepstral feature values (see Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l)$   (40)
$\sigma_{MSC\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l)\big)^2\right)^{1/2}$   (41)
$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l)$   (42)
$\sigma_{MSV\text{-}row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l)\big)^2\right)^{1/2}$   (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
$\mathbf{f}_{row}^{MFCC} = \big[\mu_{MSC\text{-}row}^{MFCC}(0),\ \sigma_{MSC\text{-}row}^{MFCC}(0),\ \mu_{MSV\text{-}row}^{MFCC}(0),\ \sigma_{MSV\text{-}row}^{MFCC}(0),\ \ldots,$
$\qquad \mu_{MSC\text{-}row}^{MFCC}(L-1),\ \sigma_{MSC\text{-}row}^{MFCC}(L-1),\ \mu_{MSV\text{-}row}^{MFCC}(L-1),\ \sigma_{MSV\text{-}row}^{MFCC}(L-1)\big]^T$   (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l)$   (45)
$\sigma_{MSC\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\big(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j)\big)^2\right)^{1/2}$   (46)
$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l)$   (47)
$\sigma_{MSV\text{-}col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\big(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j)\big)^2\right)^{1/2}$   (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
$\mathbf{f}_{col}^{MFCC} = \big[\mu_{MSC\text{-}col}^{MFCC}(0),\ \sigma_{MSC\text{-}col}^{MFCC}(0),\ \mu_{MSV\text{-}col}^{MFCC}(0),\ \sigma_{MSV\text{-}col}^{MFCC}(0),\ \ldots,$
$\qquad \mu_{MSC\text{-}col}^{MFCC}(J-1),\ \sigma_{MSC\text{-}col}^{MFCC}(J-1),\ \mu_{MSV\text{-}col}^{MFCC}(J-1),\ \sigma_{MSV\text{-}col}^{MFCC}(J-1)\big]^T$   (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained
$\mathbf{f}^{MFCC} = \big[(\mathbf{f}_{row}^{MFCC})^T\ (\mathbf{f}_{col}^{MFCC})^T\big]^T$   (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4×8 = 32 Combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector results in a feature vector of length 4L+4J That is, the overall
feature dimension of SMMFCC is 80+32 = 112
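A compact sketch of the row-based and column-based aggregation of Eqs. (40)-(50) is shown below. The concatenation order of the entries differs from the interleaved ordering written in Eq. (44), which does not affect the subsequent LDA and nearest-centroid stages; the function name is an assumption.

```python
import numpy as np

def aggregate_msc_msv(MSC, MSV):
    """MSC and MSV are L x J matrices; the result has 4L row-based plus
    4J column-based feature values (Eqs. (40)-(50))."""
    def mean_std(mat, axis):
        return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])
    f_row = np.concatenate([mean_std(MSC, axis=1), mean_std(MSV, axis=1)])   # 4L values
    f_col = np.concatenate([mean_std(MSC, axis=0), mean_std(MSV, axis=0)])   # 4J values
    return np.concatenate([f_row, f_col])                                    # Eq. (50)
```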
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows
$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d)$   (51)
$\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d)\big)^2\right)^{1/2}$   (52)
$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d)$   (53)
$\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d)\big)^2\right)^{1/2}$   (54)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
$\mathbf{f}_{row}^{OSC} = \big[\mu_{MSC\text{-}row}^{OSC}(0),\ \sigma_{MSC\text{-}row}^{OSC}(0),\ \mu_{MSV\text{-}row}^{OSC}(0),\ \sigma_{MSV\text{-}row}^{OSC}(0),\ \ldots,$
$\qquad \mu_{MSC\text{-}row}^{OSC}(D-1),\ \sigma_{MSC\text{-}row}^{OSC}(D-1),\ \mu_{MSV\text{-}row}^{OSC}(D-1),\ \sigma_{MSV\text{-}row}^{OSC}(D-1)\big]^T$   (55)
Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d)$   (56)
$\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j)\big)^2\right)^{1/2}$   (57)
$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d)$   (58)
$\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j)\big)^2\right)^{1/2}$   (59)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
$\mathbf{f}_{col}^{OSC} = \big[\mu_{MSC\text{-}col}^{OSC}(0),\ \sigma_{MSC\text{-}col}^{OSC}(0),\ \mu_{MSV\text{-}col}^{OSC}(0),\ \sigma_{MSV\text{-}col}^{OSC}(0),\ \ldots,$
$\qquad \mu_{MSC\text{-}col}^{OSC}(J-1),\ \sigma_{MSC\text{-}col}^{OSC}(J-1),\ \mu_{MSV\text{-}col}^{OSC}(J-1),\ \sigma_{MSV\text{-}col}^{OSC}(J-1)\big]^T$   (60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained
$\mathbf{f}^{OSC} = \big[(\mathbf{f}_{row}^{OSC})^T\ (\mathbf{f}_{col}^{OSC})^T\big]^T$   (61)
In summary the row-based MSCs (or MSVs) is of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4×8 = 32 Combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector results in a feature vector of length 4D+4J That is, the overall
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows
$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d)$   (62)
$\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d)\big)^2\right)^{1/2}$   (63)
$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d)$   (64)
$\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d)\big)^2\right)^{1/2}$   (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
$\mathbf{f}_{row}^{NASE} = \big[\mu_{MSC\text{-}row}^{NASE}(0),\ \sigma_{MSC\text{-}row}^{NASE}(0),\ \mu_{MSV\text{-}row}^{NASE}(0),\ \sigma_{MSV\text{-}row}^{NASE}(0),\ \ldots,$
$\qquad \mu_{MSC\text{-}row}^{NASE}(D-1),\ \sigma_{MSC\text{-}row}^{NASE}(D-1),\ \mu_{MSV\text{-}row}^{NASE}(D-1),\ \sigma_{MSV\text{-}row}^{NASE}(D-1)\big]^T$   (66)
Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d)$   (67)
$\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j)\big)^2\right)^{1/2}$   (68)
$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d)$   (69)
$\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j)\big)^2\right)^{1/2}$   (70)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
$\mathbf{f}_{col}^{NASE} = \big[\mu_{MSC\text{-}col}^{NASE}(0),\ \sigma_{MSC\text{-}col}^{NASE}(0),\ \mu_{MSV\text{-}col}^{NASE}(0),\ \sigma_{MSV\text{-}col}^{NASE}(0),\ \ldots,$
$\qquad \mu_{MSC\text{-}col}^{NASE}(J-1),\ \sigma_{MSC\text{-}col}^{NASE}(J-1),\ \mu_{MSV\text{-}col}^{NASE}(J-1),\ \sigma_{MSV\text{-}col}^{NASE}(J-1)\big]^T$   (71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained
$\mathbf{f}^{NASE} = \big[(\mathbf{f}_{row}^{NASE})^T\ (\mathbf{f}_{col}^{NASE})^T\big]^T$   (72)
In summary the row-based MSCs (or MSVs) is of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) is of size 4J = 4×8 = 32 Combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector results in a feature vector of length 4D+4J That is, the overall
feature dimension of SMASE is 76+32 = 108
Fig 28 The row-based modulation spectral feature values: the mean and standard deviation
are computed along each row of the MSC/MSV matrix, i.e. over the modulation subbands of
one feature dimension
Fig 29 The column-based modulation spectral feature values: the mean and standard
deviation are computed along each column of the MSC/MSV matrix, i.e. over the feature
dimensions of one modulation subband
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{f}_{c,n}$   (73)
where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th
music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges of different feature values may be different, a linear normalization is
applied to get the normalized feature vector $\hat{\mathbf{f}}_c$
$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C$   (74)
where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th
representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the
maximum and minimum of the m-th feature values of all training music signals
$f_{\max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \quad f_{\min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)$   (75)
where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
$S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T$   (76)
where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class
c, C is the total number of music classes, and $N_c$ is the number of training vectors
labeled as class c The between-class scatter matrix is given by
$S_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T$   (77)
where $\bar{\mathbf{x}}$ is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter
$J_F(\mathbf{A}) = tr\big((\mathbf{A}^T S_W \mathbf{A})^{-1}(\mathbf{A}^T S_B \mathbf{A})\big)$   (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of $S_W$ are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of $S_W$ and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus $S_W\Phi = \Phi\Lambda$ Each training vector x is then
whitening transformed by $\Phi\Lambda^{-1/2}$
$\mathbf{x}_w = (\Phi\Lambda^{-1/2})^T \mathbf{x}$   (79)
It can be shown that the whitened within-class scatter matrix
$S_W^w = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2})$ derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
$S_B^w = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2})$ contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of $S_B^w$
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C-1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
$\mathbf{A}_{WLDA}$ is defined as
$\mathbf{A}_{WLDA} = \Phi\Lambda^{-1/2}\Psi$   (80)
$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by
$\mathbf{y} = \mathbf{A}_{WLDA}^T \mathbf{x}$   (81)
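The whitened LDA transformation of Eqs. (76)-(80) can be sketched as follows; the small constant added to the eigenvalues of SW is an assumption made for numerical stability and is not part of the original derivation.

```python
import numpy as np

def whitened_lda(X, labels):
    """X is an (N, H) matrix of training feature vectors, labels an array of
    class indices.  Returns the H x (C-1) matrix A_WLDA of Eq. (80)."""
    classes = np.unique(labels)
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw, Sb = np.zeros((H, H)), np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                            # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)   # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                                # Sw Phi = Phi Lambda
    W = Phi @ np.diag(1.0 / np.sqrt(lam + 1e-10))                # whitening: Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                                          # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:len(classes) - 1]]     # (C-1) leading eigenvectors
    return W @ Psi                                               # Eq. (80): A_WLDA
```

Each feature vector x is then reduced with y = A_WLDA^T x (or X @ whitened_lda(X, labels) for a whole matrix), as in Eq. (81).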
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{y}_{c,n}$   (82)
where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the
c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c)$   (83)
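The classification phase then reduces to the nearest-centroid rule of Eqs. (82)-(83); a minimal sketch, with the centroid computation shown as a comment, is given below (the variable names are assumptions).

```python
import numpy as np

def classify(y, centroids):
    """`centroids` is a (C, h) array whose c-th row is the representative
    vector of Eq. (82); returns the index s of Eq. (83)."""
    distances = np.linalg.norm(centroids - y, axis=1)   # Euclidean distance to each genre
    return int(np.argmin(distances))

# Training side (Eq. (82)), assuming Y_train holds the transformed training
# vectors and labels their genre indices:
#   centroids = np.array([Y_train[labels == c].mean(axis=0) for c in range(C)])
```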
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop, and 122/122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
$CA = \sum_{1 \le c \le C} P_c \cdot CA_c$   (84)
where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the
classification accuracy for the c-th music genre
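Equation (84) is simply the per-genre accuracies weighted by each genre's share of the test set. A small sketch is shown below; the example values are the per-genre accuracies of Table 36(d) together with the test-set track counts listed above, which reproduce the reported 85.32%.

```python
import numpy as np

def overall_accuracy(per_class_accuracy, class_counts):
    p = np.asarray(class_counts) / np.sum(class_counts)       # P_c: probability of appearance
    return float(np.sum(p * np.asarray(per_class_accuracy)))  # Eq. (84)

# e.g. overall_accuracy([0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623],
#                       [320, 114, 26, 45, 102, 122])  ->  about 0.853
```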
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1, SMOSC1, and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and the combined feature vector performs the best Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors
Feature Set                       CA
SMMFCC1                           77.50%
SMOSC1                            79.15%
SMASE1                            77.78%
SMMFCC1+SMOSC1+SMASE1             84.64%
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2, SMOSC2, and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which is different from the row-based case As with the row-based case, the combined
feature vector gets the best performance Table 34 shows the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors
Feature Set                       CA
SMMFCC2                           70.64%
SMOSC2                            68.59%
SMASE2                            71.74%
SMMFCC2+SMOSC2+SMASE2             78.60%
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3,
SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC,
OSC, and NASE Comparing this table with Table 31 and Table 33, we can see that
the combined feature vector gets a better classification performance than each
individual row-based or column-based feature vector In particular, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32% Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set                       CA
SMMFCC3                           80.38%
SMOSC3                            81.34%
SMASE3                            81.21%
SMMFCC3+SMOSC3+SMASE3             85.32%
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy of the MSC&MSV and the energy for each feature value
Feature Set                       MSCs & MSVs    MSE
SMMFCC1                           77.50%         72.02%
SMMFCC2                           70.64%         69.82%
SMMFCC3                           80.38%         79.15%
SMOSC1                            79.15%         77.50%
SMOSC2                            68.59%         70.51%
SMOSC3                            81.34%         80.11%
SMASE1                            77.78%         76.41%
SMASE2                            71.74%         71.06%
SMASE3                            81.21%         79.15%
SMMFCC1+SMOSC1+SMASE1             84.64%         85.08%
SMMFCC2+SMOSC2+SMASE2             78.60%         79.01%
SMMFCC3+SMOSC3+SMASE3             85.32%         85.19%
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC, OSC, and NASE are combined together, the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music
Genre Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of New Music Research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch histogram in audio and
symbolic music information retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp 117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on Signal Processing Vol 52 No 10
pp 3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire "A decision-theoretic generalization of
online learning and an application to boosting" Journal of Computer and System
Sciences 55(1) 119-139 1997
selecting a specific classifier [12] In the study a novel feature set derived from the
row-based and the column-based modulation spectrum analysis will be proposed for
automatic music genre classification
12 Review of Music Genre Classification Systems
The fundamental problem of a music genre classification system is to determine
the structure of the taxonomy that music pieces will be classified into However it is
hard to clearly define a universally agreed structure In general exploiting
hierarchical taxonomy structure for music genre classification has some merits (1)
People often prefer to search music by browsing the hierarchical catalogs (2)
Taxonomy structures identify the relationships or dependence between the music
genres Thus hierarchical taxonomy structures provide a coarse-to-fine classification
approach to improve the classification efficiency and accuracy (3) The classification
errors become more acceptable by using taxonomy than direct music genre
classification The coarse-to-fine approach can make the classification errors
concentrate on a given level of the hierarchy
Burred and Lerch [13] have developed a hierarchical taxonomy for music genre
classification as shown in Fig 11 Rather than making a single decision to classify a
given music into one of all music genres (direct approach) the hierarchical approach
makes successive decisions at each branch point of the taxonomy hierarchy
Additionally appropriate and variant features can be employed at each branch point
of the taxonomy Therefore the hierarchical classification approach allows the
managers to trace at which level the classification errors occur frequently Barbedo
and Lopes [14] have also defined a hierarchical taxonomy as shown in Fig 12 The
hierarchical structure was constructed in a bottom-up manner instead of the
top-down structure This is because it is easy to merge leaf classes into the same
parent class in the bottom-up structure Therefore the upper layer can be easily
constructed In their experiment result the classification accuracy which used the
hierarchical bottom-up approach outperforms the top-down approach by about 3% -
5%
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach can reduce the time-consuming and
expensive task for manual construction of taxonomies It also helps to look for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
$ZCR_t = \frac{1}{2}\sum_{n=0}^{N-1} \big|\,sign(x_t[n]) - sign(x_t[n-1])\,\big|$
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
$C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
$SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency $R_t$ below which 85% of the magnitude distribution is concentrated
$\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]$
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
$SF_t = \sum_{k=0}^{N-1} \big(N_t[k] - N_{t-1}[k]\big)^2$
where Nt[n] and Nt-1[n] are the normalized magnitude spectrum of the t-th frame and
the (t-1)-th frame respectively
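For illustration, the frame-level timbral features above can be computed together as in the following sketch; it assumes the caller keeps the previous frame's normalized spectrum for the flux term, and the function name and small constants are placeholders rather than part of the cited systems.

```python
import numpy as np

def timbral_features(frame, prev_norm_spectrum):
    """Zero-crossing rate, spectral centroid, bandwidth, roll-off and flux
    of one time-domain frame."""
    M = np.abs(np.fft.rfft(frame))                           # magnitude spectrum M_t[n]
    n = np.arange(1, len(M) + 1)
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frame))))      # zero-crossing count of the frame
    centroid = np.sum(n * M) / (np.sum(M) + 1e-10)
    bandwidth = np.sum((n - centroid) ** 2 * M) / (np.sum(M) + 1e-10)
    rolloff = n[np.searchsorted(np.cumsum(M), 0.85 * np.sum(M))]   # 85% roll-off bin
    norm_spectrum = M / (np.sum(M) + 1e-10)
    flux = np.sum((norm_spectrum - prev_norm_spectrum) ** 2)
    return zcr, centroid, bandwidth, rolloff, flux, norm_spectrum
```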
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope (NASE) NASE was defined in the MPEG-7
standard [17] First the audio spectral envelope (ASE) is obtained from the sum of the
log power spectrum in each logarithmic subband Then each ASE coefficient is
normalized with the root mean square (RMS) energy, yielding a normalized version
of the ASE called NASE
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers The main difference is that no fundamental frequency
chord key or other high-level feature has to be determined in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let xi = [xi[0], xi[1], …, xi[D-1]]T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation are
calculated as follows

$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1$$

$$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d] - \mu[d]\right)^2\right]^{1/2}, \quad 0 \le d \le D-1$$
where T is the number of frames of the input signal This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals
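A minimal sketch of this integration, assuming the short-term feature vectors of a track are stacked into a T×D numpy array:

```python
import numpy as np

def mean_std_integration(X):
    # X: T x D matrix of short-term feature vectors (one row per frame).
    # Returns the 2D-dimensional long-term feature [mu[0..D-1], sigma[0..D-1]].
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return np.concatenate([mu, sigma])
```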
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p times D times D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complex to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
does not take the class-wise differences into account
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir Orchestra Piano and String Quartet In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quartet and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical frame-based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum & Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance with/without a decision tree
classifier of a Gaussian classifier GMM with three components and LDA In their
experiment the feature vector with GMM classifier and decision tree classifier has the
best accuracy of 82.79%
Xu et al [29] applied SVM to discriminate between pure music and vocal music
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] use some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 93.0% for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift out the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can reach 88.60% when the frame length is 30 s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91% (artificial and natural
sounds), 99% for the second level (instrumental and automobile human and
nonhuman) and 95% for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
which recursively decomposes only the low-pass subband, WPT decomposes both
subbands at each level
Bergstra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
$$\hat{s}[n] = s[n] - a \times s[n-1] \qquad (1)$$
where s[n] is the current sample and s[n−1] is the previous sample; a typical
value for a is 0.95
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames overlaps by M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
$$\tilde{s}_i[n] = \hat{s}_i[n]\, w[n], \quad 0 \le n \le N-1 \qquad (2)$$
where the Hamming window function w[n] is defined as
$$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (3)$$
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
$$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j2\pi kn/N}, \quad 0 \le k \le N-1 \qquad (4)$$
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
$$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N-1 \qquad (5)$$

where B is the total number of filters (B is 25 in this study), Ibl and Ibh
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter, and Ai[k] is the squared amplitude of Xi[k], that is,
Ai[k] = |Xi[k]|^2. Ibl and Ibh are given as

$$I_{b_l} = \frac{f_{b_l}}{f_s / N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s / N} \qquad (6)$$
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
$$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\left(1 + E_i(b)\right)\cos\!\left(\frac{\pi\, l\,(b+0.5)}{B}\right), \quad 0 \le l < L \qquad (7)$$
where L is the length of MFCC feature vector (L is 20 in the study)
16
Therefore the MFCC feature vector can be represented as follows
xMFCC = [MFCC(0), MFCC(1), …, MFCC(L-1)]T (8)
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter

Filter number : Frequency interval (Hz)
0: (0, 200]    1: (100, 300]    2: (200, 400]    3: (300, 500]    4: (400, 600]
5: (500, 700]    6: (600, 800]    7: (700, 900]    8: (800, 1000]    9: (900, 1149]
10: (1000, 1320]    11: (1149, 1516]    12: (1320, 1741]    13: (1516, 2000]    14: (1741, 2297]
15: (2000, 2639]    16: (2297, 3031]    17: (2639, 3482]    18: (3031, 4000]    19: (3482, 4595]
20: (4000, 5278]    21: (4595, 6063]    22: (5278, 6964]    23: (6063, 8000]    24: (6964, 9190]
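The following is an illustrative numpy sketch of Steps 1-6 for a single frame, assuming the band edges of Table 21 are supplied as (low, high) pairs. It is not the reference implementation: band energies are accumulated with the simple rectangular summation written in Eq. (5), so a triangular weighting would be a straightforward refinement.

```python
import numpy as np

def mfcc_frame(s_frame, band_edges_hz, fs=44100, a=0.95, L=20):
    """Sketch of MFCC extraction for one frame.
    band_edges_hz: list of (f_low, f_high) tuples, e.g. the 25 bands of Table 21."""
    N = len(s_frame)
    # Step 1: pre-emphasis, Eq. (1); the first sample has no predecessor.
    s_hat = np.append(s_frame[0], s_frame[1:] - a * s_frame[:-1])
    # Step 3: Hamming window, Eqs. (2)-(3)
    s_tilde = s_hat * np.hamming(N)
    # Step 4: FFT and squared amplitude A_i[k], Eq. (4)
    A = np.abs(np.fft.fft(s_tilde)) ** 2
    # Step 5: band-pass filtering, Eqs. (5)-(6)
    E = np.zeros(len(band_edges_hz))
    for b, (f_lo, f_hi) in enumerate(band_edges_hz):
        k_lo = int(round(f_lo / (fs / N)))
        k_hi = int(round(f_hi / (fs / N)))
        E[b] = np.sum(A[k_lo:k_hi + 1])
    # Step 6: DCT of log energies, Eq. (7)
    B = len(E)
    b_idx = np.arange(B)
    mfcc = np.array([np.sum(np.log10(1.0 + E) *
                            np.cos(np.pi * l * (b_idx + 0.5) / B))
                     for l in range(L)])
    return mfcc
```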
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys to the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
18
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then applied to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
$$E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N-1 \qquad (9)$$

where B is the number of subbands, Ibl and Ibh denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter, and
Ai[k] is the squared amplitude of Xi[k], that is, Ai[k] = |Xi[k]|^2.
Ibl and Ibh are given as

$$I_{b_l} = \frac{f_{b_l}}{f_s / N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s / N} \qquad (10)$$
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb,1, Mb,2, …, Mb,Nb) denote the magnitude spectrum within the b-th
subband, where Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order, that is, Mb,1 ≥ Mb,2 ≥ … ≥ Mb,Nb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
$$Peak(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right) \qquad (11)$$

$$Valley(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right) \qquad (12)$$

where α is a neighborhood factor (α is 0.2 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley

$$SC(b) = Peak(b) - Valley(b) \qquad (13)$$
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0), …, Valley(B-1), SC(0), …, SC(B-1)]T (14)
Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave scale filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number : Frequency interval (Hz)
0: [0, 0]    1: (0, 100]    2: (100, 200]    3: (200, 400]    4: (400, 800]
5: (800, 1600]    6: (1600, 3200]    7: (3200, 6400]    8: (6400, 12800]    9: (12800, 22050)
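A hedged numpy sketch of the OSC computation for one frame, assuming the octave bands of Table 22. The small constant added inside the logarithm is only a numerical guard against empty or silent bands and is not part of the definition in Eqs. (11)-(13).

```python
import numpy as np

OCTAVE_BANDS_HZ = [(0, 0), (0, 100), (100, 200), (200, 400), (400, 800),
                   (800, 1600), (1600, 3200), (3200, 6400), (6400, 12800),
                   (12800, 22050)]   # Table 2.2

def osc_frame(x_frame, fs=44100, alpha=0.2, bands=OCTAVE_BANDS_HZ, eps=1e-12):
    """Octave-based spectral contrast for one frame (Eqs. 9-14)."""
    N = len(x_frame)
    spectrum = np.abs(np.fft.fft(x_frame))
    valleys, contrasts = [], []
    for f_lo, f_hi in bands:
        k_lo = int(round(f_lo / (fs / N)))
        k_hi = int(round(f_hi / (fs / N)))
        M = np.sort(spectrum[k_lo:k_hi + 1])[::-1]        # descending magnitudes
        n_alpha = max(1, int(round(alpha * len(M))))
        peak = np.log(np.mean(M[:n_alpha]) + eps)          # Eq. (11)
        valley = np.log(np.mean(M[-n_alpha:]) + eps)       # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                    # Eq. (13)
    return np.array(valleys + contrasts)                   # x_OSC, Eq. (14)
```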
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follows
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N,
where N is the size of the FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
$$P(k) = \begin{cases} \dfrac{1}{N \cdot E_w}\,|X(k)|^2, & k = 0,\ k = N/2 \\[2mm] \dfrac{2}{N \cdot E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases} \qquad (15)$$

where Ew is the energy of the Hamming window function w(n) of size Nw

$$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2 \qquad (16)$$
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge"), an interval of 8
octaves (see Fig 24) The NASE scale filtering operation can be described as
follows (see Table 23)

$$ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B,\; 0 \le k \le N-1 \qquad (17)$$

where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of
the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in this study)

$$r = 2^{j} \text{ octaves}, \quad -4 \le j \le 3 \qquad (18)$$

Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter, given as

$$I_{b_l} = \frac{f_{b_l}}{f_s / N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s / N} \qquad (19)$$
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband
$$ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1 \qquad (20)$$

Each ASE coefficient is then converted to the decibel scale

$$ASE_{dB}(b) = 10\log_{10}\!\left(ASE(b)\right), \quad 0 \le b \le B+1 \qquad (21)$$

The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R

$$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 \qquad (22)$$

where the RMS-norm gain value R is defined as

$$R = \sqrt{\sum_{b=0}^{B+1}\left(ASE_{dB}(b)\right)^2} \qquad (23)$$
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R, NASE(0), NASE(1), …, NASE(B+1)]T (24)
Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 coefficients between loEdge and hiEdge, and one coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number : Frequency interval (Hz)
0: (0, 62]    1: (62, 88]    2: (88, 125]    3: (125, 176]    4: (176, 250]    5: (250, 353]
6: (353, 500]    7: (500, 707]    8: (707, 1000]    9: (1000, 1414]    10: (1414, 2000]    11: (2000, 2828]
12: (2828, 4000]    13: (4000, 5656]    14: (5656, 8000]    15: (8000, 11313]    16: (11313, 16000]    17: (16000, 22050]
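An illustrative numpy sketch of the NASE computation for one frame, assuming the subbands of Table 23. The eps term is only a numerical guard against empty bands and is not part of the MPEG-7 definition.

```python
import numpy as np

NASE_BANDS_HZ = [(0, 62), (62, 88), (88, 125), (125, 176), (176, 250), (250, 353),
                 (353, 500), (500, 707), (707, 1000), (1000, 1414), (1414, 2000),
                 (2000, 2828), (2828, 4000), (4000, 5656), (5656, 8000),
                 (8000, 11313), (11313, 16000), (16000, 22050)]   # Table 2.3

def nase_frame(x_frame, fs=44100, bands=NASE_BANDS_HZ, eps=1e-12):
    """Normalized audio spectral envelope for one frame (Eqs. 15-24)."""
    N = len(x_frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                          # Eq. (16)
    X = np.fft.fft(x_frame * w)
    P = (np.abs(X) ** 2) / (N * Ew)              # Eq. (15), k = 0 and k = N/2
    P[1:N // 2] *= 2.0                           # Eq. (15), 0 < k < N/2
    ase = np.zeros(len(bands))
    for b, (f_lo, f_hi) in enumerate(bands):     # Eqs. (17), (20)
        k_lo = int(round(f_lo / (fs / N)))
        k_hi = int(round(f_hi / (fs / N)))
        ase[b] = np.sum(P[k_lo:k_hi + 1])
    ase_db = 10.0 * np.log10(ase + eps)          # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))             # Eq. (23)
    nase = ase_db / R                            # Eq. (22)
    return np.concatenate(([R], nase))           # x_NASE, Eq. (24)
```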
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals, we
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCCi(l), 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W

$$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t\cdot W/2+n}(l)\, e^{-j2\pi mn/W}, \quad 0 \le m < W,\; 0 \le l < L \qquad (25)$$

where Mt(m, l) is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index and l is the MFCC coefficient index. In
this study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows. The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows

$$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, l)\right|, \quad 0 \le m < W,\; 0 \le l < L \qquad (26)$$

where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In this study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated

$$MSP^{MFCC}(j, l) = \max_{\Phi_{j,l}\, \le\, m\, <\, \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \qquad (27)$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_{j,l}\, \le\, m\, <\, \Phi_{j,h}} \bar{M}^{MFCC}(m, l) \qquad (28)$$

where Φj,l and Φj,h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution

$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \qquad (29)$$

As a result all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2×20×8 = 320
Fig 25 The flowchart for extracting MMFCC
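A sketch of the modulation spectral analysis, assuming the per-frame feature trajectories (MFCC, OSC or NASE) are stacked into a T_frames×L numpy matrix. It returns the MSC and MSV matrices with modulation subbands along the rows, which is simply the transposed arrangement of the L×J matrix described above.

```python
import numpy as np

MOD_SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32),
                (32, 64), (64, 128), (128, 256)]   # Table 2.4 (index ranges)

def modulation_contrast_valley(F, W=512):
    """F: T_frames x L matrix of per-frame feature values.
    Returns (MSC, MSV), each J x L, following Eqs. (25)-(29).
    Texture windows of length W with 50% overlap; T_frames >= W is assumed."""
    hop = W // 2
    n_tex = (F.shape[0] - W) // hop + 1
    # Eqs. (25)-(26): average magnitude modulation spectrogram over texture windows
    M_avg = np.zeros((W, F.shape[1]))
    for t in range(n_tex):
        segment = F[t * hop: t * hop + W, :]
        M_avg += np.abs(np.fft.fft(segment, axis=0))
    M_avg /= n_tex
    # Eqs. (27)-(29): peak, valley and contrast in each modulation subband
    J, L = len(MOD_SUBBANDS), F.shape[1]
    MSC = np.zeros((J, L))
    MSV = np.zeros((J, L))
    for j, (m_lo, m_hi) in enumerate(MOD_SUBBANDS):
        band = M_avg[m_lo:m_hi, :]
        MSV[j] = band.min(axis=0)
        MSC[j] = band.max(axis=0) - MSV[j]
    return MSC, MSV
```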
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSCi(d), 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W

$$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t\cdot W/2+n}(d)\, e^{-j2\pi mn/W}, \quad 0 \le m < W,\; 0 \le d < D \qquad (30)$$

where Mt(m, d) is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index and d is the OSC coefficient index. In this
study W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows. The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows

$$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, d)\right|, \quad 0 \le m < W,\; 0 \le d < D \qquad (31)$$

where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In this study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated

$$MSP^{OSC}(j, d) = \max_{\Phi_{j,l}\, \le\, m\, <\, \Phi_{j,h}} \bar{M}^{OSC}(m, d) \qquad (32)$$

$$MSV^{OSC}(j, d) = \min_{\Phi_{j,l}\, \le\, m\, <\, \Phi_{j,h}} \bar{M}^{OSC}(m, d) \qquad (33)$$

where Φj,l and Φj,h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution

$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \qquad (34)$$

As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MOSC is 2×20×8 = 320
Fig 26 The flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASEi(d), 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W

$$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t\cdot W/2+n}(d)\, e^{-j2\pi mn/W}, \quad 0 \le m < W,\; 0 \le d < D \qquad (35)$$

where Mt(m, d) is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index and d is the NASE coefficient index. In
this study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows. The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows

$$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, d)\right|, \quad 0 \le m < W,\; 0 \le d < D \qquad (36)$$

where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24)
In this study the number of modulation subbands is 8 (J = 8) For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated

$$MSP^{NASE}(j, d) = \max_{\Phi_{j,l}\, \le\, m\, <\, \Phi_{j,h}} \bar{M}^{NASE}(m, d) \qquad (37)$$

$$MSV^{NASE}(j, d) = \min_{\Phi_{j,l}\, \le\, m\, <\, \Phi_{j,h}} \bar{M}^{NASE}(m, d) \qquad (38)$$

where Φj,l and Φj,h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution

$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \qquad (39)$$

As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MASE is 2×19×8 = 304
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT of each feature trajectory → windowing/averaging of the modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband

Filter number : Modulation frequency index range : Modulation frequency interval (Hz)
0: [0, 2) : [0, 0.33)
1: [2, 4) : [0.33, 0.66)
2: [4, 8) : [0.66, 1.32)
3: [8, 16) : [1.32, 2.64)
4: [16, 32) : [2.64, 5.28)
5: [32, 64) : [5.28, 10.56)
6: [64, 128) : [10.56, 21.12)
7: [128, 256) : [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband across different spectral/cepstral feature values (see Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows

$$\mu_{MSC,row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \qquad (40)$$

$$\sigma_{MSC,row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC,row}^{MFCC}(l)\right)^2\right)^{1/2} \qquad (41)$$

$$\mu_{MSV,row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \qquad (42)$$

$$\sigma_{MSV,row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV,row}^{MFCC}(l)\right)^2\right)^{1/2} \qquad (43)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as

$$\mathbf{f}_{row}^{MFCC} = [\mu_{MSC,row}^{MFCC}(0),\, \sigma_{MSC,row}^{MFCC}(0),\, \mu_{MSV,row}^{MFCC}(0),\, \sigma_{MSV,row}^{MFCC}(0),\, \ldots,\, \mu_{MSC,row}^{MFCC}(L-1),\, \sigma_{MSC,row}^{MFCC}(L-1),\, \mu_{MSV,row}^{MFCC}(L-1),\, \sigma_{MSV,row}^{MFCC}(L-1)]^{T} \qquad (44)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows

$$\mu_{MSC,col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \qquad (45)$$

$$\sigma_{MSC,col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC,col}^{MFCC}(j)\right)^2\right)^{1/2} \qquad (46)$$

$$\mu_{MSV,col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \qquad (47)$$

$$\sigma_{MSV,col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV,col}^{MFCC}(j)\right)^2\right)^{1/2} \qquad (48)$$

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

$$\mathbf{f}_{col}^{MFCC} = [\mu_{MSC,col}^{MFCC}(0),\, \sigma_{MSC,col}^{MFCC}(0),\, \mu_{MSV,col}^{MFCC}(0),\, \sigma_{MSV,col}^{MFCC}(0),\, \ldots,\, \mu_{MSC,col}^{MFCC}(J-1),\, \sigma_{MSC,col}^{MFCC}(J-1),\, \mu_{MSV,col}^{MFCC}(J-1),\, \sigma_{MSV,col}^{MFCC}(J-1)]^{T} \qquad (49)$$

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained

$$\mathbf{f}^{MFCC} = [(\mathbf{f}_{row}^{MFCC})^{T}, (\mathbf{f}_{col}^{MFCC})^{T}]^{T} \qquad (50)$$

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4L+4J That is, the overall
feature dimension of SMMFCC is 80+32 = 112
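A compact sketch of this statistical aggregation, assuming MSC and MSV are stored as J×L arrays as in the earlier modulation-spectrum sketch. The elements are grouped by statistic rather than interleaved per dimension as written in Eq. (44); this ordering choice does not affect the classifier as long as it is applied consistently.

```python
import numpy as np

def aggregate_msc_msv(MSC, MSV):
    """MSC, MSV: J x L matrices (J modulation subbands, L feature dimensions).
    Returns the row-based (4L) and column-based (4J) statistics of
    Eqs. (40)-(49), concatenated as in Eq. (50)."""
    def stats(M, axis):
        return np.concatenate([M.mean(axis=axis), M.std(axis=axis)])
    # Row-based: statistics across modulation subbands, one pair per feature dimension.
    f_row = np.concatenate([stats(MSC, axis=0), stats(MSV, axis=0)])   # length 4L
    # Column-based: statistics across feature dimensions, one pair per modulation subband.
    f_col = np.concatenate([stats(MSC, axis=1), stats(MSV, axis=1)])   # length 4J
    return np.concatenate([f_row, f_col])
```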
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows

$$\mu_{MSC,row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \qquad (51)$$

$$\sigma_{MSC,row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j, d) - \mu_{MSC,row}^{OSC}(d)\right)^2\right)^{1/2} \qquad (52)$$

$$\mu_{MSV,row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \qquad (53)$$

$$\sigma_{MSV,row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j, d) - \mu_{MSV,row}^{OSC}(d)\right)^2\right)^{1/2} \qquad (54)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

$$\mathbf{f}_{row}^{OSC} = [\mu_{MSC,row}^{OSC}(0),\, \sigma_{MSC,row}^{OSC}(0),\, \mu_{MSV,row}^{OSC}(0),\, \sigma_{MSV,row}^{OSC}(0),\, \ldots,\, \mu_{MSC,row}^{OSC}(D-1),\, \sigma_{MSC,row}^{OSC}(D-1),\, \mu_{MSV,row}^{OSC}(D-1),\, \sigma_{MSV,row}^{OSC}(D-1)]^{T} \qquad (55)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows

$$\mu_{MSC,col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \qquad (56)$$

$$\sigma_{MSC,col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j, d) - \mu_{MSC,col}^{OSC}(j)\right)^2\right)^{1/2} \qquad (57)$$

$$\mu_{MSV,col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \qquad (58)$$

$$\sigma_{MSV,col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j, d) - \mu_{MSV,col}^{OSC}(j)\right)^2\right)^{1/2} \qquad (59)$$

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = [\mu_{MSC,col}^{OSC}(0),\, \sigma_{MSC,col}^{OSC}(0),\, \mu_{MSV,col}^{OSC}(0),\, \sigma_{MSV,col}^{OSC}(0),\, \ldots,\, \mu_{MSC,col}^{OSC}(J-1),\, \sigma_{MSC,col}^{OSC}(J-1),\, \mu_{MSV,col}^{OSC}(J-1),\, \sigma_{MSV,col}^{OSC}(J-1)]^{T} \qquad (60)$$

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained

$$\mathbf{f}^{OSC} = [(\mathbf{f}_{row}^{OSC})^{T}, (\mathbf{f}_{col}^{OSC})^{T}]^{T} \qquad (61)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D+4J That is, the overall
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows

$$\mu_{MSC,row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \qquad (62)$$

$$\sigma_{MSC,row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j, d) - \mu_{MSC,row}^{NASE}(d)\right)^2\right)^{1/2} \qquad (63)$$

$$\mu_{MSV,row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \qquad (64)$$

$$\sigma_{MSV,row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j, d) - \mu_{MSV,row}^{NASE}(d)\right)^2\right)^{1/2} \qquad (65)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

$$\mathbf{f}_{row}^{NASE} = [\mu_{MSC,row}^{NASE}(0),\, \sigma_{MSC,row}^{NASE}(0),\, \mu_{MSV,row}^{NASE}(0),\, \sigma_{MSV,row}^{NASE}(0),\, \ldots,\, \mu_{MSC,row}^{NASE}(D-1),\, \sigma_{MSC,row}^{NASE}(D-1),\, \mu_{MSV,row}^{NASE}(D-1),\, \sigma_{MSV,row}^{NASE}(D-1)]^{T} \qquad (66)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows

$$\mu_{MSC,col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \qquad (67)$$

$$\sigma_{MSC,col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j, d) - \mu_{MSC,col}^{NASE}(j)\right)^2\right)^{1/2} \qquad (68)$$

$$\mu_{MSV,col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \qquad (69)$$

$$\sigma_{MSV,col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j, d) - \mu_{MSV,col}^{NASE}(j)\right)^2\right)^{1/2} \qquad (70)$$

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = [\mu_{MSC,col}^{NASE}(0),\, \sigma_{MSC,col}^{NASE}(0),\, \mu_{MSV,col}^{NASE}(0),\, \sigma_{MSV,col}^{NASE}(0),\, \ldots,\, \mu_{MSC,col}^{NASE}(J-1),\, \sigma_{MSC,col}^{NASE}(J-1),\, \mu_{MSV,col}^{NASE}(J-1),\, \sigma_{MSV,col}^{NASE}(J-1)]^{T} \qquad (71)$$

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained

$$\mathbf{f}^{NASE} = [(\mathbf{f}_{row}^{NASE})^{T}, (\mathbf{f}_{col}^{NASE})^{T}]^{T} \qquad (72)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D+4J That is, the overall
feature dimension of SMASE is 76+32 = 108
Fig 28 The row-based modulation spectral feature values (for each feature dimension, the mean and standard deviation of the MSC and MSV entries are taken across all modulation subbands of the texture-window representation)
Fig 29 The column-based modulation spectral feature values (for each modulation subband, the mean and standard deviation of the MSC and MSV entries are taken across all feature dimensions)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
$$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n} \qquad (73)$$

where fc,n denotes the feature vector of the n-th music signal belonging to the c-th
music genre, f̄c is the representative feature vector for the c-th music genre, and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges of different feature values may differ, a linear normalization is
applied to get the normalized feature vector f̂c

$$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C \qquad (74)$$

where C is the number of classes, f̄c(m) denotes the m-th feature value of the c-th
representative feature vector, and fmax(m) and fmin(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals

$$f_{\max}(m) = \max_{1\le c\le C,\; 1\le j\le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1\le c\le C,\; 1\le j\le N_c} f_{c,j}(m) \qquad (75)$$

where fc,j(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
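A minimal sketch of this linear normalization. The per-dimension minimum and maximum are taken over the training vectors, and the guard for constant feature dimensions is an implementation choice rather than part of Eq. (74).

```python
import numpy as np

def normalize_features(F_train, F_test):
    """Min-max normalization following Eqs. (74)-(75).
    F_train, F_test: matrices with one feature vector per row."""
    f_min = F_train.min(axis=0)
    f_max = F_train.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)   # guard constant features
    return (F_train - f_min) / span, (F_test - f_min) / span
```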
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [28] aims at improving the classification
accuracy in a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as

$$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c}(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^{T} \qquad (76)$$

where xc,n is the n-th feature vector labeled as class c, x̄c is the mean vector of class
c, C is the total number of music classes, and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by

$$\mathbf{S}_B = \sum_{c=1}^{C} N_c\,(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^{T} \qquad (77)$$

where x̄ is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF, defined as the ratio of between-class scatter to within-class scatter

$$J_F(\mathbf{A}) = \mathrm{tr}\!\left((\mathbf{A}^{T}\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^{T}\mathbf{S}_B\mathbf{A})\right) \qquad (78)$$

From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ^(-1/2)

$$\mathbf{x}_w = (\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2})^{T}\mathbf{x} \qquad (79)$$

It can be shown that the whitened within-class scatter matrix
SWw = (ΦΛ^(-1/2))T SW (ΦΛ^(-1/2)) derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
SBw = (ΦΛ^(-1/2))T SB (ΦΛ^(-1/2)) contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of SBw
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C−1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
AWLDA is defined as

$$\mathbf{A}_{WLDA} = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Psi} \qquad (80)$$

AWLDA will be employed to transform each H-dimensional feature vector into a lower
h-dimensional vector Let x denote the H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^{T}\,\mathbf{x} \qquad (81)$$
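A hedged numpy sketch of the whitened LDA transform described above. The small constant added to the eigenvalues of SW is only a numerical guard, and n_components would typically be at most C−1. The returned matrix is used as y = A^T x, as in Eq. (81).

```python
import numpy as np

def whitened_lda(X, labels, n_components):
    """X: N x H matrix of training feature vectors, labels: length-N class ids.
    Returns the H x h whitened LDA matrix A_WLDA of Eq. (80)."""
    classes = np.unique(labels)
    overall_mean = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                      # Eq. (76)
        diff = (mc - overall_mean)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)                # Eq. (77)
    # Whitening of Sw: Sw = Phi Lambda Phi^T, whitening matrix Phi Lambda^{-1/2}
    eigval, Phi = np.linalg.eigh(Sw)
    whiten = Phi @ np.diag(1.0 / np.sqrt(eigval + 1e-12))
    Sb_w = whiten.T @ Sb @ whiten                          # whitened S_B
    eigval_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(eigval_b)[::-1][:n_components]      # largest eigenvalues first
    return whiten @ Psi[:, order]                          # Eq. (80)
```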
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 ≤ c ≤ C) music genre the centroid of the
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector

$$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n} \qquad (82)$$

where yc,n denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, ȳc is the representative feature vector of the
c-th music genre, and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by the Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has the minimum Euclidean
distance to y

$$s = \arg\min_{1\le c\le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \qquad (83)$$
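A short sketch of the nearest centroid classifier operating on the whitened-LDA transformed vectors, following Eqs. (82)-(83):

```python
import numpy as np

def train_centroids(Y_train, labels):
    # Eq. (82): per-genre centroid of the whitened-LDA transformed training vectors.
    classes = np.unique(labels)
    centroids = np.stack([Y_train[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify(y, classes, centroids):
    # Eq. (83): pick the genre whose centroid is closest in Euclidean distance.
    distances = np.linalg.norm(centroids - y, axis=1)
    return classes[np.argmin(distances)]
```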
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic Jazz/Blues Metal/Punk Rock/Pop and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop and 122/122 tracks of the World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
$$CA = \sum_{1\le c\le C} P_c \cdot CA_c \qquad (84)$$

where Pc is the probability of appearance of the c-th music genre and CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set : CA (%)
SMMFCC1 : 77.50
SMOSC1 : 79.15
SMASE1 : 77.78
SMMFCC1+SMOSC1+SMASE1 : 84.64
Table 32 Confusion matrices of row-based modulation spectral feature vectors (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       275       0          2      0          1       19
Electronic      0      91          0      1          7        6
Jazz            6       0         18      0          0        4
MetalPunk       2       3          0     36         20        4
PopRock         4      12          5      8         70       14
World          33       8          1      0          4       75
Total         320     114         26     45        102      122

(a) SMMFCC1 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94     0.00      7.69     0.00       0.98    15.57
Electronic    0.00    79.82      0.00     2.22       6.86     4.92
Jazz          1.88     0.00     69.23     0.00       0.00     3.28
MetalPunk     0.63     2.63      0.00    80.00      19.61     3.28
PopRock       1.25    10.53     19.23    17.78      68.63    11.48
World        10.31     7.02      3.85     0.00       3.92    61.48

(b) SMOSC1 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       292       1          1      0          2       10
Electronic      1      89          1      2         11       11
Jazz            4       0         19      1          1        6
MetalPunk       0       5          0     32         21        3
PopRock         0      13          3     10         61        8
World          23       6          2      0          6       84
Total         320     114         26     45        102      122

(b) SMOSC1 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25     0.88      3.85     0.00       1.96     8.20
Electronic    0.31    78.07      3.85     4.44      10.78     9.02
Jazz          1.25     0.00     73.08     2.22       0.98     4.92
MetalPunk     0.00     4.39      0.00    71.11      20.59     2.46
PopRock       0.00    11.40     11.54    22.22      59.80     6.56
World         7.19     5.26      7.69     0.00       5.88    68.85

(c) SMASE1 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       286       3          1      0          3       18
Electronic      0      87          1      1          9        5
Jazz            5       4         17      0          0        9
MetalPunk       0       4          1     36         18        4
PopRock         1      10          3      7         68       13
World          28       6          3      1          4       73
Total         320     114         26     45        102      122

(c) SMASE1 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38     2.63      3.85     0.00       2.94    14.75
Electronic    0.00    76.32      3.85     2.22       8.82     4.10
Jazz          1.56     3.51     65.38     0.00       0.00     7.38
MetalPunk     0.00     3.51      3.85    80.00      17.65     3.28
PopRock       0.31     8.77     11.54    15.56      66.67    10.66
World         8.75     5.26     11.54     2.22       3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300       0          1      0          0        9
Electronic      0      96          1      1          9        9
Jazz            2       1         21      0          0        1
MetalPunk       0       1          0     34          8        1
PopRock         1       9          2      9         80       16
World          17       7          1      1          5       86
Total         320     114         26     45        102      122

(d) SMMFCC1+SMOSC1+SMASE1 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      3.85     0.00       0.00     7.38
Electronic    0.00    84.21      3.85     2.22       8.82     7.38
Jazz          0.63     0.88     80.77     0.00       0.00     0.82
MetalPunk     0.00     0.88      0.00    75.56       7.84     0.82
PopRock       0.31     7.89      7.69    20.00      78.43    13.11
World         5.31     6.14      3.85     2.22       4.90    70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which differs from the row-based case Again the combined feature
vector achieves the best performance Table 34 shows the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA, %) for column-based modulation spectral feature vectors

Feature Set : CA (%)
SMMFCC2 : 70.64
SMOSC2 : 68.59
SMASE2 : 71.74
SMMFCC2+SMOSC2+SMASE2 : 78.60
Table 34 Confusion matrices of column-based modulation spectral feature vectors (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       272       1          1      0          6       22
Electronic      0      84          0      2          8        4
Jazz           13       1         19      1          2       19
MetalPunk       2       7          0     39         30        4
PopRock         0      11          3      3         47       19
World          33      10          3      0          9       54
Total         320     114         26     45        102      122

(a) SMMFCC2 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.00     0.88      3.85     0.00       5.88    18.03
Electronic    0.00    73.68      0.00     4.44       7.84     3.28
Jazz          4.06     0.88     73.08     2.22       1.96    15.57
MetalPunk     0.63     6.14      0.00    86.67      29.41     3.28
PopRock       0.00     9.65     11.54     6.67      46.08    15.57
World        10.31     8.77     11.54     0.00       8.82    44.26

(b) SMOSC2 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       262       2          0      0          3       33
Electronic      0      83          0      1          9        6
Jazz           17       1         20      0          6       20
MetalPunk       1       5          0     33         21        2
PopRock         0      17          4     10         51       10
World          40       6          2      1         12       51
Total         320     114         26     45        102      122

(b) SMOSC2 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      81.88     1.75      0.00     0.00       2.94    27.05
Electronic    0.00    72.81      0.00     2.22       8.82     4.92
Jazz          5.31     0.88     76.92     0.00       5.88    16.39
MetalPunk     0.31     4.39      0.00    73.33      20.59     1.64
PopRock       0.00    14.91     15.38    22.22      50.00     8.20
World        12.50     5.26      7.69     2.22      11.76    41.80

(c) SMASE2 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       277       0          0      0          2       29
Electronic      0      83          0      1          5        2
Jazz            9       3         17      1          2       15
MetalPunk       1       5          1     35         24        7
PopRock         2      13          1      8         57       15
World          31      10          7      0         12       54
Total         320     114         26     45        102      122

(c) SMASE2 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      86.56     0.00      0.00     0.00       1.96    23.77
Electronic    0.00    72.81      0.00     2.22       4.90     1.64
Jazz          2.81     2.63     65.38     2.22       1.96    12.30
MetalPunk     0.31     4.39      3.85    77.78      23.53     5.74
PopRock       0.63    11.40      3.85    17.78      55.88    12.30
World         9.69     8.77     26.92     0.00      11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       289       5          0      0          3       18
Electronic      0      89          0      2          4        4
Jazz            2       3         19      0          1       10
MetalPunk       2       2          0     38         21        2
PopRock         0      12          5      4         61       11
World          27       3          2      1         12       77
Total         320     114         26     45        102      122

(d) SMMFCC2+SMOSC2+SMASE2 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      90.31     4.39      0.00     0.00       2.94    14.75
Electronic    0.00    78.07      0.00     4.44       3.92     3.28
Jazz          0.63     2.63     73.08     0.00       0.98     8.20
MetalPunk     0.63     1.75      0.00    84.44      20.59     1.64
PopRock       0.00    10.53     19.23     8.89      59.80     9.02
World         8.44     2.63      7.69     2.22      11.76    63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table 31 and Table 33 we can see that
the combined feature vector achieves a better classification performance than each
individual row-based or column-based feature vector In particular the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32% Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set : CA (%)
SMMFCC3 : 80.38
SMOSC3 : 81.34
SMASE3 : 81.21
SMMFCC3+SMOSC3+SMASE3 : 85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300       2          1      0          3       19
Electronic      0      86          0      1          7        5
Jazz            2       0         18      0          0        3
MetalPunk       1       4          0     35         18        2
PopRock         1      16          4      8         67       13
World          16       6          3      1          7       80
Total         320     114         26     45        102      122

(a) SMMFCC3 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     1.75      3.85     0.00       2.94    15.57
Electronic    0.00    75.44      0.00     2.22       6.86     4.10
Jazz          0.63     0.00     69.23     0.00       0.00     2.46
MetalPunk     0.31     3.51      0.00    77.78      17.65     1.64
PopRock       0.31    14.04     15.38    17.78      65.69    10.66
World         5.00     5.26     11.54     2.22       6.86    65.57

(b) SMOSC3 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300       0          0      0          1       13
Electronic      0      90          1      2          9        6
Jazz            0       0         21      0          0        4
MetalPunk       0       2          0     31         21        2
PopRock         0      11          3     10         64       10
World          20      11          1      2          7       87
Total         320     114         26     45        102      122

(b) SMOSC3 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      0.00     0.00       0.98    10.66
Electronic    0.00    78.95      3.85     4.44       8.82     4.92
Jazz          0.00     0.00     80.77     0.00       0.00     3.28
MetalPunk     0.00     1.75      0.00    68.89      20.59     1.64
PopRock       0.00     9.65     11.54    22.22      62.75     8.20
World         6.25     9.65      3.85     4.44       6.86    71.31

(c) SMASE3 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       296       2          1      0          0       17
Electronic      1      91          0      1          4        3
Jazz            0       2         19      0          0        5
MetalPunk       0       2          1     34         20        8
PopRock         2      13          4      8         71        8
World          21       4          1      2          7       81
Total         320     114         26     45        102      122

(c) SMASE3 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      92.50     1.75      3.85     0.00       0.00    13.93
Electronic    0.31    79.82      0.00     2.22       3.92     2.46
Jazz          0.00     1.75     73.08     0.00       0.00     4.10
MetalPunk     0.00     1.75      3.85    75.56      19.61     6.56
PopRock       0.63    11.40     15.38    17.78      69.61     6.56
World         6.56     3.51      3.85     4.44       6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 — number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300       2          0      0          0        8
Electronic      2      95          0      2          7        9
Jazz            1       1         20      0          0        0
MetalPunk       0       0          0     35         10        1
PopRock         1      10          3      7         79       11
World          16       6          3      1          6       93
Total         320     114         26     45        102      122

(d) SMMFCC3+SMOSC3+SMASE3 — classification accuracy (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     1.75      0.00     0.00       0.00     6.56
Electronic    0.63    83.33      0.00     4.44       6.86     7.38
Jazz          0.31     0.88     76.92     0.00       0.00     0.00
MetalPunk     0.00     0.00      0.00    77.78       9.80     0.82
PopRock       0.31     8.77     11.54    15.56      77.45     9.02
World         5.00     5.26     11.54     2.22       5.88    76.23
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) for each feature value

Feature Set : MSCs & MSVs : MSE
SMMFCC1 : 77.50 : 72.02
SMMFCC2 : 70.64 : 69.82
SMMFCC3 : 80.38 : 79.15
SMOSC1 : 79.15 : 77.50
SMOSC2 : 68.59 : 70.51
SMOSC3 : 81.34 : 80.11
SMASE1 : 77.78 : 76.41
SMASE2 : 71.74 : 71.06
SMASE3 : 81.21 : 79.15
SMMFCC1+SMOSC1+SMASE1 : 84.64 : 85.08
SMMFCC2+SMOSC2+SMASE2 : 78.60 : 79.01
SMMFCC3+SMOSC3+SMASE3 : 85.32 : 85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together, the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox, "Features and classifiers for the automatic classification of
musical audio signals," Proceedings of International Conference on Music
Information Retrieval, 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch, "A hierarchical approach to automatic musical
genre classification," in Proc of the 6th Int Conf on Digital Audio Effects, pp
8-11, September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara, "Music genre classification with taxonomy," in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing, Vol 5, pp 197-200,
March 2005
[16] J J Aucouturier and F Pachet, "Representing musical genre: a state of the art,"
Journal of New Music Research, Vol 32, No 1, pp 83-93, 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley, "Beat tracking with a two state model," in
Proc Int Conf on Acoustics Speech and Signal Processing (ICASSP), 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis, A Ermolinskyi and P Cook, "Pitch histograms in audio and
symbolic music information retrieval," in Proc IRCAM, 2002
[21] T Tolonen and M Karjalainen, "A computationally efficient multipitch analysis
model," IEEE Transactions on Speech and Audio Processing, Vol 8, No 6, pp
708-716, November 2000
[22] R Meddis and L O'Mard, "A unitary model of pitch perception," Acoustical
Society of America, Vol 102, No 3, pp 1811-1820, September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury, N Morgan and S Greenberg, "Robust speech recognition using
the modulation spectrogram," Speech Commun, Vol 25, No 1, pp 117-132,
1998
[25] S Sukittanon, L E Atlas and J W Pitton, "Modulation-scale analysis for
content identification," IEEE Transactions on Signal Processing, Vol 52, No 10,
pp 3023-3035, October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire "A decision-theoretic generalization of online
learning and an application to boosting" Journal of Computer and System
Sciences 55 (1) (1997) 119-139
top-down structure This is because it is easy to merge leaf classes into the same
parent class in the bottom-up structure Therefore the upper layer can be easily
constructed In their experimental results the classification accuracy of the
hierarchical bottom-up approach outperforms that of the top-down approach by about
3%-5%
Li and Ogihara [15] investigated the effect of two different taxonomy structures
for music genre classification They also proposed an approach to automatic
generation of music genre taxonomies based on the confusion matrix computed by
linear discriminant projection This approach avoids the time-consuming and
expensive task of manually constructing taxonomies It is also helpful for music
collections in which there are no natural taxonomies [16] According to a given genre
taxonomy many different approaches have been proposed to classify the music genre
for raw music tracks In general a music genre classification system consists of three
major aspects feature extraction feature selection and feature classification Fig 13
shows the block diagram of a music genre classification system
Fig 11 A hierarchical audio taxonomy
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
$$ZCR_t = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1])\,\right|$$
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
$$C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
$$SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency Rt below which 85 of the magnitude distribution is concentrated
$$\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]$$
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
$$SF_t = \sum_{k=0}^{N-1} \left(N_t[k] - N_{t-1}[k]\right)^2$$
where Nt[k] and Nt-1[k] are the normalized magnitude spectra of the t-th frame and
the (t-1)-th frame respectively A short code sketch illustrating these frame-level
timbral features is given at the end of this subsection
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1 kHz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope (NASE) NASE is defined in the MPEG-7
standard [17] First the audio spectral envelope (ASE) is obtained from the sum of the
log power spectrum in each logarithmic subband Then each ASE coefficient is
normalized with the root-mean-square (RMS) energy yielding a normalized version
of the ASE called NASE
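To make the frame-level definitions in (2)-(6) above concrete, the following is a minimal NumPy sketch (not part of the original thesis); the function name, helper variable names and the use of a real-valued FFT are assumptions made for the example, and only the formulas given above are implemented.

import numpy as np

def timbral_features(x, prev_mag=None):
    """ZCR, spectral centroid, bandwidth, roll-off and flux of one frame x."""
    mag = np.abs(np.fft.rfft(x))                   # magnitude spectrum M_t[n]
    # Zero-crossing rate: half the number of sign changes in the frame
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(x))) > 0)
    bins = np.arange(1, len(mag) + 1)
    centroid = np.sum(bins * mag) / np.sum(mag)    # center of gravity of the spectrum
    bandwidth = np.sum(((bins - centroid) ** 2) * mag) / np.sum(mag)
    # Roll-off: smallest bin below which 85% of the magnitude is concentrated
    cumulative = np.cumsum(mag)
    rolloff = np.searchsorted(cumulative, 0.85 * cumulative[-1])
    # Flux: squared difference between successive normalized magnitude spectra
    flux = 0.0
    if prev_mag is not None:
        norm = mag / (np.sum(mag) + 1e-12)
        prev_norm = prev_mag / (np.sum(prev_mag) + 1e-12)
        flux = np.sum((norm - prev_norm) ** 2)
    return zcr, centroid, bandwidth, rolloff, flux, mag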
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers The main difference is that no fundamental frequency
chord key or other high-level feature has to be determined in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let xi = [xi[0], xi[1], …, xi[D-1]]T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation are
calculated as follows
$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1$$

$$\sigma[d] = \left[\frac{1}{T}\sum_{i=0}^{T-1}\left(x_i[d] - \mu[d]\right)^2\right]^{1/2}, \quad 0 \le d \le D-1$$
where T is the number of frames of the input signal This statistical method captures
neither the relationship between features nor the time-varying behavior of music
signals
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p × D × D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximizing the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complex to be modeled by a single Gaussian distribution In
addition the same LDA transformation matrix is used for all the classes which
does not take the class-wise differences into account
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir Orchestra Piano and String Quartet In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quartet and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical frame-based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance with/without a decision tree
classifier for the Gaussian classifier GMM with three components and LDA In their
experiment the feature vector with GMM classifier and decision tree classifier has the
best accuracy of 82.79%
Xu et al [29] applied SVM to discriminate between pure music and vocal music
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] used some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 93.0% for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can reach 88.60% when the frame length is 30 s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91% (artificial and natural
sounds) for the second level is 99% (instrumental and automobile human and
nonhuman) and 95% for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband the WPT decomposes both
bands at each level
Bergstra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally the conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
$$\hat{s}[n] = s[n] - a \times s[n-1] \quad (1)$$
where s[n] is the current sample and s[n-1] is the previous sample a typical
value for a is 0.95
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
$$\tilde{s}_i[n] = \hat{s}_i[n]\cdot w[n], \quad 0 \le n \le N-1 \quad (2)$$
where the Hamming window function w[n] is defined as
$$w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \quad (3)$$
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
$$X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j\frac{2\pi kn}{N}}, \quad 0 \le k \le N-1 \quad (4)$$
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
$$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (5)$$
where B is the total number of filters (B is 25 in the study) I_b^l and I_b^h
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter Ai[k] is the squared amplitude of Xi[k] that is
Ai[k] = |Xi[k]|^2
I_b^l and I_b^h are given as
$$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N} \quad (6)$$
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
$$MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\left(1 + E_i(b)\right)\cos\!\left(\frac{\pi l (b + 0.5)}{B}\right), \quad 0 \le l < L \quad (7)$$
where L is the length of MFCC feature vector (L is 20 in the study)
Therefore the MFCC feature vector can be represented as follows
xMFCC = [MFCC(0), MFCC(1), …, MFCC(L-1)]T (8)
Fig 21 The flowchart for computing MFCC
Table 21 The range of each triangular band-pass filter
Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
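As an illustration of Steps 1-6, a compact NumPy sketch of the MFCC computation of one frame is given below (it is not taken from the thesis); the function names, the sampling-frequency argument and the way the band edges of Table 21 are passed in are assumptions of the example, and the rectangular subband summation of Eq (5) is used rather than a triangular filterbank.

import numpy as np

def pre_emphasis(s, a=0.95):
    # Step 1: s_hat[n] = s[n] - a * s[n-1]
    return np.append(s[0], s[1:] - a * s[:-1])

def mfcc_frame(s_frame, band_edges, fs, L=20):
    """MFCC of one pre-emphasized frame following Steps 3-6.
    band_edges: list of (f_low, f_high) pairs, e.g. the 25 bands of Table 21."""
    N = len(s_frame)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))   # Hamming window
    X = np.fft.fft(s_frame * w)
    A = np.abs(X) ** 2                                             # squared amplitude A_i[k]
    B = len(band_edges)
    E = np.zeros(B)
    for b, (fl, fh) in enumerate(band_edges):
        lo, hi = int(fl / (fs / N)), int(fh / (fs / N))            # I_b^l, I_b^h of Eq (6)
        E[b] = np.sum(A[lo:hi + 1])                                # subband energy E(b)
    mfcc = np.zeros(L)                                             # DCT of the log-energies, Eq (7)
    for l in range(L):
        mfcc[l] = np.sum(np.log10(1.0 + E) * np.cos(np.pi * l * (np.arange(B) + 0.5) / B))
    return mfcc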
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then applied to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
$$E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k], \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (9)$$
where B is the number of subbands I_b^l and I_b^h denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter
Ai[k] is the squared amplitude of Xi[k] that is Ai[k] = |Xi[k]|^2
I_b^l and I_b^h are given as
$$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N} \quad (10)$$
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb,1, Mb,2, …, Mb,Nb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb,1 ≥ Mb,2 ≥ … ≥ Mb,Nb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
$$Peak(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\right) \quad (11)$$

$$Valley(b) = \log\!\left(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\right) \quad (12)$$

where α is a neighborhood factor (α is 0.2 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
$$SC(b) = Peak(b) - Valley(b) \quad (13)$$
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0), …, Valley(B-1), SC(0), …, SC(B-1)]T (14)
Fig 22 The flowchart for computing OSC
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 441 kHz)
Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
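A minimal sketch of the peak/valley selection of Steps 2-3 is shown below (illustrative only, not from the thesis); the list of FFT bin index pairs corresponding to the octave bands of Table 22 is assumed to be precomputed with Eq (10), and the small constant inside the logarithm only guards against empty bands.

import numpy as np

def osc_frame(mag, subband_bins, alpha=0.2):
    """OSC of one frame: spectral valley and contrast per octave subband.
    mag: magnitude spectrum of the frame; subband_bins: list of (lo, hi) FFT bin indices."""
    valleys, contrasts = [], []
    for lo, hi in subband_bins:
        band = np.sort(mag[lo:hi + 1])[::-1]          # magnitudes sorted in decreasing order
        Nb = len(band)
        K = max(1, int(round(alpha * Nb)))            # neighborhood size alpha * N_b
        peak = np.log(np.mean(band[:K]) + 1e-12)      # average of the K largest magnitudes, Eq (11)
        valley = np.log(np.mean(band[-K:]) + 1e-12)   # average of the K smallest magnitudes, Eq (12)
        valleys.append(valley)
        contrasts.append(peak - valley)               # spectral contrast, Eq (13)
    return np.array(valleys + contrasts)              # x_OSC = [Valley(0..B-1), SC(0..B-1)]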
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum denoted X(k), 1 ≤ k ≤ N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
$$P(k) = \begin{cases} \dfrac{1}{E_w \cdot N}\,|X(k)|^2, & k = 0,\ k = N/2 \\[2mm] \dfrac{2}{E_w \cdot N}\,|X(k)|^2, & 0 < k < N/2 \end{cases} \quad (15)$$
where Ew is the energy of the Hamming window function w(n) of size Nw
$$E_w = \sum_{n=0}^{N_w-1} |w(n)|^2 \quad (16)$$
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a
spectrum of an 8-octave interval (see Fig 24) The NASE scale filtering
operation can be described as follows (see Table 23)
$$ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k), \quad 0 \le b < B,\; 0 \le k \le N/2 - 1 \quad (17)$$
where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r where r is the spectral resolution of
the frequency subbands ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in the study)
$$r = 2^j \text{ octaves}, \quad -4 \le j \le 3 \quad (18)$$
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
$$I_b^l = \frac{f_b^l}{f_s/N}, \qquad I_b^h = \frac{f_b^h}{f_s/N} \quad (19)$$
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband
$$ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k), \quad 0 \le b \le B+1 \quad (20)$$

Each ASE coefficient is then converted to the decibel scale
$$ASE_{dB}(b) = 10\log_{10}\!\left(ASE(b)\right), \quad 0 \le b \le B+1 \quad (21)$$

The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
$$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 \quad (22)$$

where the RMS-norm gain value R is defined as
$$R = \sqrt{\sum_{b=0}^{B+1}\left(ASE_{dB}(b)\right)^2} \quad (23)$$
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge and the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R, NASE(0), NASE(1), …, NASE(B+1)]T (24)
Fig 23 The flowchart for computing NASE
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2
(one coefficient below loEdge = 62.5 Hz, 16 coefficients between loEdge and hiEdge
= 16 kHz with subband edges at 62.5, 88.4, 125, 176.8, 250, 353.6, 500, 707.1, 1000,
1414.2, 2000, 2828.4, 4000, 5656.9, 8000, 11313.7 and 16000 Hz, and one coefficient
above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
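The NASE computation of Steps 1-3 can be sketched as follows (an illustrative example, not part of the thesis); the subband bin indices for the 18 bands of Table 23 are assumed to be precomputed with Eq (19), and the helper names are hypothetical.

import numpy as np

def nase_frame(x, subband_bins):
    """NASE of one frame: normalized log-power per logarithmic subband plus the RMS gain R.
    subband_bins: list of (lo, hi) FFT bin indices for the B+2 subbands of Table 23."""
    N = len(x)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                              # window energy E_w, Eq (16)
    X = np.fft.rfft(x * w)
    P = (np.abs(X) ** 2) / (Ew * N)                  # power spectrum, Eq (15)
    P[1:-1] *= 2.0                                   # interior bins carry twice the power
    ase = np.array([np.sum(P[lo:hi + 1]) for lo, hi in subband_bins])   # Eq (20)
    ase_db = 10.0 * np.log10(ase + 1e-12)            # decibel scale, Eq (21)
    R = np.sqrt(np.sum(ase_db ** 2))                 # RMS-norm gain value, Eq (23)
    nase = ase_db / R                                # Eq (22)
    return np.concatenate(([R], nase))               # x_NASE = [R, NASE(0), ..., NASE(B+1)]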
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals we
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
$$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t\cdot(W/2)+n}[l]\; e^{-j\frac{2\pi mn}{W}}, \quad 0 \le m < W,\; 0 \le l < L \quad (25)$$
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
$$\bar{M}^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, l)\right|, \quad 0 \le m < W,\; 0 \le l < L \quad (26)$$
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
$$MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) \quad (27)$$

$$MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l) \quad (28)$$
where Φ_j^l and Φ_j^h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 ≤ j < J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \quad (29)$$
As a result all MSCs (or MSVs) will form a L×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2×20×8 = 320
Fig 25 the flowchart for extracting MMFCC
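The modulation spectral analysis of Steps 2-3 applies in the same way to MFCC, OSC and NASE trajectories; the following NumPy sketch (not from the thesis) assumes the per-frame feature values are stacked in a matrix F of size (number of frames) × D, uses the modulation subband index ranges of Table 24, and arranges the resulting MSC/MSV matrices with one row per feature dimension and one column per modulation subband.

import numpy as np

MOD_SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_contrast(F, W=512, subbands=MOD_SUBBANDS):
    """Modulation spectral contrast/valley matrices from a feature trajectory F (frames x D)."""
    num_frames, D = F.shape
    hop = W // 2                                            # 50% overlap between texture windows
    mags = []
    for start in range(0, num_frames - W + 1, hop):
        segment = F[start:start + W, :]                     # one texture window
        mags.append(np.abs(np.fft.fft(segment, axis=0)))    # FFT along each time trajectory, Eq (25)
    M = np.mean(mags, axis=0)                               # averaged modulation spectrogram, Eq (26)
    J = len(subbands)
    MSC = np.zeros((D, J))
    MSV = np.zeros((D, J))
    for j, (lo, hi) in enumerate(subbands):
        msp = M[lo:hi, :].max(axis=0)                       # modulation spectral peak, Eq (27)
        msv = M[lo:hi, :].min(axis=0)                       # modulation spectral valley, Eq (28)
        MSV[:, j] = msv
        MSC[:, j] = msp - msv                               # modulation spectral contrast, Eq (29)
    return MSC, MSV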
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
$$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t\cdot(W/2)+n}[d]\; e^{-j\frac{2\pi mn}{W}}, \quad 0 \le m < W,\; 0 \le d < D \quad (30)$$
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
$$\bar{M}^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, d)\right|, \quad 0 \le m < W,\; 0 \le d < D \quad (31)$$
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
$$MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) \quad (32)$$

$$MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d) \quad (33)$$
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 ≤ j < J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \quad (34)$$
As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MOSC is 2×20×8 = 320
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
$$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t\cdot(W/2)+n}[d]\; e^{-j\frac{2\pi mn}{W}}, \quad 0 \le m < W,\; 0 \le d < D \quad (35)$$
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
$$\bar{M}^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, d)\right|, \quad 0 \le m < W,\; 0 \le d < D \quad (36)$$
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24)
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
$$MSP^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d) \quad (37)$$

$$MSV^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d) \quad (38)$$
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 ≤ j < J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \quad (39)$$
As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MASE is 2×19×8 = 304
Fig 27 The flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies which reflect the beat intervals of a
music signal (see Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectral/cepstral feature values (see Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
$$\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \quad (40)$$

$$\sigma_{MSC\text{-}row}^{MFCC}(l) = \left[\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l)\right)^2\right]^{1/2} \quad (41)$$

$$\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \quad (42)$$

$$\sigma_{MSV\text{-}row}^{MFCC}(l) = \left[\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l)\right)^2\right]^{1/2} \quad (43)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
$$\mathbf{f}_{row}^{MFCC} = \left[\mu_{MSC\text{-}row}^{MFCC}(0),\, \sigma_{MSC\text{-}row}^{MFCC}(0),\, \mu_{MSV\text{-}row}^{MFCC}(0),\, \sigma_{MSV\text{-}row}^{MFCC}(0),\, \ldots,\, \mu_{MSC\text{-}row}^{MFCC}(L-1),\, \sigma_{MSC\text{-}row}^{MFCC}(L-1),\, \mu_{MSV\text{-}row}^{MFCC}(L-1),\, \sigma_{MSV\text{-}row}^{MFCC}(L-1)\right]^T \quad (44)$$
Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
$$\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \quad (45)$$

$$\sigma_{MSC\text{-}col}^{MFCC}(j) = \left[\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j)\right)^2\right]^{1/2} \quad (46)$$

$$\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \quad (47)$$

$$\sigma_{MSV\text{-}col}^{MFCC}(j) = \left[\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j)\right)^2\right]^{1/2} \quad (48)$$
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
$$\mathbf{f}_{col}^{MFCC} = \left[\mu_{MSC\text{-}col}^{MFCC}(0),\, \sigma_{MSC\text{-}col}^{MFCC}(0),\, \mu_{MSV\text{-}col}^{MFCC}(0),\, \sigma_{MSV\text{-}col}^{MFCC}(0),\, \ldots,\, \mu_{MSC\text{-}col}^{MFCC}(J-1),\, \sigma_{MSC\text{-}col}^{MFCC}(J-1),\, \mu_{MSV\text{-}col}^{MFCC}(J-1),\, \sigma_{MSV\text{-}col}^{MFCC}(J-1)\right]^T \quad (49)$$
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4L+4J) can be obtained
$$\mathbf{f}^{MFCC} = \left[\left(\mathbf{f}_{row}^{MFCC}\right)^T, \left(\mathbf{f}_{col}^{MFCC}\right)^T\right]^T \quad (50)$$
In summary the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the
row-based modulation spectral feature vector and the column-based modulation
spectral feature vector results in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
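A compact sketch of the row-based and column-based aggregation is given below (illustrative, not from the thesis); the MSC and MSV matrices are assumed to be arranged with one row per feature dimension and one column per modulation subband, and the ordering of the concatenated values differs slightly from Eqs (44) and (49) without affecting the classifier.

import numpy as np

def aggregate(MSC, MSV):
    """Row-based and column-based statistics of the D x J MSC/MSV matrices."""
    # Row-based: mean and standard deviation over the J modulation subbands, Eqs (40)-(43)
    f_row = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                            MSV.mean(axis=1), MSV.std(axis=1)])   # length 4D
    # Column-based: mean and standard deviation over the D feature dimensions, Eqs (45)-(48)
    f_col = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                            MSV.mean(axis=0), MSV.std(axis=0)])   # length 4J
    return np.concatenate([f_row, f_col])                         # combined vector of length 4D+4J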
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows
$$\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \quad (51)$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \left[\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d)\right)^2\right]^{1/2} \quad (52)$$

$$\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \quad (53)$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \left[\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d)\right)^2\right]^{1/2} \quad (54)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

$$\mathbf{f}_{row}^{OSC} = \left[\mu_{MSC\text{-}row}^{OSC}(0),\, \sigma_{MSC\text{-}row}^{OSC}(0),\, \mu_{MSV\text{-}row}^{OSC}(0),\, \sigma_{MSV\text{-}row}^{OSC}(0),\, \ldots,\, \mu_{MSC\text{-}row}^{OSC}(D-1),\, \sigma_{MSC\text{-}row}^{OSC}(D-1),\, \mu_{MSV\text{-}row}^{OSC}(D-1),\, \sigma_{MSV\text{-}row}^{OSC}(D-1)\right]^T \quad (55)$$

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows

$$\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \quad (56)$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \left[\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j)\right)^2\right]^{1/2} \quad (57)$$

$$\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \quad (58)$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \left[\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j)\right)^2\right]^{1/2} \quad (59)$$

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = \left[\mu_{MSC\text{-}col}^{OSC}(0),\, \sigma_{MSC\text{-}col}^{OSC}(0),\, \mu_{MSV\text{-}col}^{OSC}(0),\, \sigma_{MSV\text{-}col}^{OSC}(0),\, \ldots,\, \mu_{MSC\text{-}col}^{OSC}(J-1),\, \sigma_{MSC\text{-}col}^{OSC}(J-1),\, \mu_{MSV\text{-}col}^{OSC}(J-1),\, \sigma_{MSV\text{-}col}^{OSC}(J-1)\right]^T \quad (60)$$

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained

$$\mathbf{f}^{OSC} = \left[\left(\mathbf{f}_{row}^{OSC}\right)^T, \left(\mathbf{f}_{col}^{OSC}\right)^T\right]^T \quad (61)$$

In summary the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the
row-based modulation spectral feature vector and the column-based modulation
spectral feature vector results in a feature vector of length 4D+4J That is the overall
feature dimension of SMOSC is 80+32 = 112

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows
$$\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \quad (62)$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \left[\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d)\right)^2\right]^{1/2} \quad (63)$$

$$\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \quad (64)$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \left[\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d)\right)^2\right]^{1/2} \quad (65)$$

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

$$\mathbf{f}_{row}^{NASE} = \left[\mu_{MSC\text{-}row}^{NASE}(0),\, \sigma_{MSC\text{-}row}^{NASE}(0),\, \mu_{MSV\text{-}row}^{NASE}(0),\, \sigma_{MSV\text{-}row}^{NASE}(0),\, \ldots,\, \mu_{MSC\text{-}row}^{NASE}(D-1),\, \sigma_{MSC\text{-}row}^{NASE}(D-1),\, \mu_{MSV\text{-}row}^{NASE}(D-1),\, \sigma_{MSV\text{-}row}^{NASE}(D-1)\right]^T \quad (66)$$

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows

$$\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \quad (67)$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \left[\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j)\right)^2\right]^{1/2} \quad (68)$$

$$\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \quad (69)$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \left[\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j)\right)^2\right]^{1/2} \quad (70)$$

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = \left[\mu_{MSC\text{-}col}^{NASE}(0),\, \sigma_{MSC\text{-}col}^{NASE}(0),\, \mu_{MSV\text{-}col}^{NASE}(0),\, \sigma_{MSV\text{-}col}^{NASE}(0),\, \ldots,\, \mu_{MSC\text{-}col}^{NASE}(J-1),\, \sigma_{MSC\text{-}col}^{NASE}(J-1),\, \mu_{MSV\text{-}col}^{NASE}(J-1),\, \sigma_{MSV\text{-}col}^{NASE}(J-1)\right]^T \quad (71)$$

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained

$$\mathbf{f}^{NASE} = \left[\left(\mathbf{f}_{row}^{NASE}\right)^T, \left(\mathbf{f}_{col}^{NASE}\right)^T\right]^T \quad (72)$$

In summary the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the
row-based modulation spectral feature vector and the column-based modulation
spectral feature vector results in a feature vector of length 4D+4J That is the overall
feature dimension of SMASE is 76+32 = 108
Fig 28 The row-based modulation spectral feature values (mean and standard
deviation taken along each row of the MSC/MSV matrix, i.e. over the modulation
subbands of one feature dimension)
Fig 29 The column-based modulation spectral feature values (mean and standard
deviation taken along each column of the MSC/MSV matrix, i.e. over the feature
dimensions of one modulation subband)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
$$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{f}_{c,n} \quad (73)$$
where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th
music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges of the feature values may be different a linear normalization is
applied to get the normalized feature vector $\hat{\mathbf{f}}_c$
$$\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C \quad (74)$$
where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th
representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the
maximum and minimum of the m-th feature values of all training music signals
$$f_{\max}(m) = \max_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m) \quad (75)$$
where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
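A minimal sketch of this linear normalization (not from the thesis) is shown below; the array layout and function name are assumptions of the example, and the small constant added to the denominator only guards against a zero range.

import numpy as np

def normalize_representatives(train_feats, labels, C):
    """Min-max normalization of the representative genre feature vectors, Eqs (73)-(75).
    train_feats: (num_tracks, M) matrix of training feature vectors;
    labels: genre index 0..C-1 of each training track."""
    labels = np.asarray(labels)
    f_min = train_feats.min(axis=0)                      # minimum of each feature over all tracks
    f_max = train_feats.max(axis=0)                      # maximum of each feature over all tracks
    reps = np.zeros((C, train_feats.shape[1]))
    for c in range(C):
        reps[c] = train_feats[labels == c].mean(axis=0)  # representative vector of genre c, Eq (73)
    reps_norm = (reps - f_min) / (f_max - f_min + 1e-12) # Eq (74)
    return reps_norm, f_min, f_max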
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [28] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
$$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T \quad (76)$$
where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
$$\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T \quad (77)$$
where $\bar{\mathbf{x}}$ is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
$$J_F(\mathbf{A}) = \mathrm{tr}\!\left((\mathbf{A}^T\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^T\mathbf{S}_B\mathbf{A})\right) \quad (78)$$
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by $\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}$
$$\mathbf{x}_w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\,\mathbf{x} \quad (79)$$
It can be shown that the whitened within-class scatter matrix
$\mathbf{S}_W^w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\mathbf{S}_W(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
$\mathbf{S}_B^w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\mathbf{S}_B(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of $\mathbf{S}_B^w$
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (C-1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
AWLDA is defined as
$$\mathbf{A}_{WLDA} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi} \quad (80)$$
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
$$\mathbf{y} = \mathbf{A}_{WLDA}^T\,\mathbf{x} \quad (81)$$
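The whitened LDA transformation can be sketched with NumPy as follows (an illustrative example, not the thesis implementation); the eigendecomposition routine and the small regularization constant are assumptions of the sketch.

import numpy as np

def whitened_lda(X, labels, C):
    """Whitened LDA transformation matrix A_WLDA, Eqs (76)-(80).
    X: (num_samples, H) training vectors; labels: genre index 0..C-1 of each sample."""
    labels = np.asarray(labels)
    H = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(C):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                                   # within-class scatter
        Sb += len(Xc) * np.outer(mc - overall_mean, mc - overall_mean)  # between-class scatter
    eigval, Phi = np.linalg.eigh(Sw)                 # S_W = Phi Lambda Phi^T
    Wh = Phi @ np.diag(1.0 / np.sqrt(eigval + 1e-12))# whitening matrix Phi Lambda^(-1/2)
    Sb_w = Wh.T @ Sb @ Wh                            # whitened between-class scatter
    eigval_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(eigval_b)[::-1][:C - 1]       # keep the C-1 largest eigenvalues
    return Wh @ Psi[:, order]                        # A_WLDA; y = A_WLDA^T x maps H-dim to (C-1)-dim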
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
$$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{y}_{c,n} \quad (82)$$
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
$$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \quad (83)$$
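A sketch of the nearest centroid decision rule of Eqs (82)-(83) is given below (not from the thesis); the centroid matrix is assumed to hold one whitened-LDA transformed representative vector per genre.

import numpy as np

def classify(y, centroids):
    """Nearest-centroid genre decision.
    y: whitened-LDA transformed feature vector; centroids: (C, h) matrix of genre centroids."""
    dists = np.linalg.norm(centroids - y, axis=1)   # Euclidean distance to each genre centroid
    return int(np.argmin(dists))                    # index s of the identified music genre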
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic Jazz/Blues Metal/Punk Rock/Pop and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical 115/114
tracks of Electronic 26/26 tracks of Jazz/Blues 45/45 tracks of Metal/Punk 101/102
tracks of Rock/Pop and 122/122 tracks of the World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \quad (84)$$
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
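Eq (84) can be evaluated as in the following sketch (illustrative only); the per-genre accuracies and the per-genre track counts of the test set are assumed to be given.

import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    """Overall accuracy CA = sum_c P_c * CA_c, Eq (84)."""
    p = np.asarray(class_counts, dtype=float)
    p /= p.sum()                                        # probability of appearance of each genre
    return float(np.sum(p * np.asarray(per_class_acc)))

# example with the test-set sizes used here: class_counts = [320, 114, 26, 45, 102, 122]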
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived from
modulation spectral analysis of MFCC OSC and NASE From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived from
modulation spectral analysis of MFCC OSC and NASE From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2
which is different from the row-based case As before the combined feature
vector also gets the best performance Table 34 shows the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table 31 and Table 33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32% Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the feature value. In our approach, the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband are used as the feature values instead. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs yields better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) obtained with the MSC & MSV features and with the modulation spectral energy (MSE) of each modulation subband

Feature Set                     MSCs & MSVs    MSE
SMMFCC1                         77.50          72.02
SMMFCC2                         70.64          69.82
SMMFCC3                         80.38          79.15
SMOSC1                          79.15          77.50
SMOSC2                          68.59          70.51
SMOSC3                          81.34          80.11
SMASE1                          77.78          76.41
SMASE2                          71.74          71.06
SMASE3                          81.21          79.15
SMMFCC1+SMOSC1+SMASE1           84.64          85.08
SMMFCC2+SMOSC2+SMASE2           78.60          79.01
SMMFCC3+SMOSC3+SMASE3           85.32          85.19
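The difference between the two feature definitions compared in Table 37 can be made concrete with a short sketch. The following Python fragment is illustrative only (the function and variable names are mine, not the thesis's): given the time-averaged modulation spectrum of one feature trajectory, it computes the conventional modulation spectral energy (MSE) of each modulation subband as well as the MSC/MSV pair used in this work.

```python
import numpy as np

# Modulation subband boundaries (FFT-bin indices) taken from Table 24 (W = 512).
SUBBAND_EDGES = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def subband_features(avg_mod_spectrum):
    """avg_mod_spectrum: 1-D array, time-averaged magnitude modulation spectrum
    of a single feature trajectory (length >= 256)."""
    mse, msc, msv = [], [], []
    for lo, hi in SUBBAND_EDGES:
        band = avg_mod_spectrum[lo:hi]
        mse.append(np.sum(band ** 2))          # conventional: subband energy
        peak, valley = band.max(), band.min()  # proposed: subband peak and valley
        msv.append(valley)
        msc.append(peak - valley)              # modulation spectral contrast
    return np.array(mse), np.array(msc), np.array(msv)

# toy usage with a random spectrum
mse, msc, msv = subband_features(np.abs(np.random.randn(256)))
```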
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in Proc Int Conf on Acoustics Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
Fig 12 A hierarchical audio taxonomy
Fig 13 A music genre classification system
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR): ZCR provides a measure of the noisiness of the signal. It is defined as

ZCR_t = \frac{1}{2} \sum_{n=1}^{N-1} \left| \mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1]) \right|

where the sign function returns 1 for positive input and 0 for negative input, and x_t[n] is the time-domain signal of frame t.
(3) Spectral Centroid: the spectral centroid is defined as the center of gravity of the magnitude spectrum

C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}

where N is the length of the short-time Fourier transform (STFT) and M_t[n] is the magnitude of the n-th frequency bin of the t-th frame.
(4) Spectral Bandwidth: the spectral bandwidth describes the frequency bandwidth of the signal around the spectral centroid

SB_t = \frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}
(5) Spectral Roll-off: the spectral roll-off is a measure of spectral shape. It is defined as the frequency R_t below which 85% of the magnitude distribution is concentrated

\sum_{k=0}^{R_t} S_t[k] \le 0.85 \times \sum_{k=0}^{N-1} S_t[k]
(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitude spectra of successive frames

SF_t = \sum_{k=0}^{N-1} \left( N_t[k] - N_{t-1}[k] \right)^2

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and the (t-1)-th frame, respectively.
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized Audio Spectral Envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained by summing the power spectrum within each logarithmic subband. Each ASE coefficient is then normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most commonly used method to integrate the short-term features. Let x_i = [x_i[0], x_i[1], …, x_i[D-1]]^T denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:

\mu[d] = \frac{1}{T} \sum_{i=0}^{T-1} x_i[d],  0 \le d \le D-1

\sigma[d] = \left( \frac{1}{T} \sum_{i=0}^{T-1} \big( x_i[d] - \mu[d] \big)^2 \right)^{1/2},  0 \le d \le D-1

where T is the number of frames of the input signal. This statistical method captures no information about the relationship between features, nor about the time-varying behavior of music signals.
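As a hedged illustration (the variable names are mine, not the thesis's), the mean/standard-deviation integration above amounts to two NumPy reductions over a frame-by-feature matrix:

```python
import numpy as np

X = np.random.randn(500, 20)      # T = 500 frames, D = 20 short-term features
mu = X.mean(axis=0)               # mean of each feature over time
sigma = X.std(axis=0)             # standard deviation of each feature over time
long_term_vector = np.concatenate([mu, sigma])   # 2D-dimensional song-level feature
```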
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model. The extracted feature vector includes the mean and variance of all short-term feature vectors as well as the coefficients of each AR model. In MAR, all short-term features are modeled jointly by a MAR model. The difference between the MAR and DAR models is that MAR considers the relationships between features. The features used in MAR include the mean vector and the covariance matrix of all short-term feature vectors and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximizing the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA, each class is generally modeled by a single Gaussian distribution. In fact, a music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not take class-wise differences into account.
122 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical framed based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance with/without a decision-tree classifier for a Gaussian classifier, a GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and decision-tree classifier achieves the best accuracy of 82.79%.
Xu et al [29] applied SVM to discriminate between pure music and vocal music
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy reaches up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91 (artificial and natural
sounds) for the second level is 99 (instrumental and automobile human and
nonhuman) and 95 for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband, the WPT decomposes both
bands at each level
Bergstra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
\hat{s}[n] = s[n] - a \cdot s[n-1]   (1)

where s[n] is the current sample and s[n-1] is the previous sample; a typical value for a is 0.95.
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
\tilde{s}_i[n] = \hat{s}_i[n] \cdot w[n],  0 \le n \le N-1   (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right),  0 \le n \le N-1   (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2\pi n k / N},  0 \le k \le N-1   (4)

where k is the frequency index.
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k],  0 \le b < B,  0 \le k \le N-1   (5)

where B is the total number of filters (B is 25 in this study), and I_b^l and I_b^h denote, respectively, the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

I_b^l = \frac{f_b^l}{f_s / N},  \qquad I_b^h = \frac{f_b^h}{f_s / N}   (6)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 21.
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\!\big( E_i(b) \big) \cos\!\left( \frac{\pi l (b + 0.5)}{B} \right),  0 \le l < L   (7)

where L is the length of the MFCC feature vector (L is 20 in this study). Therefore the MFCC feature vector can be represented as follows:

x^{MFCC} = [MFCC(0), MFCC(1), …, MFCC(L-1)]^T   (8)
Fig 21 The flowchart for computing MFCC (pre-emphasis, framing, windowing, FFT, Mel-scale band-pass filtering, DCT)
Table 21 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
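The six steps above can be summarized in a minimal NumPy sketch. This is an illustrative reading of Eqs. (1)-(8), not the thesis's implementation: the band edges follow Table 21, the bands are summed rectangularly as in Eq. (5), and the sampling rate and frame content in the usage line are placeholders.

```python
import numpy as np

# Band edges (Hz) from Table 21.
MEL_BANDS = [(0, 200), (100, 300), (200, 400), (300, 500), (400, 600), (500, 700),
             (600, 800), (700, 900), (800, 1000), (900, 1149), (1000, 1320),
             (1149, 1516), (1320, 1741), (1516, 2000), (1741, 2297), (2000, 2639),
             (2297, 3031), (2639, 3482), (3031, 4000), (3482, 4595), (4000, 5278),
             (4595, 6063), (5278, 6964), (6063, 8000), (6964, 9190)]

def mfcc_frame(frame, fs, a=0.95, L=20):
    """MFCC of one frame following Steps 1-6 (rectangular band summation as in Eq. (5))."""
    N = len(frame)
    emphasized = np.append(frame[0], frame[1:] - a * frame[:-1])    # Step 1: pre-emphasis
    windowed = emphasized * np.hamming(N)                           # Step 3: Hamming window
    A = np.abs(np.fft.fft(windowed)) ** 2                           # Step 4: squared spectrum
    E = []
    for f_lo, f_hi in MEL_BANDS:                                    # Step 5: band energies
        i_lo, i_hi = int(f_lo / (fs / N)), int(f_hi / (fs / N))
        E.append(np.sum(A[i_lo:i_hi + 1]))
    logE = np.log10(np.maximum(E, 1e-12))
    B = len(logE)
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    dct_basis = np.cos(np.pi * l * (b + 0.5) / B)                   # Step 6: DCT, Eq. (7)
    return dct_basis @ logE                                         # Eq. (8)

mfcc = mfcc_frame(np.random.randn(1024), fs=22050)
```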
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k],  0 \le b < B,  0 \le k \le N-1   (9)

where B is the number of subbands, and I_b^l and I_b^h denote, respectively, the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_b^l and I_b^h are given as

I_b^l = \frac{f_b^l}{f_s / N},  \qquad I_b^h = \frac{f_b^h}{f_s / N}   (10)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.
Step 3 Peak Valley Selection
Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:

Peak(b) = \log\!\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i} \right)   (11)

Valley(b) = \log\!\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b - i + 1} \right)   (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)   (13)

The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x^{OSC} = [Valley(0), …, Valley(B-1), SC(0), …, SC(B-1)]^T   (14)
Fig 22 The flowchart for computing OSC (framing, FFT, octave-scale filtering, peak/valley selection, spectral contrast)
Table 22 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
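A compact sketch of Steps 2 and 3 is given below. It is a hedged illustration of Eqs. (9)-(14) only: the band edges come from Table 22 (the degenerate [0, 0] filter is skipped), while the function name, the epsilon guard, and the usage line are mine.

```python
import numpy as np

OCTAVE_BANDS = [(0, 100), (100, 200), (200, 400), (400, 800), (800, 1600),
                (1600, 3200), (3200, 6400), (6400, 12800), (12800, 22050)]

def osc_frame(frame, fs, alpha=0.2):
    """Peak, valley and contrast of each octave subband, following Eqs. (11)-(14)."""
    N = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hamming(N)))
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    valleys, contrasts = [], []
    for f_lo, f_hi in OCTAVE_BANDS:
        band = np.sort(mag[(freqs > f_lo) & (freqs <= f_hi)])[::-1]   # descending order
        n = max(1, int(round(alpha * len(band))))                     # alpha-neighborhood
        peak = np.log(np.mean(band[:n]) + 1e-12)                      # Eq. (11)
        valley = np.log(np.mean(band[-n:]) + 1e-12)                   # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                               # Eq. (13)
    return np.array(valleys + contrasts)                              # Eq. (14)

x_osc = osc_frame(np.random.randn(2048), fs=44100)
```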
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):

P(k) = \begin{cases} \dfrac{1}{N E_w} |X(k)|^2, & k = 0 \text{ or } k = N/2 \\ \dfrac{2}{N E_w} |X(k)|^2, & 0 < k < N/2 \end{cases}   (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2   (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig 24). The NASE scale filtering operation can be described as follows (see Table 23):

ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k),  0 \le b < B,  0 \le k \le N-1   (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j \text{ octaves},  -4 \le j \le 3   (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_b^l = \frac{f_b^l}{f_s / N},  \qquad I_b^h = \frac{f_b^h}{f_s / N}   (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k),  0 \le b \le B+1   (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10 \log_{10}\!\big( ASE(b) \big),  0 \le b \le B+1   (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R},  0 \le b \le B+1   (22)

where the RMS-norm gain value R is defined as

R = \sqrt{ \sum_{b=0}^{B+1} \big( ASE_{dB}(b) \big)^2 }   (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, one coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension of NASE is B+3. Thus the NASE feature vector of an audio frame can be represented as follows:

x^{NASE} = [R, NASE(0), NASE(1), …, NASE(B+1)]^T   (24)
Fig 23 The flowchart for computing NASE (framing, windowing, FFT, subband decomposition, normalization)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 in-band coefficients between 62.5 Hz and 16 kHz, and one coefficient above hiEdge = 16 kHz)
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
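The NASE computation of Eqs. (15)-(24) can be sketched as follows. This is only a minimal reading of the steps above: the band edges are those of Table 23, while the function name, the epsilon guard, and the usage line are assumptions of mine.

```python
import numpy as np

NASE_BANDS = [(0, 62), (62, 88), (88, 125), (125, 176), (176, 250), (250, 353),
              (353, 500), (500, 707), (707, 1000), (1000, 1414), (1414, 2000),
              (2000, 2828), (2828, 4000), (4000, 5656), (5656, 8000),
              (8000, 11313), (11313, 16000), (16000, 22050)]

def nase_frame(frame, fs):
    """NASE vector [R, NASE(0), ..., NASE(B+1)] of one frame, following Eqs. (15)-(24)."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                                    # window energy, Eq. (16)
    X = np.fft.rfft(frame * w)
    P = (2.0 / (N * Ew)) * np.abs(X) ** 2                  # Eq. (15)
    P[0] /= 2.0                                            # DC and Nyquist bins are not doubled
    P[-1] /= 2.0
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    ase = np.array([np.sum(P[(freqs > lo) & (freqs <= hi)]) for lo, hi in NASE_BANDS])
    ase_db = 10.0 * np.log10(np.maximum(ase, 1e-12))       # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                       # Eq. (23)
    return np.concatenate([[R], ase_db / R])               # Eqs. (22) and (24)

x_nase = nase_frame(np.random.randn(2048), fs=44100)
```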
214 Modulation Spectral Analysis
MFCC, OSC, and NASE capture only short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we employ modulation spectral analysis on the MFCC, OSC, and NASE trajectories to observe the variations of the sound.
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i(l), 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{tW+n}(l) \, e^{-j 2\pi m n / W},  0 \le m < W,  0 \le l < L   (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|,  0 \le m < W,  0 \le l < L   (26)

where T is the total number of texture windows in the music track.
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l)   (27)

MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l)   (28)

where Φ_j^l and Φ_j^h are, respectively, the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)   (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
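A minimal sketch of the modulation spectral analysis of Eqs. (25)-(29) is shown below. It assumes a frame-level feature matrix (MFCC, OSC or NASE trajectories), W = 512 frames per texture window with 50% overlap, and the modulation subband boundaries of Table 24; all names are illustrative.

```python
import numpy as np

W = 512                 # texture window length (frames), as in the text
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def msc_msv(features):
    """features: (num_frames, L) matrix of frame-level feature trajectories.
    Returns (L, J) MSC and MSV matrices following Eqs. (25)-(29)."""
    num_frames, L = features.shape
    hop = W // 2                                            # 50% overlap of texture windows
    mags = []
    for start in range(0, num_frames - W + 1, hop):
        segment = features[start:start + W, :]              # one texture window
        mags.append(np.abs(np.fft.fft(segment, axis=0)))    # FFT along each time trajectory
    M_bar = np.mean(mags, axis=0)                           # time-averaged modulation spectrogram
    msc = np.zeros((L, len(SUBBANDS)))
    msv = np.zeros((L, len(SUBBANDS)))
    for j, (lo, hi) in enumerate(SUBBANDS):
        band = M_bar[lo:hi, :]                              # modulation subband j
        msv[:, j] = band.min(axis=0)                        # Eq. (28)
        msc[:, j] = band.max(axis=0) - band.min(axis=0)     # Eqs. (27) and (29)
    return msc, msv

msc, msv = msc_msv(np.random.randn(3000, 20))               # e.g., 3000 frames of 20-dim MFCC
```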
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i(d), 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{tW+n}(d) \, e^{-j 2\pi m n / W},  0 \le m < W,  0 \le d < D   (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W,  0 \le d < D   (31)

where T is the total number of texture windows in the music track.
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d)   (32)

MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d)   (33)

where Φ_j^l and Φ_j^h are, respectively, the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)   (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i(d), 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{tW+n}(d) \, e^{-j 2\pi m n / W},  0 \le m < W,  0 \le d < D   (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W,  0 \le d < D   (36)

where T is the total number of texture windows in the music track.
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 24). In this study the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d)   (37)

MSV^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d)   (38)

where Φ_j^l and Φ_j^h are, respectively, the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)   (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.
Fig 27 the flowchart for extracting MASE (framing, NASE extraction, DFT along each feature trajectory, windowed averaging of the modulation spectrum, contrast/valley determination)
Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value across different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC,row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)   (40)

\sigma_{MSC,row}^{MFCC}(l) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{MFCC}(j, l) - \mu_{MSC,row}^{MFCC}(l) \big)^2 \Big)^{1/2}   (41)

\mu_{MSV,row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)   (42)

\sigma_{MSV,row}^{MFCC}(l) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{MFCC}(j, l) - \mu_{MSV,row}^{MFCC}(l) \big)^2 \Big)^{1/2}   (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC,row}^{MFCC}(0), \sigma_{MSC,row}^{MFCC}(0), \mu_{MSV,row}^{MFCC}(0), \sigma_{MSV,row}^{MFCC}(0), …, \mu_{MSV,row}^{MFCC}(L-1), \sigma_{MSV,row}^{MFCC}(L-1)]^T   (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)   (45)

\sigma_{MSC,col}^{MFCC}(j) = \Big( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSC^{MFCC}(j, l) - \mu_{MSC,col}^{MFCC}(j) \big)^2 \Big)^{1/2}   (46)

\mu_{MSV,col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)   (47)

\sigma_{MSV,col}^{MFCC}(j) = \Big( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSV^{MFCC}(j, l) - \mu_{MSV,col}^{MFCC}(j) \big)^2 \Big)^{1/2}   (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC,col}^{MFCC}(0), \sigma_{MSC,col}^{MFCC}(0), \mu_{MSV,col}^{MFCC}(0), \sigma_{MSV,col}^{MFCC}(0), …, \mu_{MSV,col}^{MFCC}(J-1), \sigma_{MSV,col}^{MFCC}(J-1)]^T   (49)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) is obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T (f_{col}^{MFCC})^T]^T   (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
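A short sketch of this statistical aggregation is given below. It is a hedged reading of Eqs. (40)-(50): the ordering of the concatenation is illustrative, and the function names are mine.

```python
import numpy as np

def aggregate(msc, msv):
    """Row/column mean-and-std aggregation of the (L, J) MSC and MSV matrices,
    following Eqs. (40)-(50)."""
    def stats(mat, axis):
        return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])
    f_row = np.concatenate([stats(msc, axis=1), stats(msv, axis=1)])   # size 4L
    f_col = np.concatenate([stats(msc, axis=0), stats(msv, axis=0)])   # size 4J
    return np.concatenate([f_row, f_col])                              # size 4L + 4J

smmfcc = aggregate(np.random.rand(20, 8), np.random.rand(20, 8))       # 4*20 + 4*8 = 112
```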
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC,row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)

\sigma_{MSC,row}^{OSC}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{OSC}(j, d) - \mu_{MSC,row}^{OSC}(d) \big)^2 \Big)^{1/2}   (52)

\mu_{MSV,row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)

\sigma_{MSV,row}^{OSC}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{OSC}(j, d) - \mu_{MSV,row}^{OSC}(d) \big)^2 \Big)^{1/2}   (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC,row}^{OSC}(0), \sigma_{MSC,row}^{OSC}(0), \mu_{MSV,row}^{OSC}(0), \sigma_{MSV,row}^{OSC}(0), …, \mu_{MSV,row}^{OSC}(D-1), \sigma_{MSV,row}^{OSC}(D-1)]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)

\sigma_{MSC,col}^{OSC}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{OSC}(j, d) - \mu_{MSC,col}^{OSC}(j) \big)^2 \Big)^{1/2}   (57)

\mu_{MSV,col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)

\sigma_{MSV,col}^{OSC}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{OSC}(j, d) - \mu_{MSV,col}^{OSC}(j) \big)^2 \Big)^{1/2}   (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC,col}^{OSC}(0), \sigma_{MSC,col}^{OSC}(0), \mu_{MSV,col}^{OSC}(0), \sigma_{MSV,col}^{OSC}(0), …, \mu_{MSV,col}^{OSC}(J-1), \sigma_{MSV,col}^{OSC}(J-1)]^T   (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) is obtained:

f^{OSC} = [(f_{row}^{OSC})^T (f_{col}^{OSC})^T]^T   (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC,row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC,row}^{NASE}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{NASE}(j, d) - \mu_{MSC,row}^{NASE}(d) \big)^2 \Big)^{1/2}   (63)

\mu_{MSV,row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV,row}^{NASE}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{NASE}(j, d) - \mu_{MSV,row}^{NASE}(d) \big)^2 \Big)^{1/2}   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC,row}^{NASE}(0), \sigma_{MSC,row}^{NASE}(0), \mu_{MSV,row}^{NASE}(0), \sigma_{MSV,row}^{NASE}(0), …, \mu_{MSV,row}^{NASE}(D-1), \sigma_{MSV,row}^{NASE}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC,col}^{NASE}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{NASE}(j, d) - \mu_{MSC,col}^{NASE}(j) \big)^2 \Big)^{1/2}   (68)

\mu_{MSV,col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV,col}^{NASE}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{NASE}(j, d) - \mu_{MSV,col}^{NASE}(j) \big)^2 \Big)^{1/2}   (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC,col}^{NASE}(0), \sigma_{MSC,col}^{NASE}(0), \mu_{MSV,col}^{NASE}(0), \sigma_{MSV,col}^{NASE}(0), …, \mu_{MSV,col}^{NASE}(J-1), \sigma_{MSV,col}^{NASE}(J-1)]^T   (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) is obtained:

f^{NASE} = [(f_{row}^{NASE})^T (f_{col}^{NASE})^T]^T   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 the row-based modulation spectral feature values: the mean and standard deviation are computed along each row of the MSC/MSV matrix, i.e., over the modulation subbands of one feature dimension

Fig 29 the column-based modulation spectral feature values: the mean and standard deviation are computed along each column of the MSC/MSV matrix, i.e., over the feature dimensions of one modulation subband
216 Feature Vector Normalization
In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to obtain the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)},  1 \le c \le C   (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote, respectively, the maximum and minimum of the m-th feature value over all training music signals:

f_{\max}(m) = \max_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m),  \qquad f_{\min}(m) = \min_{1 \le c \le C, \; 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among the music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = \mathrm{tr}\big( (A^T S_W A)^{-1} (A^T S_B A) \big)   (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues, so that S_W Φ = Φ Λ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

x_w = (\Phi \Lambda^{-1/2})^T x   (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}) derived from all the whitened training vectors becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi   (80)

A_{WLDA} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x   (81)
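A minimal NumPy sketch of the whitened LDA procedure of Eqs. (76)-(81) follows; the regularization floor on the eigenvalues and the toy data are my assumptions, not part of the thesis.

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """Whitened LDA transform per Eqs. (76)-(80). X: (n_samples, H); y: class labels.
    Returns A_wlda of shape (H, num_classes - 1)."""
    H = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                          # Eq. (76)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)                        # Eq. (77)
    evals, Phi = np.linalg.eigh(Sw)                            # Sw is symmetric
    whiten = Phi @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-10)))   # Phi Lambda^{-1/2}
    Sb_w = whiten.T @ Sb @ whiten                              # whitened between-class scatter
    evals_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(evals_b)[::-1][:num_classes - 1]]  # top C-1 eigenvectors
    return whiten @ Psi                                        # Eq. (80)

X = np.random.randn(120, 112)
y = np.arange(120) % 6
A = whitened_lda(X, y, num_classes=6)
Y = X @ A            # batched form of Eq. (81): reduced (C-1)-dimensional features
```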
23 Music Genre Classification Phase
In the classification phase, the row-based and column-based modulation spectral feature vectors are first extracted from the input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
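The nearest-centroid decision of Eqs. (82)-(83) reduces to a few lines; the names and the toy dimensions below are illustrative.

```python
import numpy as np

def classify(y_vec, centroids):
    """Nearest-centroid decision of Eqs. (82)-(83): centroids is a (C, h) array of
    per-genre mean vectors in the whitened-LDA space."""
    distances = np.linalg.norm(centroids - y_vec, axis=1)   # Euclidean distances
    return int(np.argmin(distances))                        # identified genre index

centroids = np.random.randn(6, 5)        # C = 6 genres, h = C - 1 = 5 dimensions
genre = classify(np.random.randn(5), centroids)
```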
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.
Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
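Eq. (84) can be evaluated directly from a confusion matrix. The sketch below is illustrative only; it assumes the thesis's convention that columns of the confusion matrices give the true class (so the column totals equal the per-class test counts).

```python
import numpy as np

def overall_accuracy(confusion, class_totals):
    """Weighted overall accuracy of Eq. (84). confusion[i, j] = number of tracks of
    true class j predicted as class i; class_totals = number of test tracks per class."""
    per_class = np.diag(confusion) / class_totals            # CA_c
    weights = class_totals / class_totals.sum()              # P_c
    return np.sum(weights * per_class)

# toy example with three classes
conf = np.array([[50, 3, 2], [4, 40, 5], [6, 7, 43]])
print(overall_accuracy(conf, conf.sum(axis=0)))
```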
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table SMMFCC1, SMOSC1, and SMASE1 denote, respectively, the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.

Table 31 Averaged classification accuracy (CA, in %) for row-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64
Table 32 Confusion matrices of the row-based modulation spectral feature vectors (rows: predicted genre, columns: true genre): (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 (number of tracks)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       275       0          2       0          1      19
Electronic      0      91          0       1          7       6
Jazz            6       0         18       0          0       4
MetalPunk       2       3          0      36         20       4
PopRock         4      12          5       8         70      14
World          33       8          1       0          4      75
Total         320     114         26      45        102     122

(a) SMMFCC1 (in %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94     0.00      7.69     0.00      0.98    15.57
Electronic    0.00    79.82      0.00     2.22      6.86     4.92
Jazz          1.88     0.00     69.23     0.00      0.00     3.28
MetalPunk     0.63     2.63      0.00    80.00     19.61     3.28
PopRock       1.25    10.53     19.23    17.78     68.63    11.48
World        10.31     7.02      3.85     0.00      3.92    61.48

(b) SMOSC1 (number of tracks)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       292       1          1       0          2      10
Electronic      1      89          1       2         11      11
Jazz            4       0         19       1          1       6
MetalPunk       0       5          0      32         21       3
PopRock         0      13          3      10         61       8
World          23       6          2       0          6      84
Total         320     114         26      45        102     122

(b) SMOSC1 (in %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25     0.88      3.85     0.00      1.96     8.20
Electronic    0.31    78.07      3.85     4.44     10.78     9.02
Jazz          1.25     0.00     73.08     2.22      0.98     4.92
MetalPunk     0.00     4.39      0.00    71.11     20.59     2.46
PopRock       0.00    11.40     11.54    22.22     59.80     6.56
World         7.19     5.26      7.69     0.00      5.88    68.85

(c) SMASE1 (number of tracks)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       286       3          1       0          3      18
Electronic      0      87          1       1          9       5
Jazz            5       4         17       0          0       9
MetalPunk       0       4          1      36         18       4
PopRock         1      10          3       7         68      13
World          28       6          3       1          4      73
Total         320     114         26      45        102     122

(c) SMASE1 (in %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38     2.63      3.85     0.00      2.94    14.75
Electronic    0.00    76.32      3.85     2.22      8.82     4.10
Jazz          1.56     3.51     65.38     0.00      0.00     7.38
MetalPunk     0.00     3.51      3.85    80.00     17.65     3.28
PopRock       0.31     8.77     11.54    15.56     66.67    10.66
World         8.75     5.26     11.54     2.22      3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       300       0          1       0          0       9
Electronic      0      96          1       1          9       9
Jazz            2       1         21       0          0       1
MetalPunk       0       1          0      34          8       1
PopRock         1       9          2       9         80      16
World          17       7          1       1          5      86
Total         320     114         26      45        102     122

(d) SMMFCC1+SMOSC1+SMASE1 (in %)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75     0.00      3.85     0.00      0.00     7.38
Electronic    0.00    84.21      3.85     2.22      8.82     7.38
Jazz          0.63     0.88     80.77     0.00      0.00     0.82
MetalPunk     0.00     0.88      0.00    75.56      7.84     0.82
PopRock       0.31     7.89      7.69    20.00     78.43    13.11
World         5.31     6.14      3.85     2.22      4.90    70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From Table 33 we can see
that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2
which is different from the row-based case As in the row-based case the combined
feature vector again gives the best performance Table 34 shows the corresponding
confusion matrices
Table 33 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector

Feature Set                        CA (%)
SMMFCC2                            70.64
SMOSC2                             68.59
SMASE2                             71.74
SMMFCC2+SMOSC2+SMASE2              78.60
Table 34 Confusion matrices of column-based modulation spectral feature vectors (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+SMOSC2+SMASE2

(a) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1        1       0         6       22
Electronic       0       84        0       2         8        4
Jazz            13        1       19       1         2       19
MetalPunk        2        7        0      39        30        4
PopRock          0       11        3       3        47       19
World           33       10        3       0         9       54
Total          320      114       26      45       102      122

(a) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.00      0.88     3.85     0.00      5.88   18.03
Electronic    0.00     73.68     0.00     4.44      7.84    3.28
Jazz          4.06      0.88    73.08     2.22      1.96   15.57
MetalPunk     0.63      6.14     0.00    86.67     29.41    3.28
PopRock       0.00      9.65    11.54     6.67     46.08   15.57
World        10.31      8.77    11.54     0.00      8.82   44.26

(b) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2        0       0         3       33
Electronic       0       83        0       1         9        6
Jazz            17        1       20       0         6       20
MetalPunk        1        5        0      33        21        2
PopRock          0       17        4      10        51       10
World           40        6        2       1        12       51
Total          320      114       26      45       102      122

(b) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      81.88      1.75     0.00     0.00      2.94   27.05
Electronic    0.00     72.81     0.00     2.22      8.82    4.92
Jazz          5.31      0.88    76.92     0.00      5.88   16.39
MetalPunk     0.31      4.39     0.00    73.33     20.59    1.64
PopRock       0.00     14.91    15.38    22.22     50.00    8.20
World        12.50      5.26     7.69     2.22     11.76   41.80

(c) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0        0       0         2       29
Electronic       0       83        0       1         5        2
Jazz             9        3       17       1         2       15
MetalPunk        1        5        1      35        24        7
PopRock          2       13        1       8        57       15
World           31       10        7       0        12       54
Total          320      114       26      45       102      122

(c) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      86.56      0.00     0.00     0.00      1.96   23.77
Electronic    0.00     72.81     0.00     2.22      4.90    1.64
Jazz          2.81      2.63    65.38     2.22      1.96   12.30
MetalPunk     0.31      4.39     3.85    77.78     23.53    5.74
PopRock       0.63     11.40     3.85    17.78     55.88   12.30
World         9.69      8.77    26.92     0.00     11.76   44.26

(d) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5        0       0         3       18
Electronic       0       89        0       2         4        4
Jazz             2        3       19       0         1       10
MetalPunk        2        2        0      38        21        2
PopRock          0       12        5       4        61       11
World           27        3        2       1        12       77
Total          320      114       26      45       102      122

(d) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      90.31      4.39     0.00     0.00      2.94   14.75
Electronic    0.00     78.07     0.00     4.44      3.92    3.28
Jazz          0.63      2.63    73.08     0.00      0.98    8.20
MetalPunk     0.63      1.75     0.00    84.44     20.59    1.64
PopRock       0.00     10.53    19.23     8.89     59.80    9.02
World         8.44      2.63     7.69     2.22     11.76   63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table 31 and Table 33 we can see that
the combined feature vectors achieve better classification performance than each
individual row-based or column-based feature vector In particular the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32% Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                        CA (%)
SMMFCC3                            80.38
SMOSC3                             81.34
SMASE3                             81.21
SMMFCC3+SMOSC3+SMASE3              85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        1       0         3       19
Electronic       0       86        0       1         7        5
Jazz             2        0       18       0         0        3
MetalPunk        1        4        0      35        18        2
PopRock          1       16        4       8        67       13
World           16        6        3       1         7       80
Total          320      114       26      45       102      122

(a) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     3.85     0.00      2.94   15.57
Electronic    0.00     75.44     0.00     2.22      6.86    4.10
Jazz          0.63      0.00    69.23     0.00      0.00    2.46
MetalPunk     0.31      3.51     0.00    77.78     17.65    1.64
PopRock       0.31     14.04    15.38    17.78     65.69   10.66
World         5.00      5.26    11.54     2.22      6.86   65.57

(b) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        0       0         1       13
Electronic       0       90        1       2         9        6
Jazz             0        0       21       0         0        4
MetalPunk        0        2        0      31        21        2
PopRock          0       11        3      10        64       10
World           20       11        1       2         7       87
Total          320      114       26      45       102      122

(b) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     0.00     0.00      0.98   10.66
Electronic    0.00     78.95     3.85     4.44      8.82    4.92
Jazz          0.00      0.00    80.77     0.00      0.00    3.28
MetalPunk     0.00      1.75     0.00    68.89     20.59    1.64
PopRock       0.00      9.65    11.54    22.22     62.75    8.20
World         6.25      9.65     3.85     4.44      6.86   71.31

(c) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2        1       0         0       17
Electronic       1       91        0       1         4        3
Jazz             0        2       19       0         0        5
MetalPunk        0        2        1      34        20        8
PopRock          2       13        4       8        71        8
World           21        4        1       2         7       81
Total          320      114       26      45       102      122

(c) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      92.50      1.75     3.85     0.00      0.00   13.93
Electronic    0.31     79.82     0.00     2.22      3.92    2.46
Jazz          0.00      1.75    73.08     0.00      0.00    4.10
MetalPunk     0.00      1.75     3.85    75.56     19.61    6.56
PopRock       0.63     11.40    15.38    17.78     69.61    6.56
World         6.56      3.51     3.85     4.44      6.86   66.39

(d) Number of tracks
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        0       0         0        8
Electronic       2       95        0       2         7        9
Jazz             1        1       20       0         0        0
MetalPunk        0        0        0      35        10        1
PopRock          1       10        3       7        79       11
World           16        6        3       1         6       93
Total          320      114       26      45       102      122

(d) Classification accuracy (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     0.00     0.00      0.00    6.56
Electronic    0.63     83.33     0.00     4.44      6.86    7.38
Jazz          0.31      0.88    76.92     0.00      0.00    0.00
MetalPunk     0.00      0.00     0.00    77.78      9.80    0.82
PopRock       0.31      8.77    11.54    15.56     77.45    9.02
World         5.00      5.26    11.54     2.22      5.88   76.23
Conventional methods use the energy of each modulation subband as the
feature value In contrast we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature values Table 37 shows the classification results of these two
approaches From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) features

Feature Set                        MSCs & MSVs    MSE
SMMFCC1                               77.50      72.02
SMMFCC2                               70.64      69.82
SMMFCC3                               80.38      79.15
SMOSC1                                79.15      77.50
SMOSC2                                68.59      70.51
SMOSC3                                81.34      80.11
SMASE1                                77.78      76.41
SMASE2                                71.74      71.06
SMASE3                                81.21      79.15
SMMFCC1+SMOSC1+SMASE1                 84.64      85.08
SMMFCC2+SMOSC2+SMASE2                 78.60      79.01
SMMFCC3+SMOSC3+SMASE3                 85.32      85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of New Music Research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp 117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on Signal Processing Vol 52 No 10
pp 3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire "A decision-theoretic generalization of
on-line learning and an application to boosting" Journal of Computer and System
Sciences 55 (1) (1997) 119-139
121 Feature Extraction
1211 Short-term Features
The most important aspect of music genre classification is to determine which
features are relevant and how to extract them Tzanetakis and Cook [1] employed
three feature sets including timbral texture rhythmic content and pitch content to
classify audio collections by their musical genres
12111 Timbral features
Timbral features are generally characterized by the properties related to
instrumentations or sound sources such as music speech or environment signals The
features used to represent timbral texture are described as follows
(1) Low-energy Feature it is defined as the percentage of analysis windows that
have RMS energy less than the average RMS energy across the texture window The
size of texture window should correspond to the minimum amount of time required to
identify a particular music texture
(2) Zero-Crossing Rate (ZCR) ZCR provides a measure of noisiness of the signal It
is defined as
ZCR_t = \frac{1}{2} \sum_{n=1}^{N-1} \big| \mathrm{sign}(x_t[n]) - \mathrm{sign}(x_t[n-1]) \big|
where the sign function will return 1 for positive input and 0 for negative input and
xt[n] is the time domain signal for frame t
(3) Spectral Centroid spectral centroid is defined as the center of gravity of the
magnitude spectrum
C_t = \frac{\sum_{n=1}^{N} n \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth spectral bandwidth determines the frequency bandwidth of
the signal
SB_t = \sqrt{\frac{\sum_{n=1}^{N} (n - C_t)^2 \cdot M_t[n]}{\sum_{n=1}^{N} M_t[n]}}
(5) Spectral Roll-off spectral roll-off is a measure of spectral shape It is defined as
the frequency R_t below which 85% of the magnitude distribution is concentrated

\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]
(6) Spectral Flux The spectral flux measures the amount of local spectral change It
is defined as the squared difference between the normalized magnitudes of successive
spectral distributions
SF_t = \sum_{k=0}^{N-1} \big( N_t[k] - N_{t-1}[k] \big)^2

where N_t[k] and N_{t-1}[k] are the normalized magnitude spectra of the t-th frame and
the (t-1)-th frame respectively
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope (NASE) NASE is defined in the MPEG-7
standard [17] First the audio spectral envelope (ASE) is obtained from the sum of the
log power spectrum in each logarithmic subband Then each ASE coefficient is
normalized with the Root Mean Square (RMS) energy yielding a normalized version
of the ASE called NASE
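For concreteness, a minimal numpy sketch of the frame-level timbral features (3)-(6) above is given below. It is only an illustration (not code from the thesis) and assumes the magnitude spectra of the current and previous frames are already available.

```python
import numpy as np

def timbral_features(mag, prev_mag, rolloff_ratio=0.85):
    """mag, prev_mag: magnitude spectra of the current and previous frame."""
    n = np.arange(1, len(mag) + 1)
    centroid = np.sum(n * mag) / np.sum(mag)                                 # spectral centroid
    bandwidth = np.sqrt(np.sum(((n - centroid) ** 2) * mag) / np.sum(mag))   # spectral bandwidth
    cum = np.cumsum(mag)
    rolloff = np.searchsorted(cum, rolloff_ratio * cum[-1])                  # spectral roll-off index R_t
    norm, prev_norm = mag / np.sum(mag), prev_mag / np.sum(prev_mag)
    flux = np.sum((norm - prev_norm) ** 2)                                   # spectral flux
    return centroid, bandwidth, rolloff, flux
```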
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers The main difference is that no fundamental frequency
chord key or other high-level feature has to be determined in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let x_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation are
calculated as follows

\mu[d] = \frac{1}{T} \sum_{i=0}^{T-1} x_i[d],  0 \le d < D

\sigma[d] = \left[ \frac{1}{T} \sum_{i=0}^{T-1} \big( x_i[d] - \mu[d] \big)^2 \right]^{1/2},  0 \le d < D
where T is the number of frames of the input signal This statistical method captures
neither the relationship between features nor the time-varying behavior of music
signals
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p × D × D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximizing the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complex to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
does not take the class-wise differences into account
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir Orchestra Piano and String Quartet In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quartet and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical frame-based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum and Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance with and without a decision tree
classifier for a Gaussian classifier a GMM with three components and LDA In their
experiment the feature vector with the GMM classifier and decision tree classifier has
the best accuracy of 82.79%
Xu et al [29] applied SVM to discriminate between pure music and vocal music
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] used some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 93.0% for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can reach up to 88.60% when the frame length is 30 s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high-dissimilarity nodes The experimental results show that
when the LDB feature vector is combined with MFCC and LDA analysis is used
the average classification accuracy is 91% for the first level (artificial and natural
sounds) 99% for the second level (instrumental and automobile human and
nonhuman) and 95% for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low-pass and high-pass filters Unlike DWT
which recursively decomposes only the low-pass subband the WPT decomposes both
subbands at each level
Bergstra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
\hat{s}[n] = s[n] - a \times s[n-1]                    (1)

where s[n] is the current sample and s[n-1] is the previous sample a typical
value for a is 0.95
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
\tilde{s}_i[n] = \hat{s}_i[n] \, w[n],  0 \le n < N                    (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46 \cos\!\left( \frac{2 \pi n}{N-1} \right),  0 \le n < N                    (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n] \, e^{-j 2 \pi k n / N},  0 \le k < N                    (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k],  0 \le b < B,  0 \le k < N                    (5)

where B is the total number of filters (B is 25 in the study) I_b^l and I_b^h
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter and A_i[k] is the squared amplitude of X_i[k] that is
A_i[k] = |X_i[k]|^2
I_b^l and I_b^h are given as

I_b^l = \frac{f_b^l}{f_s / N},  I_b^h = \frac{f_b^h}{f_s / N}                    (6)

where f_s is the sampling frequency f_b^l and f_b^h are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC(l) = \sum_{b=0}^{B-1} \log_{10}\!\big(1 + E(b)\big) \cos\!\left( \frac{l (b + 0.5) \pi}{B} \right),  0 \le l < L                    (7)
where L is the length of MFCC feature vector (L is 20 in the study)
Therefore the MFCC feature vector can be represented as follows
x^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T                    (8)
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter

Filter number    Frequency interval (Hz)
0                (0 200]
1                (100 300]
2                (200 400]
3                (300 500]
4                (400 600]
5                (500 700]
6                (600 800]
7                (700 900]
8                (800 1000]
9                (900 1149]
10               (1000 1320]
11               (1149 1516]
12               (1320 1741]
13               (1516 2000]
14               (1741 2297]
15               (2000 2639]
16               (2297 3031]
17               (2639 3482]
18               (3031 4000]
19               (3482 4595]
20               (4000 5278]
21               (4595 6063]
22               (5278 6964]
23               (6063 8000]
24               (6964 9190]
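A small Python sketch of the MFCC procedure in Eqs (1)-(8) is given below. It is an illustration rather than the exact implementation of this thesis: the (f_b^l, f_b^h) band edges of Table 21, the sampling rate and the frame/hop sizes are assumed to be supplied by the caller, and each filter is applied as a plain summation over its band as in Eq (5).

```python
import numpy as np

def mfcc_frames(signal, band_edges_hz, fs=22050, frame_len=512, hop=256, a=0.95, L=20):
    """Return one L-dimensional MFCC vector per frame (sketch of Eqs (1)-(8))."""
    s = np.append(signal[0], signal[1:] - a * signal[:-1])     # Step 1: pre-emphasis, Eq. (1)
    window = np.hamming(frame_len)                             # Step 3: Hamming window, Eq. (3)
    B = len(band_edges_hz)
    l_idx = np.arange(L)[:, None]
    b_idx = np.arange(B)[None, :]
    dct_basis = np.cos(l_idx * (b_idx + 0.5) * np.pi / B)      # Step 6: DCT basis of Eq. (7)
    feats = []
    for start in range(0, len(s) - frame_len + 1, hop):        # Step 2: framing
        frame = s[start:start + frame_len] * window
        A = np.abs(np.fft.fft(frame)) ** 2                     # Step 4: squared magnitude, Eq. (4)
        E = np.empty(B)
        for b, (f_lo, f_hi) in enumerate(band_edges_hz):       # Step 5: band energies, Eqs. (5)-(6)
            k_lo, k_hi = int(f_lo / (fs / frame_len)), int(f_hi / (fs / frame_len))
            E[b] = np.sum(A[k_lo:k_hi + 1])
        feats.append(dct_basis @ np.log10(1.0 + E))            # Eq. (7)
    return np.array(feats)
```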
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E_i(b) = \sum_{k=I_b^l}^{I_b^h} A_i[k],  0 \le b < B,  0 \le k < N                    (9)

where B is the number of subbands I_b^l and I_b^h denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter and
A_i[k] is the squared amplitude of X_i[k] that is A_i[k] = |X_i[k]|^2
I_b^l and I_b^h are given as

I_b^l = \frac{f_b^l}{f_s / N},  I_b^h = \frac{f_b^h}{f_s / N}                    (10)

where f_s is the sampling frequency f_b^l and f_b^h are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
Peak(b) = \log\!\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i} \right)                    (11)

Valley(b) = \log\!\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b - i + 1} \right)                    (12)

where \alpha is a neighborhood factor (\alpha is 0.2 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley

SC(b) = Peak(b) - Valley(b)                    (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
x^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T                    (14)
Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave scale filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number    Frequency interval (Hz)
0                [0 0]
1                (0 100]
2                (100 200]
3                (200 400]
4                (400 800]
5                (800 1600]
6                (1600 3200]
7                (3200 6400]
8                (6400 12800]
9                (12800 22050)
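The following sketch (an illustration, not the thesis implementation) computes the OSC feature of Eqs (9)-(14) for one frame; it assumes the magnitude spectrum of the frame and the FFT-bin ranges of the Table 22 subbands are given, and a tiny constant guards the logarithm.

```python
import numpy as np

def osc_frame(mag, band_bins, alpha=0.2):
    """mag: magnitude spectrum of one frame; band_bins: (k_lo, k_hi) index pairs per subband."""
    valleys, contrasts = [], []
    for k_lo, k_hi in band_bins:
        mags = np.sort(mag[k_lo:k_hi + 1])[::-1]       # M_b,1 >= M_b,2 >= ... >= M_b,Nb
        m = max(1, int(round(alpha * len(mags))))      # size of the alpha-neighborhood
        peak = np.log(np.mean(mags[:m]) + 1e-12)       # Eq. (11)
        valley = np.log(np.mean(mags[-m:]) + 1e-12)    # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                # Eq. (13): spectral contrast
    return np.array(valleys + contrasts)               # Eq. (14): [Valley(0..B-1), SC(0..B-1)]
```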
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follows
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum denoted X(k), 1 ≤ k ≤ N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
P(k) = \begin{cases} \dfrac{1}{E_w N} \, |X(k)|^2, & k = 0 \ \text{or} \ k = N/2 \\[6pt] \dfrac{2}{E_w N} \, |X(k)|^2, & 0 < k < N/2 \end{cases}                    (15)

where E_w is the energy of the Hamming window function w(n) of size N_w

E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2                    (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a
spectrum of 8 octaves (see Fig 24) The NASE scale filtering
operation can be described as follows (see Table 23)

ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k),  0 \le b < B,  0 \le k < N                    (17)

where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r where r is the spectral resolution of
the frequency subbands ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in the study)

r = 2^{j} \ \text{octaves},  -4 \le j \le 3                    (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th
band-pass filter given as

I_b^l = \frac{f_b^l}{f_s / N},  I_b^h = \frac{f_b^h}{f_s / N}                    (19)

where f_s is the sampling frequency f_b^l and f_b^h are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband

ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k),  0 \le b \le B+1                    (20)

Each ASE coefficient is then converted to the decibel scale

ASE_{dB}(b) = 10 \log_{10}\!\big(ASE(b)\big),  0 \le b \le B+1                    (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R

NASE(b) = \frac{ASE_{dB}(b)}{R},  0 \le b \le B+1                    (22)

where the RMS-norm gain value R is defined as

R = \sqrt{ \sum_{b=0}^{B+1} \big( ASE_{dB}(b) \big)^2 }                    (23)

In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge and the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows

x^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T                    (24)
Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (16 logarithmically spaced coefficients between loEdge = 62.5 Hz and hiEdge = 16 kHz plus one coefficient below loEdge and one above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number    Frequency interval (Hz)
0                (0 62]
1                (62 88]
2                (88 125]
3                (125 176]
4                (176 250]
5                (250 353]
6                (353 500]
7                (500 707]
8                (707 1000]
9                (1000 1414]
10               (1414 2000]
11               (2000 2828]
12               (2828 4000]
13               (4000 5656]
14               (5656 8000]
15               (8000 11313]
16               (11313 16000]
17               (16000 22050]
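A minimal sketch of the NASE computation in Eqs (15)-(24) for a single Hamming-windowed frame is shown below; the subband bin ranges corresponding to Table 23 are assumed to be precomputed, and the small constant before the logarithm is only for numerical safety.

```python
import numpy as np

def nase_frame(frame, band_bins, n_fft=512):
    """frame: one audio frame; band_bins: (k_lo, k_hi) FFT-bin pairs per Table 23 subband."""
    w = np.hamming(len(frame))
    Ew = np.sum(w ** 2)                                    # Eq. (16): window energy
    X = np.fft.fft(frame * w, n_fft)
    P = (2.0 / (Ew * n_fft)) * np.abs(X) ** 2              # Eq. (15), case 0 < k < N/2
    P[0] /= 2.0                                            # k = 0 uses the 1/(Ew*N) factor
    P[n_fft // 2] /= 2.0                                   # k = N/2 uses the 1/(Ew*N) factor
    ase = np.array([np.sum(P[k_lo:k_hi + 1]) for k_lo, k_hi in band_bins])  # Eqs. (17)/(20)
    ase_db = 10.0 * np.log10(ase + 1e-12)                  # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                       # Eq. (23): RMS-norm gain value
    return np.concatenate(([R], ase_db / R))               # Eqs. (22), (24): [R, NASE(0..B+1)]
```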
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of music signals we
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{tW+n}[l] \, e^{-j 2 \pi m n / W},  0 \le m < W,  0 \le l < L                    (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows

M^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|,  0 \le m < W,  0 \le l < L                    (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} M^{MFCC}(m, l)                    (27)

MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} M^{MFCC}(m, l)                    (28)

where \Phi_j^l and \Phi_j^h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 \le j < J
The MSPs correspond to the dominant rhythmic components and the MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)                    (29)

As a result all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2×20×8 = 320
Fig 25 the flowchart for extracting MMFCC
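The texture-window analysis of Eqs (25)-(29) can be sketched as follows; the same routine applies unchanged to the OSC and NASE trajectories of Sections 2142 and 2143. This is illustrative code, not the thesis implementation: `feat` is the frame-by-frame feature matrix (one row per frame) and `subbands` holds the modulation-frequency index ranges of Table 24.

```python
import numpy as np

def modulation_contrast(feat, subbands, W=512):
    """Return the MSC and MSV matrices (each J x D) of Eqs (27)-(29)."""
    hop = W // 2                                           # 50% overlap between texture windows
    T = feat.shape[0] // hop - 1                           # number of full texture windows
    spec = np.zeros((W, feat.shape[1]))
    for t in range(T):
        seg = feat[t * hop: t * hop + W, :]                # one texture window of W frames
        spec += np.abs(np.fft.fft(seg, axis=0))            # Eq. (25): FFT along each trajectory
    spec /= T                                              # Eq. (26): time-averaged magnitude
    msc, msv = [], []
    for m_lo, m_hi in subbands:                            # Table 24 modulation subbands
        band = spec[m_lo:m_hi, :]
        peak, valley = band.max(axis=0), band.min(axis=0)  # Eqs. (27)-(28): MSP and MSV
        msv.append(valley)
        msc.append(peak - valley)                          # Eq. (29): modulation spectral contrast
    return np.array(msc), np.array(msv)
```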
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 \le d < D, be the d-th OSC feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{tW+n}[d] \, e^{-j 2 \pi m n / W},  0 \le m < W,  0 \le d < D                    (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows

M^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W,  0 \le d < D                    (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} M^{OSC}(m, d)                    (32)

MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} M^{OSC}(m, d)                    (33)

where \Phi_j^l and \Phi_j^h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 \le j < J
The MSPs correspond to the dominant rhythmic components and the MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)                    (34)

As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MOSC is 2×20×8 = 320
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 \le d < D, be the d-th NASE feature value of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{tW+n}[d] \, e^{-j 2 \pi m n / W},  0 \le m < W,  0 \le d < D                    (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows

M^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,  0 \le m < W,  0 \le d < D                    (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24)
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated

MSP^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} M^{NASE}(m, d)                    (37)

MSV^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} M^{NASE}(m, d)                    (38)

where \Phi_j^l and \Phi_j^h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 \le j < J
The MSPs correspond to the dominant rhythmic components and the MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)                    (39)

As a result all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MASE is 2×19×8 = 304
Fig 27 The flowchart for extracting MASE (framing → NASE extraction → DFT along each feature trajectory within texture windows → averaged modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0 2)                               [0 0.33)
1                [2 4)                               [0.33 0.66)
2                [4 8)                               [0.66 1.32)
3                [8 16)                              [1.32 2.64)
4                [16 32)                             [2.64 5.28)
5                [32 64)                             [5.28 10.56)
6                [64 128)                            [10.56 21.12)
7                [128 256)                           [21.12 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies which reflects the beat interval of a
music signal (see Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband across different spectral/cepstral feature values (see Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 \le l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows

\mu_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)                    (40)

\sigma_{MSC\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{MFCC}(j, l) - \mu_{MSC\text{-}row}^{MFCC}(l) \big)^2 \right)^{1/2}                    (41)

\mu_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)                    (42)

\sigma_{MSV\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{MFCC}(j, l) - \mu_{MSV\text{-}row}^{MFCC}(l) \big)^2 \right)^{1/2}                    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as

f_{row}^{MFCC} = [\mu_{MSC\text{-}row}^{MFCC}(0), \sigma_{MSC\text{-}row}^{MFCC}(0), \mu_{MSV\text{-}row}^{MFCC}(0), \sigma_{MSV\text{-}row}^{MFCC}(0), \ldots, \mu_{MSC\text{-}row}^{MFCC}(L-1), \sigma_{MSC\text{-}row}^{MFCC}(L-1), \mu_{MSV\text{-}row}^{MFCC}(L-1), \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^T                    (44)

Similarly the modulation spectral feature values derived from the j-th (0 \le j < J)
column of the MSC and MSV matrices can be computed as follows

\mu_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)                    (45)

\sigma_{MSC\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSC^{MFCC}(j, l) - \mu_{MSC\text{-}col}^{MFCC}(j) \big)^2 \right)^{1/2}                    (46)

\mu_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)                    (47)

\sigma_{MSV\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSV^{MFCC}(j, l) - \mu_{MSV\text{-}col}^{MFCC}(j) \big)^2 \right)^{1/2}                    (48)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC\text{-}col}^{MFCC}(0), \sigma_{MSC\text{-}col}^{MFCC}(0), \mu_{MSV\text{-}col}^{MFCC}(0), \sigma_{MSV\text{-}col}^{MFCC}(0), \ldots, \mu_{MSC\text{-}col}^{MFCC}(J-1), \sigma_{MSC\text{-}col}^{MFCC}(J-1), \mu_{MSV\text{-}col}^{MFCC}(J-1), \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^T                    (49)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4L+4J) can be obtained

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T                    (50)

In summary the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the row-based
and column-based modulation spectral feature vectors results in a feature vector of
length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
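A compact sketch of this statistical aggregation is given below. It assumes the MSC and MSV matrices have shape J x D (D = L = 20 for MFCC) and returns the concatenated row-based and column-based statistics; the ordering of the concatenated values is illustrative and differs slightly from Eqs (44) and (49).

```python
import numpy as np

def aggregate_msc_msv(msc, msv):
    """msc, msv: arrays of shape (J, D); returns a vector of length 4*D + 4*J."""
    parts = []
    for mat in (msc, msv):
        parts += [mat.mean(axis=0), mat.std(axis=0)]   # row-based stats, Eqs. (40)-(43)
        parts += [mat.mean(axis=1), mat.std(axis=1)]   # column-based stats, Eqs. (45)-(48)
    return np.concatenate(parts)                       # 4D + 4J = 112 for MFCC (D=20, J=8)
```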
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows

\mu_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)                    (51)

\sigma_{MSC\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{OSC}(j, d) - \mu_{MSC\text{-}row}^{OSC}(d) \big)^2 \right)^{1/2}                    (52)

\mu_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)                    (53)

\sigma_{MSV\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{OSC}(j, d) - \mu_{MSV\text{-}row}^{OSC}(d) \big)^2 \right)^{1/2}                    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_{row}^{OSC} = [\mu_{MSC\text{-}row}^{OSC}(0), \sigma_{MSC\text{-}row}^{OSC}(0), \mu_{MSV\text{-}row}^{OSC}(0), \sigma_{MSV\text{-}row}^{OSC}(0), \ldots, \mu_{MSC\text{-}row}^{OSC}(D-1), \sigma_{MSC\text{-}row}^{OSC}(D-1), \mu_{MSV\text{-}row}^{OSC}(D-1), \sigma_{MSV\text{-}row}^{OSC}(D-1)]^T                    (55)

Similarly the modulation spectral feature values derived from the j-th (0 \le j < J)
column of the MSC and MSV matrices can be computed as follows

\mu_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)                    (56)

\sigma_{MSC\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{OSC}(j, d) - \mu_{MSC\text{-}col}^{OSC}(j) \big)^2 \right)^{1/2}                    (57)

\mu_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)                    (58)

\sigma_{MSV\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{OSC}(j, d) - \mu_{MSV\text{-}col}^{OSC}(j) \big)^2 \right)^{1/2}                    (59)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC\text{-}col}^{OSC}(0), \sigma_{MSC\text{-}col}^{OSC}(0), \mu_{MSV\text{-}col}^{OSC}(0), \sigma_{MSV\text{-}col}^{OSC}(0), \ldots, \mu_{MSC\text{-}col}^{OSC}(J-1), \sigma_{MSC\text{-}col}^{OSC}(J-1), \mu_{MSV\text{-}col}^{OSC}(J-1), \sigma_{MSV\text{-}col}^{OSC}(J-1)]^T                    (60)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T                    (61)

In summary the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the row-based
and column-based modulation spectral feature vectors results in a feature vector of
length 4D+4J That is the overall
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of
the MSC and MSV matrices of MASE can be computed as follows

\mu_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)                    (62)

\sigma_{MSC\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{NASE}(j, d) - \mu_{MSC\text{-}row}^{NASE}(d) \big)^2 \right)^{1/2}                    (63)

\mu_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)                    (64)

\sigma_{MSV\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{NASE}(j, d) - \mu_{MSV\text{-}row}^{NASE}(d) \big)^2 \right)^{1/2}                    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_{row}^{NASE} = [\mu_{MSC\text{-}row}^{NASE}(0), \sigma_{MSC\text{-}row}^{NASE}(0), \mu_{MSV\text{-}row}^{NASE}(0), \sigma_{MSV\text{-}row}^{NASE}(0), \ldots, \mu_{MSC\text{-}row}^{NASE}(D-1), \sigma_{MSC\text{-}row}^{NASE}(D-1), \mu_{MSV\text{-}row}^{NASE}(D-1), \sigma_{MSV\text{-}row}^{NASE}(D-1)]^T                    (66)

Similarly the modulation spectral feature values derived from the j-th (0 \le j < J)
column of the MSC and MSV matrices can be computed as follows

\mu_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)                    (67)

\sigma_{MSC\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{NASE}(j, d) - \mu_{MSC\text{-}col}^{NASE}(j) \big)^2 \right)^{1/2}                    (68)

\mu_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)                    (69)

\sigma_{MSV\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{NASE}(j, d) - \mu_{MSV\text{-}col}^{NASE}(j) \big)^2 \right)^{1/2}                    (70)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC\text{-}col}^{NASE}(0), \sigma_{MSC\text{-}col}^{NASE}(0), \mu_{MSV\text{-}col}^{NASE}(0), \sigma_{MSV\text{-}col}^{NASE}(0), \ldots, \mu_{MSC\text{-}col}^{NASE}(J-1), \sigma_{MSC\text{-}col}^{NASE}(J-1), \mu_{MSV\text{-}col}^{NASE}(J-1), \sigma_{MSV\text{-}col}^{NASE}(J-1)]^T                    (71)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T                    (72)

In summary the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32 Combining the row-based
and column-based modulation spectral feature vectors results in a feature vector of
length 4D+4J That is the overall feature dimension of SMASE is 76+32 = 108
Fig 28 The row-based modulation spectral feature values (mean and standard deviation computed along each row of the MSC and MSV matrices, i.e. across the modulation subbands for each feature dimension)

Fig 29 The column-based modulation spectral feature values (mean and standard deviation computed along each column of the MSC and MSV matrices, i.e. across the feature dimensions for each modulation subband)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors of the whole set of training music signals
of the same genre

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}                    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre \bar{f}_c is the representative feature vector for the c-th music genre and N_c
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges of different feature values may be different a linear normalization is
applied to get the normalized feature vector \hat{f}_c

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},  1 \le c \le C                    (74)

where C is the number of classes \hat{f}_c(m) denotes the m-th feature value of the c-th
representative feature vector and f_{max}(m) and f_{min}(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals

f_{max}(m) = \max_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m),  f_{min}(m) = \min_{1 \le c \le C, \ 1 \le j \le N_c} f_{c,j}(m)                    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by
$$\mathbf{y} = \mathbf{A}_{WLDA}^T\,\mathbf{x} \qquad (81)$$
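The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows (a minimal NumPy illustration under the assumption that the labeled training vectors fit in one matrix; it is not the exact implementation used in this work):

import numpy as np

def whitened_lda(X, labels, n_classes):
    """Compute the whitened LDA matrix A_WLDA = Phi Lambda^{-1/2} Psi.

    X: (N, H) training vectors, labels: (N,) integer class labels in [0, n_classes).
    """
    overall_mean = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(n_classes):                      # Eqs. (76)-(77)
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - overall_mean)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)
    eigval, Phi = np.linalg.eigh(Sw)                # Sw Phi = Phi Lambda
    eigval = np.maximum(eigval, 1e-12)              # numerical guard
    W = Phi @ np.diag(1.0 / np.sqrt(eigval))        # Phi Lambda^{-1/2}, Eq. (79)
    Sb_w = W.T @ Sb @ W                             # whitened between-class scatter
    eval_b, evec_b = np.linalg.eigh(Sb_w)
    order = np.argsort(eval_b)[::-1][: n_classes - 1]   # (C-1) largest eigenvalues
    Psi = evec_b[:, order]
    return W @ Psi                                  # A_WLDA, Eq. (80)

# y = A_WLDA.T @ x projects an H-dimensional feature vector to (C-1) dimensions, Eq. (81).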
2.3 Music Genre Classification Phase
In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let y denote the whitened LDA
transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:
$$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n} \qquad (82)$$
where $\mathbf{y}_{c,n}$ denotes the whitened LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:
$$s = \arg\min_{1\le c\le C} d(\mathbf{y}, \bar{\mathbf{y}}_c) \qquad (83)$$
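A compact sketch of the nearest-centroid rule of Eqs. (82)-(83) is given below (NumPy; the helper names are illustrative only):

import numpy as np

def train_centroids(Y, labels, n_classes):
    """Eq. (82): per-genre centroid of the LDA-transformed training vectors."""
    return np.stack([Y[labels == c].mean(axis=0) for c in range(n_classes)])

def classify(y, centroids):
    """Eq. (83): pick the genre whose centroid is nearest in Euclidean distance."""
    dists = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(dists))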
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of the World music genre.
Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:
$$CA = \sum_{1\le c\le C} P_c \cdot CA_c \qquad (84)$$
where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
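In other words, each per-genre accuracy is weighted by that genre's share of the test set; a short sketch (illustrative names):

import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): class-prior-weighted average of per-genre accuracies."""
    p = np.asarray(class_counts) / np.sum(class_counts)    # P_c
    return float(np.sum(p * np.asarray(per_class_acc)))    # CA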
3.1 Comparison of row-based modulation spectral feature vectors
Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.
Table 3.1 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector
Feature Set                             CA (%)
SMMFCC1                                 77.50
SMOSC1                                  79.15
SMASE1                                  77.78
SMMFCC1+SMOSC1+SMASE1                   84.64
Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each case, the first matrix gives track counts and the second the corresponding percentages (each column corresponds to a genre of the test set; the Total row gives the number of test tracks per genre).

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         275           0     2          0        1     19
Electronic        0          91     0          1        7      6
Jazz              6           0    18          0        0      4
MetalPunk         2           3     0         36       20      4
PopRock           4          12     5          8       70     14
World            33           8     1          0        4     75
Total           320         114    26         45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         292           1     1          0        2     10
Electronic        1          89     1          2       11     11
Jazz              4           0    19          1        1      6
MetalPunk         0           5     0         32       21      3
PopRock           0          13     3         10       61      8
World            23           6     2          0        6     84
Total           320         114    26         45      102    122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         286           3     1          0        3     18
Electronic        0          87     1          1        9      5
Jazz              5           4    17          0        0      9
MetalPunk         0           4     1         36       18      4
PopRock           1          10     3          7       68     13
World            28           6     3          1        4     73
Total           320         114    26         45      102    122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     1          0        0      9
Electronic        0          96     1          1        9      9
Jazz              2           1    21          0        0      1
MetalPunk         0           1     0         34        8      1
PopRock           1           9     2          9       80     16
World            17           7     1          1        5     86
Total           320         114    26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49
3.2 Comparison of column-based modulation spectral feature vectors
Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which is different from the row-based case. As in the row-based case, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.
Table 3.3 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector
Feature Set                             CA (%)
SMMFCC2                                 70.64
SMOSC2                                  68.59
SMASE2                                  71.74
SMMFCC2+SMOSC2+SMASE2                   78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2 (counts, followed by percentages)

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         272           1     1          0        6     22
Electronic        0          84     0          2        8      4
Jazz             13           1    19          1        2     19
MetalPunk         2           7     0         39       30      4
PopRock           0          11     3          3       47     19
World            33          10     3          0        9     54
Total           320         114    26         45      102    122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         262           2     0          0        3     33
Electronic        0          83     0          1        9      6
Jazz             17           1    20          0        6     20
MetalPunk         1           5     0         33       21      2
PopRock           0          17     4         10       51     10
World            40           6     2          1       12     51
Total           320         114    26         45      102    122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         277           0     0          0        2     29
Electronic        0          83     0          1        5      2
Jazz              9           3    17          1        2     15
MetalPunk         1           5     1         35       24      7
PopRock           2          13     1          8       57     15
World            31          10     7          0       12     54
Total           320         114    26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         289           5     0          0        3     18
Electronic        0          89     0          2        4      4
Jazz              2           3    19          0        1     10
MetalPunk         2           2     0         38       21      2
PopRock           0          12     5          4       61     11
World            27           3     2          1       12     77
Total           320         114    26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11
3.3 Combination of row-based and column-based modulation spectral feature vectors
Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vectors achieve better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.
Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set                             CA (%)
SMMFCC3                                 80.38
SMOSC3                                  81.34
SMASE3                                  81.21
SMMFCC3+SMOSC3+SMASE3                   85.32
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3 (counts, followed by percentages)

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     1          0        3     19
Electronic        0          86     0          1        7      5
Jazz              2           0    18          0        0      3
MetalPunk         1           4     0         35       18      2
PopRock           1          16     4          8       67     13
World            16           6     3          1        7     80
Total           320         114    26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           0     0          0        1     13
Electronic        0          90     1          2        9      6
Jazz              0           0    21          0        0      4
MetalPunk         0           2     0         31       21      2
PopRock           0          11     3         10       64     10
World            20          11     1          2        7     87
Total           320         114    26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296           2     1          0        0     17
Electronic        1          91     0          1        4      3
Jazz              0           2    19          0        0      5
MetalPunk         0           2     1         34       20      8
PopRock           2          13     4          8       71      8
World            21           4     1          2        7     81
Total           320         114    26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300           2     0          0        0      8
Electronic        2          95     0          2        7      9
Jazz              1           1    20          0        0      0
MetalPunk         0           0     0         35       10      1
PopRock           1          10     3          7       79     11
World            16           6     3          1        6     93
Total           320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23
Conventional methods use the energy of each modulation subband as the feature value. However, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the subband energy for each feature set
Feature Set                             MSCs & MSVs    MSE
SMMFCC1                                 77.50          72.02
SMMFCC2                                 70.64          69.82
SMMFCC3                                 80.38          79.15
SMOSC1                                  79.15          77.50
SMOSC2                                  68.59          70.51
SMOSC3                                  81.34          80.11
SMASE1                                  77.78          76.41
SMASE2                                  71.74          71.06
SMASE3                                  81.21          79.15
SMMFCC1+SMOSC1+SMASE1                   84.64          85.08
SMMFCC2+SMOSC2+SMASE2                   78.60          79.01
SMMFCC3+SMOSC3+SMASE3                   85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119-139
where N is the length of the short-time Fourier transform (STFT) and Mt[n] is the
magnitude of the n-th frequency bin of the t-th frame
(4) Spectral Bandwidth: spectral bandwidth determines the frequency bandwidth of the signal:
$$SB_t = \frac{\sum_{n=1}^{N}(n - C_t)^2 \times M_t[n]}{\sum_{n=1}^{N} M_t[n]}$$
(5) Spectral Roll-off: spectral roll-off is a measure of spectral shape. It is defined as the frequency $R_t$ below which 85% of the magnitude distribution is concentrated:
$$\sum_{k=0}^{R_t} S[k] \le 0.85 \times \sum_{k=0}^{N-1} S[k]$$
(6) Spectral Flux: the spectral flux measures the amount of local spectral change. It is defined as the squared difference between the normalized magnitudes of successive spectral distributions:
$$SF_t = \sum_{k=0}^{N-1}\big(N_t[k] - N_{t-1}[k]\big)^2$$
where $N_t[k]$ and $N_{t-1}[k]$ are the normalized magnitude spectra of the t-th frame and the (t−1)-th frame, respectively.
(7) Mel-Frequency Cepstral Coefficients MFCC have been widely used for speech
recognition due to their ability to represent the speech spectrum in a compact form In
human auditory system the perceived pitch is not linear with respect to the physical
frequency of the corresponding tone The mapping between the physical frequency
scale (Hz) and perceived frequency scale (mel) is approximately linear below 1k Hz
and logarithmic at higher frequencies In fact MFCC have been proven to be very
effective in automatic speech recognition and in modeling the subjective frequency
content of audio signals
(8) Octave-based spectral contrast (OSC) OSC was developed to represent the
spectral characteristics of a music piece [3] This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately It can roughly reflect
the distribution of harmonic and non-harmonic components
(9) Normalized audio spectral envelope (NASE): NASE is defined in the MPEG-7 standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the log power spectrum in each logarithmic subband. Then each ASE coefficient is normalized with the root-mean-square (RMS) energy, yielding a normalized version of the ASE called NASE.
1.2.1.1.2 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
1.2.1.1.3 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melody/harmony analyzers. The main difference is that no fundamental frequency, chord, key, or other high-level feature has to be determined in advance.
1.2.1.2 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
1.2.1.2.1 Mean and standard deviation
The mean and standard deviation operation is the most commonly used method to integrate the short-term features. Let $\mathbf{x}_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T$ denote the representative D-dimensional feature vector of the i-th frame. The mean and standard deviation are calculated as follows:
$$\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_i[d], \quad 0 \le d \le D-1$$
$$\sigma[d] = \Big(\frac{1}{T}\sum_{i=0}^{T-1}\big(x_i[d] - \mu[d]\big)^2\Big)^{1/2}, \quad 0 \le d \le D-1$$
where T is the number of frames of the input signal This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals
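As a quick illustration (NumPy; it assumes the short-term feature vectors of one track are stacked row-wise and is not code from any cited work):

import numpy as np

def mean_std_integration(frame_features):
    """Integrate T D-dimensional short-term vectors into one 2D-dimensional vector."""
    mu = frame_features.mean(axis=0)        # mu[d] over frames
    sigma = frame_features.std(axis=0)      # sigma[d] over frames
    return np.concatenate([mu, sigma])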
1.2.1.2.2 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features. The features used in MAR include the mean vector, the covariance matrix of all short-term feature vectors, and the coefficients of the MAR model. In addition, for a p-order MAR model, the feature dimension is p × D × D, where D is the feature dimension of a short-term feature vector.
1.2.1.2.3 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
1.2.1.2.4 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
1.2.2 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with discrimination between classes rather than representations of various classes. The goal of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA, each class is generally modeled by a single Gaussian distribution. In fact, the music signal is too complex to be modeled by a single Gaussian distribution. In addition, the same transformation matrix of LDA is used for all the classes, which does not consider the class-wise differences.
1.2.3 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir, Orchestra, Piano, and String Quartet. In Jazz, the sub-genres contain BigBand, Cool, Fusion, Piano, Quartet, and Swing. The experimental results show that GMM with three components achieves the best classification accuracy.
West and Cox [4] constructed a hierarchical frame-based music genre classification system. In their classification system, a majority vote is taken to decide the final classification. The genres adopted in their music classification system are Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC and OSC as features and compare the performance with/without a decision tree classifier for a Gaussian classifier, GMM with three components, and LDA. In their experiment, the feature vector with the GMM classifier and decision tree classifier has the best accuracy, 82.79%.
Xu et al [29] applied SVM to discriminate between pure music and vocal one
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al. [30] used some low-level features (MFCC, entropy, centroid, bandwidth, etc.) and LDA for music genre classification. In their system, the classification accuracy is 93.0% for the classification of five music genres: Rock, Classical, Folk, Jazz, and Pop.
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can be up to 88.60% when the frame length is 30 s and each GMM is modeled by 48 Gaussian distributions.
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91% (artificial and natural sounds), for the second level is 99% (instrumental and automobile, human and nonhuman), and 95% for the third level (drums, flute, and piano; aircraft and helicopter; male and female speech; animals, birds, and insects).
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different scales in both the time and frequency domains. Taking into account the non-stationary nature of the input signal, the DWT provides an approximation with excellent time and frequency resolution. WPT is a variant of DWT, which is achieved by recursively convolving the input signal with a pair of low-pass and high-pass filters. Unlike DWT, which recursively decomposes only the low-pass subband, the WPT decomposes both bands at each level.
Bergatra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
1.3 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig. 1.2. A detailed description of each module is given below.
2.1 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
2.1.1 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to represent the speech spectrum in a compact form. In fact, MFCC have been proven to be very effective in automatic speech recognition and in modeling the subjective frequency content of audio signals. Fig. 2.1 is a flowchart for extracting MFCC from an input signal. The detailed steps are given below.
Step 1: Pre-emphasis
$$\hat{s}[n] = s[n] - a \times s[n-1] \qquad (1)$$
where s[n] is the current sample and s[n−1] is the previous sample; a typical value for a is 0.95.

Step 2: Framing
Each music signal is divided into a set of overlapped frames (frame size = N samples). Each pair of consecutive frames is overlapped by M samples.

Step 3: Windowing
Each frame is multiplied by a Hamming window:
$$\tilde{s}_i[n] = \hat{s}_i[n]\,w[n], \quad 0 \le n \le N-1 \qquad (2)$$
where the Hamming window function w[n] is defined as
$$w[n] = 0.54 - 0.46\cos\!\Big(\frac{2\pi n}{N-1}\Big), \quad 0 \le n \le N-1 \qquad (3)$$

Step 4: Spectral Analysis
Take the discrete Fourier transform of each frame using the FFT:
$$X_i[k] = \sum_{n=0}^{N-1}\tilde{s}_i[n]\,e^{-j2\pi nk/N}, \quad 0 \le k \le N-1 \qquad (4)$$
where k is the frequency index.

Step 5: Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set of Mel-scale band-pass filters:
$$E(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B \qquad (5)$$
where B is the total number of filters (B is 25 in this study), and $I_{b_l}$ and $I_{b_h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b_l}$ and $I_{b_h}$ are given as
$$I_{b_l} = \frac{f_{b_l}}{f_s}N, \qquad I_{b_h} = \frac{f_{b_h}}{f_s}N \qquad (6)$$
where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter, as shown in Table 2.1.

Step 6: Discrete Cosine Transform (DCT)
MFCC can be obtained by applying the DCT to the logarithm of E(b):
$$MFCC(l) = \sum_{b=0}^{B-1}\log_{10}\big(E(b)+1\big)\cos\!\Big(\frac{\pi l}{B}(b+0.5)\Big), \quad 0 \le l < L \qquad (7)$$
where L is the length of the MFCC feature vector (L is 20 in this study).
Therefore, the MFCC feature vector can be represented as follows:
$$\mathbf{x}_{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T \qquad (8)$$
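A condensed sketch of Steps 1-6 for a single frame is given below (NumPy; the Mel filterbank is assumed to be supplied as a list of (low, high) FFT-bin index pairs corresponding to Table 2.1, so this is an approximation of the procedure rather than the thesis implementation):

import numpy as np

def mfcc_frame(frame, band_edges, L=20, a=0.95):
    """Compute L MFCCs for one frame given Mel-band FFT bin ranges [(lo, hi), ...]."""
    emphasized = np.append(frame[0], frame[1:] - a * frame[:-1])      # Eq. (1)
    windowed = emphasized * np.hamming(len(frame))                    # Eqs. (2)-(3)
    power = np.abs(np.fft.fft(windowed)) ** 2                         # Eq. (4), A_i[k]
    energies = np.array([power[lo:hi + 1].sum() for lo, hi in band_edges])  # Eq. (5)
    B = len(band_edges)
    b = np.arange(B)
    # Eq. (7): DCT of the log filterbank energies
    return np.array([np.sum(np.log10(energies + 1.0) *
                            np.cos(np.pi * l * (b + 0.5) / B)) for l in range(L)])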
Fig. 2.1 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 2.1 The range of each triangular band-pass filter
Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
2.1.2 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution. Fig. 2.2 shows the block diagram for extracting the OSC feature. The detailed steps are described below.
Step 1: Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped frames, and the FFT is then used to obtain the corresponding spectrum of each frame.

Step 2: Octave-Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave-scale filters shown in Table 2.2. The octave-scale filtering operation can be described as follows:
$$E(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B \qquad (9)$$
where B is the number of subbands, and $I_{b_l}$ and $I_{b_h}$ denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. $A_i[k]$ is the squared amplitude of $X_i[k]$, that is, $A_i[k] = |X_i[k]|^2$. $I_{b_l}$ and $I_{b_h}$ are given as
$$I_{b_l} = \frac{f_{b_l}}{f_s}N, \qquad I_{b_h} = \frac{f_{b_h}}{f_s}N \qquad (10)$$
where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.
Step 3: Peak/Valley Selection
Let $(M_{b,1}, M_{b,2}, \ldots, M_{b,N_b})$ denote the magnitude spectrum within the b-th subband, where $N_b$ is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, $M_{b,1} \ge M_{b,2} \ge \cdots \ge M_{b,N_b}$. The spectral peak and spectral valley in the b-th subband are then estimated as follows:
$$Peak(b) = \log\Big(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,i}\Big) \qquad (11)$$
$$Valley(b) = \log\Big(\frac{1}{\alpha N_b}\sum_{i=1}^{\alpha N_b} M_{b,N_b-i+1}\Big) \qquad (12)$$
where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:
$$SC(b) = Peak(b) - Valley(b) \qquad (13)$$
The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:
$$\mathbf{x}_{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T \qquad (14)$$
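The peak/valley computation of Eqs. (11)-(13) for one subband can be sketched as follows (NumPy; alpha and the subband magnitude vector are the assumed inputs):

import numpy as np

def osc_subband(mag, alpha=0.2):
    """Spectral peak, valley and contrast for the magnitudes of one octave subband."""
    mag_sorted = np.sort(mag)[::-1]              # M_{b,1} >= M_{b,2} >= ...
    n = max(1, int(round(alpha * len(mag))))     # alpha * N_b neighbors
    peak = np.log(np.mean(mag_sorted[:n]))       # Eq. (11)
    valley = np.log(np.mean(mag_sorted[-n:]))    # Eq. (12)
    return peak, valley, peak - valley           # Eq. (13): spectral contrast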
Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)
Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)
Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
2.1.3 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum notated X(k) 1 le k le N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
20|)(|2
2 0|)(|1
)(2
2
⎪⎪⎩
⎪⎪⎨
⎧
ltltsdot
=sdot=
NkkXEN
NkkXENkP
w
w (15)
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2: Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig. 2.4). The NASE-scale filtering operation can be described as follows (see Table 2.3):
$$ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B \qquad (17)$$
where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):
$$r = 2^j \text{ octaves}, \quad -4 \le j \le 3 \qquad (18)$$
$I_{b_l}$ and $I_{b_h}$ are the low-frequency index and high-frequency index of the b-th band-pass filter, given as
$$I_{b_l} = \frac{f_{b_l}}{f_s}N, \qquad I_{b_h} = \frac{f_{b_h}}{f_s}N \qquad (19)$$
where $f_s$ is the sampling frequency, and $f_{b_l}$ and $f_{b_h}$ are the low frequency and high frequency of the b-th band-pass filter.
Step 3: Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:
$$ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B+1 \qquad (20)$$
Each ASE coefficient is then converted to the decibel scale:
$$ASE_{dB}(b) = 10\log_{10}\big(ASE(b)\big), \quad 0 \le b \le B+1 \qquad (21)$$
The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:
$$NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B+1 \qquad (22)$$
where the RMS-norm gain value R is defined as
$$R = \sqrt{\sum_{b=0}^{B+1}\big(ASE_{dB}(b)\big)^2} \qquad (23)$$
In MPEG-7, the ASE coefficients consist of one coefficient representing power between 0 Hz and loEdge, a series of coefficients representing power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing power above hiEdge, and the RMS-norm gain value R. Therefore, the feature dimension of NASE is B+3. Thus, the NASE feature vector of an audio frame can be represented as follows:
$$\mathbf{x}_{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T \qquad (24)$$
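A compact sketch of Eqs. (20)-(23) is shown below (NumPy; the subband bin ranges are assumed to be precomputed from Table 2.3, and the small offset added before the logarithm is only a numerical guard, not part of the original formulation):

import numpy as np

def nase(power_spectrum, band_edges):
    """Normalized audio spectral envelope from one frame's power spectrum."""
    ase = np.array([power_spectrum[lo:hi + 1].sum() for lo, hi in band_edges])  # Eq. (20)
    ase_db = 10.0 * np.log10(ase + 1e-12)     # Eq. (21); offset avoids log(0)
    r = np.sqrt(np.sum(ase_db ** 2))          # Eq. (23): RMS-norm gain value
    return r, ase_db / r                      # Eq. (22): NASE coefficients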
Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)
Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 in-band coefficients, one coefficient above hiEdge)
Table 2.3 The range of each normalized audio spectral envelope band-pass filter
Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
2.1.4 Modulation Spectral Analysis
MFCC, OSC, and NASE capture only short-term frame-based characteristics of audio signals. In order to capture the time-varying behavior of the music signals, we employ modulation spectral analysis on MFCC, OSC, and NASE to observe the variations of the sound.
2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let $MFCC_i[l]$, 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:
$$M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t(W/2)+n}[l]\, e^{-j2\pi mn/W}, \quad 0 \le m < W, \; 0 \le l < L \qquad (25)$$
where $M_t(m, l)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:
$$M^{MFCC}(m, l) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, l), \quad 0 \le m < W, \; 0 \le l < L \qquad (26)$$
where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination
The averaged modulation spectrum of each feature value is then decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as
$$MSP^{MFCC}(j, l) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{MFCC}(m, l) \qquad (27)$$
$$MSV^{MFCC}(j, l) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{MFCC}(m, l) \qquad (28)$$
where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low modulation frequency index and high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:
$$MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l) \qquad (29)$$
As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MMFCC is 2×20×8 = 320.
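The per-feature-value modulation spectrum analysis of Eqs. (25)-(29) can be sketched as follows (NumPy; the frame-wise feature trajectory, the texture-window length, and the modulation subband index ranges of Table 2.4 are the assumed inputs, so this is an approximation of the described procedure):

import numpy as np

def modulation_msc_msv(traj, W=512, subbands=((0, 2), (2, 4), (4, 8), (8, 16),
                                               (16, 32), (32, 64), (64, 128), (128, 256))):
    """traj: (T_frames, L) matrix of one feature set's frame-wise values (e.g. MFCC).

    Returns (MSC, MSV), each of shape (J, L), following Eqs. (25)-(29).
    """
    hop = W // 2                                          # 50% texture-window overlap
    starts = range(0, traj.shape[0] - W + 1, hop)
    # Eqs. (25)-(26): average magnitude modulation spectrogram over texture windows
    avg = np.mean([np.abs(np.fft.fft(traj[s:s + W], axis=0)) for s in starts], axis=0)
    msp = np.stack([avg[lo:hi].max(axis=0) for lo, hi in subbands])   # Eq. (27)
    msv = np.stack([avg[lo:hi].min(axis=0) for lo, hi in subbands])   # Eq. (28)
    return msp - msv, msv                                 # Eq. (29): MSC, plus MSV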
Fig. 2.5 The flowchart for extracting MMFCC
2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.
27
Step 1: Framing and OSC Extraction
Given an input music signal, divide the whole music signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis
Let $OSC_i[d]$, 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:
$$M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t(W/2)+n}[d]\, e^{-j2\pi mn/W}, \quad 0 \le m < W, \; 0 \le d < D \qquad (30)$$
where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:
$$M^{OSC}(m, d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \; 0 \le d < D \qquad (31)$$
where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands. In this study, the number of modulation subbands is 8 (J = 8); the frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as
$$MSP^{OSC}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{OSC}(m, d) \qquad (32)$$
$$MSV^{OSC}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{OSC}(m, d) \qquad (33)$$
where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:
$$MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d) \qquad (34)$$
As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MOSC is 2×20×8 = 320.
Fig. 2.6 The flowchart for extracting MOSC
2.1.4.3 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.
Step 1: Framing and NASE Extraction
Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis
Let $NASE_i[d]$, 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently on each feature value along the time trajectory within a texture window of length W:
$$M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t(W/2)+n}[d]\, e^{-j2\pi mn/W}, \quad 0 \le m < W, \; 0 \le d < D \qquad (35)$$
where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:
$$M^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \; 0 \le d < D \qquad (36)$$
where T is the total number of texture windows in the music track.

Step 3: Contrast/Valley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (see Table 2.4). In this study, the number of modulation subbands is 8 (J = 8). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are evaluated as
$$MSP^{NASE}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{NASE}(m, d) \qquad (37)$$
$$MSV^{NASE}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} M^{NASE}(m, d) \qquad (38)$$
where $\Phi_{j_l}$ and $\Phi_{j_h}$ are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:
$$MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d) \qquad (39)$$
As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.
Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT of each feature trajectory within texture windows → averaged modulation spectrum → contrast/valley determination)
Table 2.4 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
2.1.5 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.
2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:
$$\mu_{rowMSC}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l) \qquad (40)$$
$$\sigma_{rowMSC}^{MFCC}(l) = \Big(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{MFCC}(j, l) - \mu_{rowMSC}^{MFCC}(l)\big)^2\Big)^{1/2} \qquad (41)$$
$$\mu_{rowMSV}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l) \qquad (42)$$
$$\sigma_{rowMSV}^{MFCC}(l) = \Big(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{MFCC}(j, l) - \mu_{rowMSV}^{MFCC}(l)\big)^2\Big)^{1/2} \qquad (43)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as
$$\mathbf{f}_{row}^{MFCC} = [\mu_{rowMSC}^{MFCC}(0), \sigma_{rowMSC}^{MFCC}(0), \mu_{rowMSV}^{MFCC}(0), \sigma_{rowMSV}^{MFCC}(0), \ldots, \mu_{rowMSC}^{MFCC}(L-1), \sigma_{rowMSC}^{MFCC}(L-1), \mu_{rowMSV}^{MFCC}(L-1), \sigma_{rowMSV}^{MFCC}(L-1)]^T \qquad (44)$$
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:
$$\mu_{colMSC}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l) \qquad (45)$$
$$\sigma_{colMSC}^{MFCC}(j) = \Big(\frac{1}{L}\sum_{l=0}^{L-1}\big(MSC^{MFCC}(j, l) - \mu_{colMSC}^{MFCC}(j)\big)^2\Big)^{1/2} \qquad (46)$$
$$\mu_{colMSV}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l) \qquad (47)$$
$$\sigma_{colMSV}^{MFCC}(j) = \Big(\frac{1}{L}\sum_{l=0}^{L-1}\big(MSV^{MFCC}(j, l) - \mu_{colMSV}^{MFCC}(j)\big)^2\Big)^{1/2} \qquad (48)$$
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as
$$\mathbf{f}_{col}^{MFCC} = [\mu_{colMSC}^{MFCC}(0), \sigma_{colMSC}^{MFCC}(0), \mu_{colMSV}^{MFCC}(0), \sigma_{colMSV}^{MFCC}(0), \ldots, \mu_{colMSC}^{MFCC}(J-1), \sigma_{colMSC}^{MFCC}(J-1), \mu_{colMSV}^{MFCC}(J-1), \sigma_{colMSV}^{MFCC}(J-1)]^T \qquad (49)$$
If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:
$$\mathbf{f}^{MFCC} = [(\mathbf{f}_{row}^{MFCC})^T, (\mathbf{f}_{col}^{MFCC})^T]^T \qquad (50)$$
In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
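Given the MSC and MSV matrices of the previous step, the row- and column-based statistics of Eqs. (40)-(50) reduce to means and standard deviations along each axis; a short NumPy sketch is given below (an array layout of (J, L) is assumed, and the ordering of the concatenated values is illustrative rather than exactly that of Eqs. (44) and (49)):

import numpy as np

def aggregate(msc, msv):
    """Row- and column-based statistical aggregation of (J, L) MSC/MSV matrices."""
    def stats(mat, axis):
        return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])
    f_row = np.concatenate([stats(msc, axis=0), stats(msv, axis=0)])   # 4L values
    f_col = np.concatenate([stats(msc, axis=1), stats(msv, axis=1)])   # 4J values
    return np.concatenate([f_row, f_col])                              # Eq. (50)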
2.1.5.2 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:
$$\mu_{rowMSC}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d) \qquad (51)$$
$$\sigma_{rowMSC}^{OSC}(d) = \Big(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{OSC}(j, d) - \mu_{rowMSC}^{OSC}(d)\big)^2\Big)^{1/2} \qquad (52)$$
$$\mu_{rowMSV}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d) \qquad (53)$$
$$\sigma_{rowMSV}^{OSC}(d) = \Big(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{OSC}(j, d) - \mu_{rowMSV}^{OSC}(d)\big)^2\Big)^{1/2} \qquad (54)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as
$$\mathbf{f}_{row}^{OSC} = [\mu_{rowMSC}^{OSC}(0), \sigma_{rowMSC}^{OSC}(0), \mu_{rowMSV}^{OSC}(0), \sigma_{rowMSV}^{OSC}(0), \ldots, \mu_{rowMSC}^{OSC}(D-1), \sigma_{rowMSC}^{OSC}(D-1), \mu_{rowMSV}^{OSC}(D-1), \sigma_{rowMSV}^{OSC}(D-1)]^T \qquad (55)$$
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:
$$\mu_{colMSC}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d) \qquad (56)$$
$$\sigma_{colMSC}^{OSC}(j) = \Big(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{OSC}(j, d) - \mu_{colMSC}^{OSC}(j)\big)^2\Big)^{1/2} \qquad (57)$$
$$\mu_{colMSV}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d) \qquad (58)$$
$$\sigma_{colMSV}^{OSC}(j) = \Big(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{OSC}(j, d) - \mu_{colMSV}^{OSC}(j)\big)^2\Big)^{1/2} \qquad (59)$$
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as
$$\mathbf{f}_{col}^{OSC} = [\mu_{colMSC}^{OSC}(0), \sigma_{colMSC}^{OSC}(0), \mu_{colMSV}^{OSC}(0), \sigma_{colMSV}^{OSC}(0), \ldots, \mu_{colMSC}^{OSC}(J-1), \sigma_{colMSC}^{OSC}(J-1), \mu_{colMSV}^{OSC}(J-1), \sigma_{colMSV}^{OSC}(J-1)]^T \qquad (60)$$
If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:
$$\mathbf{f}^{OSC} = [(\mathbf{f}_{row}^{OSC})^T, (\mathbf{f}_{col}^{OSC})^T]^T \qquad (61)$$
In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2.1.5.3 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:
$$\mu_{rowMSC}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d) \qquad (62)$$
$$\sigma_{rowMSC}^{NASE}(d) = \Big(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSC^{NASE}(j, d) - \mu_{rowMSC}^{NASE}(d)\big)^2\Big)^{1/2} \qquad (63)$$
$$\mu_{rowMSV}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d) \qquad (64)$$
$$\sigma_{rowMSV}^{NASE}(d) = \Big(\frac{1}{J}\sum_{j=0}^{J-1}\big(MSV^{NASE}(j, d) - \mu_{rowMSV}^{NASE}(d)\big)^2\Big)^{1/2} \qquad (65)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as
$$\mathbf{f}_{row}^{NASE} = [\mu_{rowMSC}^{NASE}(0), \sigma_{rowMSC}^{NASE}(0), \mu_{rowMSV}^{NASE}(0), \sigma_{rowMSV}^{NASE}(0), \ldots, \mu_{rowMSC}^{NASE}(D-1), \sigma_{rowMSC}^{NASE}(D-1), \mu_{rowMSV}^{NASE}(D-1), \sigma_{rowMSV}^{NASE}(D-1)]^T \qquad (66)$$
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:
$$\mu_{colMSC}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d) \qquad (67)$$
$$\sigma_{colMSC}^{NASE}(j) = \Big(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSC^{NASE}(j, d) - \mu_{colMSC}^{NASE}(j)\big)^2\Big)^{1/2} \qquad (68)$$
$$\mu_{colMSV}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d) \qquad (69)$$
$$\sigma_{colMSV}^{NASE}(j) = \Big(\frac{1}{D}\sum_{d=0}^{D-1}\big(MSV^{NASE}(j, d) - \mu_{colMSV}^{NASE}(j)\big)^2\Big)^{1/2} \qquad (70)$$
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as
$$\mathbf{f}_{col}^{NASE} = [\mu_{colMSC}^{NASE}(0), \sigma_{colMSC}^{NASE}(0), \mu_{colMSV}^{NASE}(0), \sigma_{colMSV}^{NASE}(0), \ldots, \mu_{colMSC}^{NASE}(J-1), \sigma_{colMSC}^{NASE}(J-1), \mu_{colMSV}^{NASE}(J-1), \sigma_{colMSV}^{NASE}(J-1)]^T \qquad (71)$$
If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:
$$\mathbf{f}^{NASE} = [(\mathbf{f}_{row}^{NASE})^T, (\mathbf{f}_{col}^{NASE})^T]^T \qquad (72)$$
In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig. 2.8 The row-based modulation spectral feature values (the mean and standard deviation are computed along each row of the MSC/MSV matrices, i.e., for a fixed feature dimension across all modulation subbands)
Fig. 2.9 The column-based modulation spectral feature values (the mean and standard deviation are computed along each column of the MSC/MSV matrices, i.e., for a fixed modulation subband across all feature dimensions)
216 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre
is derived by averaging the feature vectors of the whole set of training music signals
of the same genre:

\bar{f}_{c} = \frac{1}{N_{c}}\sum_{n=1}^{N_{c}} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, \bar{f}_{c} is the representative feature vector for the c-th music genre, and N_{c}
is the number of training music signals belonging to the c-th music genre. Since the
dynamic ranges of different feature values may be different, a linear normalization is
applied to get the normalized feature vector \hat{f}_{c}:

\hat{f}_{c}(m) = \frac{\bar{f}_{c}(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},  1 ≤ c ≤ C   (74)

where C is the number of classes, \bar{f}_{c}(m) denotes the m-th feature value of the c-th
representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1\le c\le C,\,1\le j\le N_{c}} f_{c,j}(m),  f_{min}(m) = \min_{1\le c\le C,\,1\le j\le N_{c}} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre.
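As an illustration of Eq. (74), the sketch below (illustrative code, not taken from the thesis; function names and the 332-dimensional combined feature size are my own assumptions) fits the per-dimension minimum and maximum on the training set and applies the same linear mapping to any feature vector.

```python
import numpy as np

def fit_minmax(train_features):
    """train_features: array of shape (num_tracks, num_dims).
    Returns the per-dimension minimum and maximum over all training tracks."""
    return train_features.min(axis=0), train_features.max(axis=0)

def normalize(f, f_min, f_max, eps=1e-12):
    """Linearly map each feature value into [0, 1] as in Eq. (74).
    eps guards against a zero dynamic range (not discussed in the thesis)."""
    return (f - f_min) / (f_max - f_min + eps)

# Usage: statistics come from the training set only and are reused at test time.
train = np.random.rand(729, 332)       # e.g. 729 training tracks, 332 = 112+112+108 dims
f_min, f_max = fit_minmax(train)
normalized_track = normalize(train[0], f_min, f_max)
```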
22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space. LDA deals with the
discrimination between various classes rather than the representation of all classes.
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance. In LDA, an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in
order to provide higher discriminability among various music classes.

Let S_W and S_B denote the within-class scatter matrix and between-class scatter
matrix respectively. The within-class scatter matrix is defined as

S_{W} = \sum_{c=1}^{C}\sum_{n=1}^{N_{c}} (x_{c,n}-\bar{x}_{c})(x_{c,n}-\bar{x}_{c})^{T}   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_{c} is the mean vector of class
c, C is the total number of music classes, and N_{c} is the number of training vectors
labeled as class c. The between-class scatter matrix is given by

S_{B} = \sum_{c=1}^{C} N_{c}\,(\bar{x}_{c}-\bar{x})(\bar{x}_{c}-\bar{x})^{T}   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_{F}(A) = \mathrm{tr}\bigl((A^{T}S_{W}A)^{-1}(A^{T}S_{B}A)\bigr)   (78)

From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23]. First, the eigenvalues and corresponding
eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the
corresponding eigenvalues. Thus S_WΦ = ΦΛ. Each training vector x is then
whitening transformed by ΦΛ^{-1/2}:

x_{w} = (\Phi\Lambda^{-1/2})^{T}\,x   (79)

It can be shown that the whitened within-class scatter matrix
S_{W}^{w} = (\Phi\Lambda^{-1/2})^{T} S_{W} (\Phi\Lambda^{-1/2}), derived from all the whitened training vectors, will
become an identity matrix I. Thus the whitened between-class scatter matrix
S_{B}^{w} = (\Phi\Lambda^{-1/2})^{T} S_{B} (\Phi\Lambda^{-1/2}) contains all the discriminative information. A
transformation matrix Ψ can be determined by finding the eigenvectors of S_{B}^{w}.
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C−1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix
A_{WLDA} is defined as

A_{WLDA} = \Phi\Lambda^{-1/2}\Psi   (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower
h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

y = A_{WLDA}^{T}\,x   (81)
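A minimal sketch of the whitened LDA transformation of Eqs. (76)-(81) is given below. This is illustrative NumPy code rather than the original implementation; it assumes the within-class scatter matrix is non-singular (no regularization is shown) and that class labels are integer indices.

```python
import numpy as np

def whitened_lda(X, labels, num_classes):
    """X: (N, H) training matrix, labels: integer class index per row.
    Returns the H x (C-1) whitened LDA transformation matrix A_WLDA."""
    overall_mean = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                    # Eq. (76)
        diff = (mc - overall_mean)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)              # Eq. (77)
    # Whitening: Sw = Phi Lambda Phi^T, whitening matrix Phi Lambda^{-1/2}
    eigval, Phi = np.linalg.eigh(Sw)
    W = Phi @ np.diag(1.0 / np.sqrt(eigval))             # Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                                  # whitened between-class scatter
    # Eigenvectors of Sb_w sorted by decreasing eigenvalue; keep C-1 of them.
    eval_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(eval_b)[::-1][:num_classes - 1]
    return W @ Psi[:, order]                             # Eq. (80)

# y = A.T @ x then maps an H-dimensional vector to C-1 dimensions (Eq. (81)).
```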
23 Music Genre Classification Phase

In the classification phase, the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track. The same
linear normalization process is applied to each feature value. The normalized feature
vector is then transformed into a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA
transformed feature vector. In this study, the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector:

\bar{y}_{c} = \frac{1}{N_{c}}\sum_{n=1}^{N_{c}} y_{c,n}   (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_{c} is the representative feature vector of the
c-th music genre, and N_{c} is the number of training music tracks labeled as the c-th
music genre. The distance between two feature vectors is measured by the Euclidean
distance. Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has the minimum Euclidean
distance to y:

s = \arg\min_{1\le c\le C} d(y, \bar{y}_{c})   (83)
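The decision rule of Eqs. (82)-(83) is a nearest-centroid classifier in the transformed space. A small illustrative sketch (my own function and variable names) follows.

```python
import numpy as np

def train_centroids(Y, labels, num_classes):
    """Y: (N, h) whitened-LDA-transformed training vectors.
    Returns one centroid per music genre (Eq. (82))."""
    return np.stack([Y[labels == c].mean(axis=0) for c in range(num_classes)])

def classify(y, centroids):
    """Return the genre index with the minimum Euclidean distance to y (Eq. (83))."""
    distances = np.linalg.norm(centroids - y, axis=1)
    return int(np.argmin(distances))
```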
Chapter 3
Experimental Results

The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison. The database consists of 1458 music tracks, in
which 729 music tracks are used for training and the other 729 tracks for testing. The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this
study, each MP3 audio file is first converted into raw digital audio before
classification. These music tracks are classified into six classes (that is, C = 6):
Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop, and 122/122 tracks of the World music genre.

Since the music tracks per class are not equally distributed, the overall accuracy
of correctly classified genres is evaluated as follows:

CA = \sum_{1\le c\le C} P_{c}\cdot CA_{c}   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the
classification accuracy for the c-th music genre.
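For example, with the testing-set sizes above the class priors are P_c = N_c/729 (320/729 for Classical, 114/729 for Electronic, and so on). The snippet below (illustrative only; the per-class accuracies shown are hypothetical example values, not results from this thesis) reproduces the weighted sum of Eq. (84).

```python
# Illustrative computation of Eq. (84) using hypothetical per-class accuracies.
counts = {'Classical': 320, 'Electronic': 114, 'JazzBlue': 26,
          'MetalPunk': 45, 'RockPop': 102, 'World': 122}            # testing tracks
per_class_accuracy = {'Classical': 0.94, 'Electronic': 0.84, 'JazzBlue': 0.77,
                      'MetalPunk': 0.78, 'RockPop': 0.78, 'World': 0.76}  # example values
total = sum(counts.values())                                          # 729
overall = sum(counts[g] / total * per_class_accuracy[g] for g in counts)
print(round(overall, 4))
```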
31 Comparison of row-based modulation spectral feature vectors

Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices.

Table 31 Averaged classification accuracy (CA %) for row-based modulation spectral feature vectors

Feature Set                        CA (%)
SMMFCC1                            77.50
SMOSC1                             79.15
SMASE1                             77.78
SMMFCC1+SMOSC1+SMASE1              84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which is different from the row-based case. As in the row-based case, the combined
feature vector again achieves the best performance. Table 34 shows the corresponding
confusion matrices.

Table 33 Averaged classification accuracy (CA %) for column-based modulation spectral feature vectors

Feature Set                        CA (%)
SMMFCC2                            70.64
SMOSC2                             68.59
SMASE2                             71.74
SMMFCC2+SMOSC2+SMASE2              78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors. SMMFCC3,
SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC,
OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that
the combined feature vector achieves better classification performance than each
individual row-based or column-based feature vector. In particular, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 36 shows the corresponding confusion matrices.

Table 35 Averaged classification accuracy (CA %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                        CA (%)
SMMFCC3                            80.38
SMOSC3                             81.34
SMASE3                             81.21
SMMFCC3+SMOSC3+SMASE3              85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value, whereas we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature values. Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined. In this table,
SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based,
column-based, and combined feature vectors derived from modulation spectral
analysis of MFCC.

Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) for each feature value

Feature Set                        MSCs & MSVs    MSE
SMMFCC1                            77.50          72.02
SMMFCC2                            70.64          69.82
SMMFCC3                            80.38          79.15
SMOSC1                             79.15          77.50
SMOSC2                             68.59          70.51
SMOSC3                             81.34          80.11
SMASE1                             77.78          76.41
SMASE2                             71.74          71.06
SMASE3                             81.21          79.15
SMMFCC1+SMOSC1+SMASE1              84.64          85.08
SMMFCC2+SMOSC2+SMASE2              78.60          79.01
SMMFCC3+SMOSC3+SMASE3              85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification. Long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value. For each spectral/cepstral feature set, a
modulation spectrogram is generated by collecting the modulation spectra of
all corresponding feature values. Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically spaced
modulation subband. Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features. The music database employed
in the ISMIR2004 Audio Description Contest, in which all music tracks are classified
into six classes, was used for performance comparison. If the modulation spectral
features of MFCC, OSC, and NASE are combined together, the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music
Genre Classification Contest.
References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, The way it sounds: timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.
[13] J. Jose Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proceedings of the 6th International Conference on Digital Audio Effects, September 2003, pp. 8-11.
[14] J. G. A. Barbedo, A. Lopes, Research article: automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007 (June 2006) 1-12.
[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5, March 2005, pp. 197-200.
[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio features, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histograms in audio and symbolic music information retrieval, Proceedings of IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Acoustical Society of America 102 (3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), July 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, May 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech and Language Processing 15 (4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of the Workshop on Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of online learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.
(8) Octave-based spectral contrast (OSC): OSC was developed to represent the
spectral characteristics of a music piece [3]. This feature describes the strength of
spectral peaks and spectral valleys in each sub-band separately. It can roughly reflect
the distribution of harmonic and non-harmonic components.
(9) Normalized audio spectral envelope (NASE): NASE is defined in the MPEG-7
standard [17]. First, the audio spectral envelope (ASE) is obtained from the sum of the
log power spectrum in each logarithmic subband. Then each ASE coefficient is
normalized with the Root Mean Square (RMS) energy, yielding a normalized version
of the ASE called NASE.
12112 Rhythmic features
The features representing the rhythmic content of a music piece are mainly
derived from the beat histogram including the overall beat strength the main beat and
its strength the period of the main beat and subbeats the relative strength of subbeats
to main beat Many beat-tracking algorithms [18 19] providing an estimate of the
main beat and the corresponding strength have been proposed
12113 Pitch features
Tzanetakis et al [20] extracted pitch features from the pitch histograms of a
music piece The extracted pitch features contain frequency pitch strength and pitch
interval The pitch histogram can be estimated by multiple pitch detection techniques
[21 22] Melody and harmony have been widely used by musicologists to study the
musical structures Scaringella et al [23] proposed a method to extract melody and
harmony features by characterizing the pitch distribution of a short segment like most
melodyharmony analyzers The main difference is that no fundamental frequency
chord, key, or other high-level feature has to be determined in advance.
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features. Let x_i = [x_i[0], x_i[1], \ldots, x_i[D-1]]^T denote the representative
D-dimensional feature vector of the i-th frame. The mean and standard deviation are
calculated as follows:

\mu[d] = \frac{1}{T}\sum_{i=0}^{T-1} x_{i}[d],  0 ≤ d ≤ D-1

\sigma[d] = \sqrt{\frac{1}{T}\sum_{i=0}^{T-1}\bigl(x_{i}[d]-\mu[d]\bigr)^{2}},  0 ≤ d ≤ D-1

where T is the number of frames of the input signal. This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals.
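A compact illustration of this mean/standard-deviation integration (illustrative code, not taken from the referenced work): given a matrix of short-term feature vectors, the long-term representation is simply the concatenation of the per-dimension mean and standard deviation.

```python
import numpy as np

def mean_std_integration(frames):
    """frames: (T, D) matrix of short-term feature vectors (one row per frame).
    Returns a 2*D long-term feature vector [mu(0..D-1), sigma(0..D-1)]."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# e.g. 1200 frames of 20-dimensional MFCC -> a single 40-dimensional vector
long_term = mean_std_integration(np.random.rand(1200, 20))
```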
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model. In addition, for a p-order MAR model, the feature
dimension is p × D × D, where D is the feature dimension of a short-term feature
vector.
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA, each class is generally modeled by a single Gaussian distribution. In fact, the
music signal is too complex to be modeled by a single Gaussian distribution. In
addition, the same transformation matrix of LDA is used for all the classes, which
does not consider the class-wise differences.
123 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical framed based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock, Classical, Heavy Metal, Drum & Bass, Reggae, and Jungle. They take MFCC
and OSC as features and compare the performance, with and without a decision tree,
of a Gaussian classifier, a GMM with three components, and LDA. In their
experiment, the feature vector with the GMM classifier and the decision tree achieves
the best accuracy of 82.79%.
Xu et al [29] applied SVM to discriminate between pure music and vocal music.
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] use some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 93.0% for the classification of five music genres: Rock,
Classical, Folk, Jazz, and Pop.
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can reach 88.60% when the frame length is 30 s and each
GMM is modeled by 48 Gaussian distributions.
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91% (artificial and natural
sounds), for the second level is 99% (instrumental and automobile, human and
nonhuman), and for the third level is 95% (drums, flute and piano; aircraft and
helicopter; male and female speech; animals, birds and insects).
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low-pass and high-pass filters. Unlike the DWT,
which recursively decomposes only the low-pass subband, the WPT decomposes both
subbands at each level.
Bergstra et al [33] used AdaBoost for music classification. AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form. In fact, MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals. Fig 21 is a flowchart for extracting MFCC from
an input signal. The detailed steps are given below.

Step 1 Pre-emphasis

\hat{s}[n] = s[n] - a \times s[n-1]   (1)

where s[n] is the current sample, s[n-1] is the previous sample, and a typical value
for a is 0.95.

Step 2 Framing

Each music signal is divided into a set of overlapped frames (frame size = N
samples). Each pair of consecutive frames is overlapped by M samples.

Step 3 Windowing

Each frame is multiplied by a Hamming window:

\tilde{s}_{i}[n] = \hat{s}_{i}[n]\,w[n],  0 ≤ n ≤ N-1   (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 - 0.46\cos\Bigl(\frac{2\pi n}{N-1}\Bigr),  0 ≤ n ≤ N-1   (3)

Step 4 Spectral Analysis

Take the discrete Fourier transform of each frame using the FFT:

X_{i}[k] = \sum_{n=0}^{N-1}\tilde{s}_{i}[n]\,e^{-j\frac{2\pi nk}{N}},  0 ≤ k ≤ N-1   (4)

where k is the frequency index.

Step 5 Mel-scale Band-Pass Filtering

The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters:

E_{i}(b) = \sum_{k=I_{b}^{l}}^{I_{b}^{h}} A_{i}[k],  0 ≤ b < B, 0 ≤ k ≤ N-1   (5)

where B is the total number of filters (B is 25 in this study), I_{b}^{l} and I_{b}^{h} denote
respectively the low-frequency index and high-frequency index of the b-th band-pass
filter, and A_{i}[k] is the squared amplitude of X_{i}[k], that is, A_{i}[k] = |X_{i}[k]|^{2}.
I_{b}^{l} and I_{b}^{h} are given as

I_{b}^{l} = \frac{f_{b}^{l}}{f_{s}/N},  I_{b}^{h} = \frac{f_{b}^{h}}{f_{s}/N}   (6)

where f_{s} is the sampling frequency, and f_{b}^{l} and f_{b}^{h} are the low frequency and
high frequency of the b-th band-pass filter, as shown in Table 21.

Step 6 Discrete cosine transform (DCT)

MFCC can be obtained by applying DCT on the logarithm of E(b):

MFCC_{i}(l) = \sum_{b=0}^{B-1}\log_{10}\bigl(1+E_{i}(b)\bigr)\cos\Bigl(\frac{(b+0.5)\,l\pi}{B}\Bigr),  0 ≤ l < L   (7)

where L is the length of the MFCC feature vector (L is 20 in this study).
Therefore the MFCC feature vector can be represented as follows:

x^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^{T}   (8)
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing →
windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 200]
1               (100, 300]
2               (200, 400]
3               (300, 500]
4               (400, 600]
5               (500, 700]
6               (600, 800]
7               (700, 900]
8               (800, 1000]
9               (900, 1149]
10              (1000, 1320]
11              (1149, 1516]
12              (1320, 1741]
13              (1516, 2000]
14              (1741, 2297]
15              (2000, 2639]
16              (2297, 3031]
17              (2639, 3482]
18              (3031, 4000]
19              (3482, 4595]
20              (4000, 5278]
21              (4595, 6063]
22              (5278, 6964]
23              (6063, 8000]
24              (6964, 9190]
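A simplified Python sketch of Steps 1-6 is shown below. This is an illustration under the stated parameter choices rather than the thesis implementation; in particular, the triangular filter shapes are approximated by rectangular summation, as in Eq. (5), and the function and variable names are my own.

```python
import numpy as np

def mfcc_frame(frame, band_edges, fs, L=20, a=0.95):
    """Compute L MFCCs for one frame.
    frame: 1-D array of N samples; band_edges: list of (f_low, f_high) pairs in Hz
    taken from Table 21; fs: sampling frequency."""
    N = len(frame)
    emphasized = np.append(frame[0], frame[1:] - a * frame[:-1])   # Eq. (1)
    windowed = emphasized * np.hamming(N)                          # Eqs. (2)-(3)
    power = np.abs(np.fft.fft(windowed)) ** 2                      # Eq. (4), A_i[k]
    B = len(band_edges)
    E = np.empty(B)
    for b, (f_lo, f_hi) in enumerate(band_edges):                  # Eqs. (5)-(6)
        k_lo, k_hi = int(f_lo * N / fs), int(f_hi * N / fs)
        E[b] = power[k_lo:k_hi + 1].sum()
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    dct_basis = np.cos((b + 0.5) * l * np.pi / B)                  # Eq. (7) basis
    return dct_basis @ np.log10(1.0 + E)                           # Eq. (7)
```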
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped
frames, and the FFT is then used to obtain the corresponding spectrum of each frame.

Step 2 Octave Scale Filtering

This spectrum is then divided into a number of subbands by the set of octave-scale
filters shown in Table 22. The octave-scale filtering operation can be
described as follows:

E_{i}(b) = \sum_{k=I_{b}^{l}}^{I_{b}^{h}} A_{i}[k],  0 ≤ b < B, 0 ≤ k ≤ N-1   (9)

where B is the number of subbands, I_{b}^{l} and I_{b}^{h} denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter, and
A_{i}[k] is the squared amplitude of X_{i}[k], that is, A_{i}[k] = |X_{i}[k]|^{2}.
I_{b}^{l} and I_{b}^{h} are given as

I_{b}^{l} = \frac{f_{b}^{l}}{f_{s}/N},  I_{b}^{h} = \frac{f_{b}^{h}}{f_{s}/N}   (10)

where f_{s} is the sampling frequency, and f_{b}^{l} and f_{b}^{h} are the low frequency and high
frequency of the b-th band-pass filter.

Step 3 Peak/Valley Selection

Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_{b}}) denote the magnitude spectrum within the b-th
subband, where N_{b} is the number of FFT frequency bins in the b-th subband.
Without loss of generality, let the magnitude spectrum be sorted in decreasing
order, that is, M_{b,1} ≥ M_{b,2} ≥ \ldots ≥ M_{b,N_{b}}. The spectral peak and
spectral valley in the b-th subband are then estimated as follows:

Peak(b) = \log\Bigl(\frac{1}{\alpha N_{b}}\sum_{i=1}^{\alpha N_{b}} M_{b,i}\Bigr)   (11)

Valley(b) = \log\Bigl(\frac{1}{\alpha N_{b}}\sum_{i=1}^{\alpha N_{b}} M_{b,N_{b}-i+1}\Bigr)   (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is
given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)   (13)

The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands. Thus the OSC feature vector of an audio frame can
be represented as follows:

x^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^{T}   (14)
Fig 22 The flowchart for computing OSC (input signal → framing → FFT →
octave-scale filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)

Filter number   Frequency interval (Hz)
0               [0, 0]
1               (0, 100]
2               (100, 200]
3               (200, 400]
4               (400, 800]
5               (800, 1600]
6               (1600, 3200]
7               (3200, 6400]
8               (6400, 12800]
9               (12800, 22050)
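The peak/valley selection of Eqs. (11)-(13) can be sketched as follows (illustrative code with my own names; the octave band edges come from Table 22, and the small constant added inside the logarithm, which is not part of the original formulation, only guards against log of zero).

```python
import numpy as np

def osc_frame(frame, band_edges, fs, alpha=0.2):
    """Compute the OSC feature vector [valleys..., contrasts...] of one frame.
    band_edges: (f_low, f_high) pairs in Hz from Table 22."""
    N = len(frame)
    magnitude = np.abs(np.fft.fft(frame))
    valleys, contrasts = [], []
    for f_lo, f_hi in band_edges:
        k_lo, k_hi = int(f_lo * N / fs), int(f_hi * N / fs)
        band = np.sort(magnitude[k_lo:k_hi + 1])[::-1]     # descending order
        n_avg = max(1, int(round(alpha * len(band))))      # alpha * N_b neighbours
        peak = np.log(band[:n_avg].mean() + 1e-12)         # Eq. (11)
        valley = np.log(band[-n_avg:].mean() + 1e-12)      # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)                    # Eq. (13)
    return np.array(valleys + contrasts)                   # Eq. (14)
```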
213 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor
provides a representation of the power spectrum of each audio frame. Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband. Fig 23 shows the block diagram for extracting the
NASE feature. For a given music piece, the main steps for computing NASE are
described as follows.

Step 1 Framing and Spectral Analysis

An input music signal is divided into a number of successive overlapped
frames. Each audio frame is multiplied by a Hamming window function and
analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where
N is the size of the FFT. The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k):

P(k) = \frac{1}{E_{w}N}\,|X(k)|^{2} for k = 0 and k = N/2,  P(k) = \frac{2}{E_{w}N}\,|X(k)|^{2} for 0 < k < N/2   (15)

where E_{w} is the energy of the Hamming window function w(n) of size N_{w}:

E_{w} = \sum_{n=0}^{N_{w}-1} |w(n)|^{2}   (16)

Step 2 Subband Decomposition

The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum
of 8 octaves (see Fig 24). The NASE scale filtering operation can be described
as follows (see Table 23):

ASE_{i}(b) = \sum_{k=I_{b}^{l}}^{I_{b}^{h}} P_{i}(k),  0 ≤ b < B, 0 ≤ k ≤ N-1   (17)

where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of
the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in this study):

r = 2^{j} octaves,  -4 ≤ j ≤ 3   (18)

I_{b}^{l} and I_{b}^{h} are the low-frequency index and high-frequency index of the b-th
band-pass filter, given as

I_{b}^{l} = \frac{f_{b}^{l}}{f_{s}/N},  I_{b}^{h} = \frac{f_{b}^{h}}{f_{s}/N}   (19)

where f_{s} is the sampling frequency, and f_{b}^{l} and f_{b}^{h} are the low frequency and
high frequency of the b-th band-pass filter.

Step 3 Normalized Audio Spectral Envelope

The ASE coefficient for the b-th subband is defined as the sum of the power
spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_{b}^{l}}^{I_{b}^{h}} P(k),  0 ≤ b ≤ B+1   (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10\log_{10}\bigl(ASE(b)\bigr),  0 ≤ b ≤ B+1   (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R},  0 ≤ b ≤ B+1   (22)

where the RMS-norm gain value R is defined as

R = \sqrt{\sum_{b=0}^{B+1}\bigl(ASE_{dB}(b)\bigr)^{2}}   (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge, a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge, a coefficient representing
power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension
of NASE is B+3. Thus the NASE feature vector of an audio frame can be
represented as follows:

x^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^{T}   (24)
Fig 23 The flowchart for computing NASE (input signal → framing → windowing →
FFT → subband decomposition → normalized audio spectral envelope)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2
(one coefficient below loEdge = 62.5 Hz, 16 coefficients in logarithmically spaced
bands up to hiEdge = 16 kHz, and one coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number   Frequency interval (Hz)
0               (0, 62]
1               (62, 88]
2               (88, 125]
3               (125, 176]
4               (176, 250]
5               (250, 353]
6               (353, 500]
7               (500, 707]
8               (707, 1000]
9               (1000, 1414]
10              (1414, 2000]
11              (2000, 2828]
12              (2828, 4000]
13              (4000, 5656]
14              (5656, 8000]
15              (8000, 11313]
16              (11313, 16000]
17              (16000, 22050]
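A rough sketch of the NASE computation (Eqs. (15)-(24)) is given below. It is illustrative code under the parameters above (B = 16 logarithmic bands between 62.5 Hz and 16 kHz), not the MPEG-7 reference implementation; names are my own and a small constant is added before the logarithm to avoid log of zero (not part of the original definition).

```python
import numpy as np

def nase_frame(frame, band_edges, fs):
    """Compute [R, NASE(0), ..., NASE(B+1)] for one frame.
    band_edges: (f_low, f_high) pairs in Hz covering the band below loEdge,
    the B logarithmic bands, and the band above hiEdge (Table 23)."""
    N = len(frame)
    w = np.hamming(N)
    Ew = np.sum(w ** 2)                                     # Eq. (16)
    spectrum = np.abs(np.fft.fft(frame * w)) ** 2
    P = spectrum / (Ew * N)                                 # Eq. (15)
    P[1:N // 2] *= 2.0                                      # one-sided spectrum factor
    ase = np.empty(len(band_edges))
    for b, (f_lo, f_hi) in enumerate(band_edges):           # Eqs. (17), (20)
        k_lo, k_hi = int(f_lo * N / fs), int(f_hi * N / fs)
        ase[b] = P[k_lo:k_hi + 1].sum()
    ase_db = 10.0 * np.log10(ase + 1e-12)                   # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                        # Eq. (23)
    return np.concatenate(([R], ase_db / R))                # Eqs. (22), (24)
```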
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals. In order to capture the time-varying behavior of music signals, we
employ modulation spectral analysis on MFCC, OSC, and NASE to observe the
variations of the sound.
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis

Let MFCC_{i}(l), 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.
The modulation spectrogram is obtained by applying the FFT independently on
each feature value along the time trajectory within a texture window of length W:

M_{t}(m,l) = \sum_{n=0}^{W-1} MFCC_{t\cdot W/2+n}(l)\,e^{-j\frac{2\pi nm}{W}},  0 ≤ m < W, 0 ≤ l < L   (25)

where M_{t}(m,l) is the modulation spectrogram for the t-th texture window, m is
the modulation frequency index, and l is the MFCC coefficient index. In this
study, W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows. The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows:

\bar{M}^{MFCC}(m,l) = \frac{1}{T}\sum_{t=1}^{T} |M_{t}(m,l)|,  0 ≤ m < W, 0 ≤ l < L   (26)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed
into J logarithmically spaced modulation subbands. In this study, the number of
modulation subbands is 8 (J = 8). The frequency interval of each modulation
subband is shown in Table 24. For each feature value, the modulation spectral
peak (MSP) and modulation spectral valley (MSV) within each modulation
subband are then evaluated:

MSP^{MFCC}(j,l) = \max_{\Phi_{j}^{l}\le m<\Phi_{j}^{h}} \bar{M}^{MFCC}(m,l)   (27)

MSV^{MFCC}(j,l) = \min_{\Phi_{j}^{l}\le m<\Phi_{j}^{h}} \bar{M}^{MFCC}(m,l)   (28)

where \Phi_{j}^{l} and \Phi_{j}^{h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands. Therefore the difference
between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j,l) = MSP^{MFCC}(j,l) - MSV^{MFCC}(j,l)   (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
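To make Steps 2 and 3 concrete, the sketch below (illustrative only; the function and variable names are my own) computes the averaged modulation spectrogram of a matrix of per-frame feature trajectories and then extracts MSC and MSV from logarithmically spaced modulation subbands such as those in Table 24. Note that Table 24 implies a modulation-frequency resolution of roughly 0.165 Hz per FFT bin, i.e. a frame rate of about 84.5 Hz, although the exact frame hop is not restated here.

```python
import numpy as np

def modulation_msc_msv(features, W=512, subband_edges=None):
    """features: (num_frames, L) per-frame feature trajectories (e.g. MFCC).
    Returns (msc, msv), each of shape (J, L), following Eqs. (25)-(29)."""
    if subband_edges is None:
        # Modulation-frequency index ranges from Table 24 (J = 8 subbands).
        subband_edges = [(0, 2), (2, 4), (4, 8), (8, 16),
                         (16, 32), (32, 64), (64, 128), (128, 256)]
    hop = W // 2                                      # 50% overlap of texture windows
    spectra = []
    for start in range(0, features.shape[0] - W + 1, hop):
        segment = features[start:start + W]           # one texture window
        spectra.append(np.abs(np.fft.fft(segment, axis=0)))   # Eq. (25)
    avg = np.mean(spectra, axis=0)                    # Eq. (26): (W, L) averaged magnitude
    msc, msv = [], []
    for lo, hi in subband_edges:                      # Eqs. (27)-(29)
        peak = avg[lo:hi].max(axis=0)
        valley = avg[lo:hi].min(axis=0)
        msc.append(peak - valley)
        msv.append(valley)
    return np.array(msc), np.array(msv)
```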
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction

Given an input music signal, divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let OSC_{i}(d), 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The
modulation spectrogram is obtained by applying the FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_{t}(m,d) = \sum_{n=0}^{W-1} OSC_{t\cdot W/2+n}(d)\,e^{-j\frac{2\pi nm}{W}},  0 ≤ m < W, 0 ≤ d < D   (30)

where M_{t}(m,d) is the modulation spectrogram for the t-th texture window, m is
the modulation frequency index, and d is the OSC coefficient index. In this
study, W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows. The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows:

\bar{M}^{OSC}(m,d) = \frac{1}{T}\sum_{t=1}^{T} |M_{t}(m,d)|,  0 ≤ m < W, 0 ≤ d < D   (31)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed
into J logarithmically spaced modulation subbands. In this study, the number of
modulation subbands is 8 (J = 8). The frequency interval of each modulation
subband is shown in Table 24. For each feature value, the modulation spectral
peak (MSP) and modulation spectral valley (MSV) within each modulation
subband are then evaluated:

MSP^{OSC}(j,d) = \max_{\Phi_{j}^{l}\le m<\Phi_{j}^{h}} \bar{M}^{OSC}(m,d)   (32)

MSV^{OSC}(j,d) = \min_{\Phi_{j}^{l}\le m<\Phi_{j}^{h}} \bar{M}^{OSC}(m,d)   (33)

where \Phi_{j}^{l} and \Phi_{j}^{h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands. Therefore the difference
between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j,d) = MSP^{OSC}(j,d) - MSV^{OSC}(j,d)   (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum
analysis is applied to the NASE feature values. Fig 27 shows the flowchart for
extracting MASE, and the detailed steps are described below.

Step 1 Framing and NASE Extraction

Given an input music signal, divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame.

Step 2 Modulation Spectrum Analysis

Let NASE_{i}(d), 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The
modulation spectrogram is obtained by applying the FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_{t}(m,d) = \sum_{n=0}^{W-1} NASE_{t\cdot W/2+n}(d)\,e^{-j\frac{2\pi nm}{W}},  0 ≤ m < W, 0 ≤ d < D   (35)

where M_{t}(m,d) is the modulation spectrogram for the t-th texture window, m is
the modulation frequency index, and d is the NASE coefficient index. In this
study, W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows. The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows:

\bar{M}^{NASE}(m,d) = \frac{1}{T}\sum_{t=1}^{T} |M_{t}(m,d)|,  0 ≤ m < W, 0 ≤ d < D   (36)

where T is the total number of texture windows in the music track.

Step 3 Contrast/Valley Determination

The averaged modulation spectrum of each feature value is then decomposed
into J logarithmically spaced modulation subbands (see Table 24). In this study,
the number of modulation subbands is 8 (J = 8). For each feature value, the
modulation spectral peak (MSP) and modulation spectral valley (MSV) within
each modulation subband are then evaluated:

MSP^{NASE}(j,d) = \max_{\Phi_{j}^{l}\le m<\Phi_{j}^{h}} \bar{M}^{NASE}(m,d)   (37)

MSV^{NASE}(j,d) = \min_{\Phi_{j}^{l}\le m<\Phi_{j}^{h}} \bar{M}^{NASE}(m,d)   (38)

where \Phi_{j}^{l} and \Phi_{j}^{h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands. Therefore the difference
between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j,d) = MSP^{NASE}(j,d) - MSV^{NASE}(j,d)   (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE
extraction → DFT over texture windows → averaged modulation spectrum →
contrast/valley determination)
Table 24 Frequency interval of each modulation subband

Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space, the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices are computed as the
feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows:

u^{MFCC}_{row-MSC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j,l)   (40)

\sigma^{MFCC}_{row-MSC}(l) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{MFCC}(j,l)-u^{MFCC}_{row-MSC}(l)\bigr)^{2}}   (41)

u^{MFCC}_{row-MSV}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j,l)   (42)

\sigma^{MFCC}_{row-MSV}(l) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{MFCC}(j,l)-u^{MFCC}_{row-MSV}(l)\bigr)^{2}}   (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as

f^{MFCC}_{row} = [u^{MFCC}_{row-MSC}(0), \sigma^{MFCC}_{row-MSC}(0), u^{MFCC}_{row-MSV}(0), \sigma^{MFCC}_{row-MSV}(0), \ldots,
                  u^{MFCC}_{row-MSC}(L-1), \sigma^{MFCC}_{row-MSC}(L-1), u^{MFCC}_{row-MSV}(L-1), \sigma^{MFCC}_{row-MSV}(L-1)]^{T}   (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

u^{MFCC}_{col-MSC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j,l)   (45)

\sigma^{MFCC}_{col-MSC}(j) = \sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSC^{MFCC}(j,l)-u^{MFCC}_{col-MSC}(j)\bigr)^{2}}   (46)

u^{MFCC}_{col-MSV}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j,l)   (47)

\sigma^{MFCC}_{col-MSV}(j) = \sqrt{\frac{1}{L}\sum_{l=0}^{L-1}\bigl(MSV^{MFCC}(j,l)-u^{MFCC}_{col-MSV}(j)\bigr)^{2}}   (48)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f^{MFCC}_{col} = [u^{MFCC}_{col-MSC}(0), \sigma^{MFCC}_{col-MSC}(0), u^{MFCC}_{col-MSV}(0), \sigma^{MFCC}_{col-MSV}(0), \ldots,
                  u^{MFCC}_{col-MSC}(J-1), \sigma^{MFCC}_{col-MSC}(J-1), u^{MFCC}_{col-MSV}(J-1), \sigma^{MFCC}_{col-MSV}(J-1)]^{T}   (49)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^{T}, (f^{MFCC}_{col})^{T}]^{T}   (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based
and column-based modulation spectral feature vectors results in a feature vector of
length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
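The dimension bookkeeping of Eqs. (40)-(50) can be checked with a few lines of illustrative code (hypothetical sketch with my own names; msc and msv stand for the L×J matrices of Section 2141).

```python
import numpy as np

# Dimension bookkeeping for SMMFCC (illustrative; msc/msv are L x J matrices).
L, J = 20, 8
msc = np.random.rand(L, J)
msv = np.random.rand(L, J)

f_row = np.column_stack([msc.mean(1), msc.std(1), msv.mean(1), msv.std(1)]).ravel()
f_col = np.column_stack([msc.mean(0), msc.std(0), msv.mean(0), msv.std(0)]).ravel()
f_smmfcc = np.concatenate([f_row, f_col])            # Eq. (50)
print(f_row.size, f_col.size, f_smmfcc.size)         # 80, 32, 112
```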
2152 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows:

u^{OSC}_{row-MSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j,d)   (51)

\sigma^{OSC}_{row-MSC}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j,d)-u^{OSC}_{row-MSC}(d)\bigr)^{2}}   (52)

u^{OSC}_{row-MSV}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j,d)   (53)

\sigma^{OSC}_{row-MSV}(d) = \sqrt{\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j,d)-u^{OSC}_{row-MSV}(d)\bigr)^{2}}   (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f^{OSC}_{row} = [u^{OSC}_{row-MSC}(0), \sigma^{OSC}_{row-MSC}(0), u^{OSC}_{row-MSV}(0), \sigma^{OSC}_{row-MSV}(0), \ldots,
                 u^{OSC}_{row-MSC}(D-1), \sigma^{OSC}_{row-MSC}(D-1), u^{OSC}_{row-MSV}(D-1), \sigma^{OSC}_{row-MSV}(D-1)]^{T}   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

u^{OSC}_{col-MSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j,d)   (56)

\sigma^{OSC}_{col-MSC}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j,d)-u^{OSC}_{col-MSC}(j)\bigr)^{2}}   (57)

u^{OSC}_{col-MSV}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j,d)   (58)

\sigma^{OSC}_{col-MSV}(j) = \sqrt{\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j,d)-u^{OSC}_{col-MSV}(j)\bigr)^{2}}   (59)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:
u^{NASE}_{MSC-row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(d,j), \quad 0 \le d < D    (62)

\sigma^{NASE}_{MSC-row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(d,j) - u^{NASE}_{MSC-row}(d)\right)^2\right)^{1/2}    (63)

u^{NASE}_{MSV-row}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(d,j)    (64)

\sigma^{NASE}_{MSV-row}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(d,j) - u^{NASE}_{MSV-row}(d)\right)^2\right)^{1/2}    (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
f^{NASE}_{row} = [u^{NASE}_{MSC-row}(0),\ \sigma^{NASE}_{MSC-row}(0),\ u^{NASE}_{MSV-row}(0),\ \sigma^{NASE}_{MSV-row}(0),\ \ldots,\ u^{NASE}_{MSC-row}(D-1),\ \sigma^{NASE}_{MSC-row}(D-1),\ u^{NASE}_{MSV-row}(D-1),\ \sigma^{NASE}_{MSV-row}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u^{NASE}_{MSC-col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(d,j), \quad 0 \le j < J    (67)

\sigma^{NASE}_{MSC-col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(d,j) - u^{NASE}_{MSC-col}(j)\right)^2\right)^{1/2}    (68)

u^{NASE}_{MSV-col}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(d,j)    (69)

\sigma^{NASE}_{MSV-col}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(d,j) - u^{NASE}_{MSV-col}(j)\right)^2\right)^{1/2}    (70)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f^{NASE}_{col} = [u^{NASE}_{MSC-col}(0),\ \sigma^{NASE}_{MSC-col}(0),\ u^{NASE}_{MSV-col}(0),\ \sigma^{NASE}_{MSV-col}(0),\ \ldots,\ u^{NASE}_{MSC-col}(J-1),\ \sigma^{NASE}_{MSC-col}(J-1),\ u^{NASE}_{MSV-col}(J-1),\ \sigma^{NASE}_{MSV-col}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f^{NASE}_{row})^T\ (f^{NASE}_{col})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based modulation spectral features: each row of the MSC/MSV matrices (one spectral/cepstral feature dimension across the modulation frequency subbands of a texture window) is summarized by its mean μ_row and standard deviation σ_row

Fig 29 The column-based modulation spectral features: each column of the MSC/MSV matrices (one modulation subband across the feature dimensions of a texture window) is summarized by its mean μ_col and standard deviation σ_col
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of the various feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th normalized feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
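A minimal sketch of Eqs. (73)-(75) is given below, assuming the training feature vectors are stacked in a NumPy array; the function and variable names are illustrative and not part of the original system.

```python
import numpy as np

def normalize_representatives(train_feats, labels, num_classes):
    """train_feats: (N, M) feature matrix of all training tracks.
    labels: (N,) genre indices in [0, num_classes).
    Returns the min-max normalized representative vector of each genre."""
    f_min = train_feats.min(axis=0)   # f_min(m) over all training signals, Eq. (75)
    f_max = train_feats.max(axis=0)   # f_max(m) over all training signals, Eq. (75)
    reps = np.stack([train_feats[labels == c].mean(axis=0)
                     for c in range(num_classes)])          # Eq. (73)
    # Eq. (74); assumes f_max(m) > f_min(m) for every feature dimension m
    return (reps - f_min) / (f_max - f_min)
```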
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [28] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)
where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
J_F(A) = \mathrm{tr}\left((A^T S_W A)^{-1}(A^T S_B A)\right)    (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by \Phi\Lambda^{-1/2}:

w = (\Phi\Lambda^{-1/2})^T x    (79)
It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix \Psi. Finally the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi\Lambda^{-1/2}\Psi    (80)
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
y = A_{WLDA}^T x    (81)
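A minimal sketch of the whitened LDA transformation described above, assuming the normalized training vectors and their genre labels are available as NumPy arrays (names are illustrative); it follows Eqs. (76)-(81) directly rather than any particular library implementation, and assumes S_W is nonsingular.

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """X: (N, H) training vectors, y: (N,) class labels.
    Returns the H x (C-1) whitened LDA matrix A_WLDA of Eq. (80)."""
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        diff = Xc - mean_c
        Sw += diff.T @ diff                                   # Eq. (76)
        d = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)                             # Eq. (77)
    # Whitening: Sw Phi = Phi Lambda, then transform by Phi Lambda^(-1/2)
    eigval, Phi = np.linalg.eigh(Sw)
    W = Phi @ np.diag(eigval ** -0.5)                         # Phi Lambda^(-1/2)
    Sb_w = W.T @ Sb @ W                                       # whitened S_B
    eigval_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(eigval_b)[::-1][:num_classes - 1]]  # top C-1 eigenvectors
    return W @ Psi                                            # Eq. (80)

# Y = X @ whitened_lda(X, y, C)   # Eq. (81) applied to every training vector
```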
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n}    (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
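The classification step of Eqs. (82)-(83) can be sketched as follows, assuming the whitened LDA transformed training vectors are available (function and variable names are illustrative):

```python
import numpy as np

def classify_nearest_centroid(y_test, Y_train, labels, num_classes):
    """y_test: (h,) transformed feature vector of a test track.
    Y_train: (N, h) transformed training vectors, labels: (N,) genre indices.
    Returns the index s of the nearest class centroid, Eq. (83)."""
    centroids = np.stack([Y_train[labels == c].mean(axis=0)
                          for c in range(num_classes)])     # Eq. (82)
    dists = np.linalg.norm(centroids - y_test, axis=1)      # Euclidean distance
    return int(np.argmin(dists))
```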
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of JazzBlue, 45/45 tracks of MetalPunk, 101/102 tracks of RockPop and 122/122 tracks of the World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
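As a check of Eq. (84), the per-class accuracies on the diagonal of Table 36(d) together with the class proportions of this test set reproduce the reported overall accuracy; a short illustrative sketch:

```python
import numpy as np

# class proportions P_c of the 729-track test set
P = np.array([320, 114, 26, 45, 102, 122]) / 729.0
# per-class accuracies CA_c (%) taken from the diagonal of Table 36(d)
CA_c = np.array([93.75, 83.33, 76.92, 77.78, 77.45, 76.23])
CA = float(np.sum(P * CA_c))   # Eq. (84): about 85.32 (%)
```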
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

Feature Set                         CA (%)
SMMFCC1                             77.50
SMOSC1                              79.15
SMASE1                              77.78
SMMFCC1+SMOSC1+SMASE1               84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector again gives the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

Feature Set                         CA (%)
SMMFCC2                             70.64
SMOSC2                              68.59
SMASE2                              71.74
SMMFCC2+SMOSC2+SMASE2               78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that each combined feature vector achieves better classification performance than the corresponding individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                         CA (%)
SMMFCC3                             80.38
SMOSC3                              81.34
SMASE3                              81.21
SMMFCC3+SMOSC3+SMASE3               85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as the feature values

Feature Set                         MSCs & MSVs    MSE
SMMFCC1                             77.50          72.02
SMMFCC2                             70.64          69.82
SMMFCC3                             80.38          79.15
SMOSC1                              79.15          77.50
SMOSC2                              68.59          70.51
SMOSC3                              81.34          80.11
SMASE1                              77.78          76.41
SMASE2                              71.74          71.06
SMASE3                              81.21          79.15
SMMFCC1+SMOSC1+SMASE1               84.64          85.08
SMMFCC2+SMOSC2+SMASE2               78.60          79.01
SMMFCC3+SMOSC3+SMASE3               85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest
References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo, A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop on Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.
8
chord key or other high-level feature has to determine in advance
1212 Long-term Features
To find the representative feature vector of a whole music piece the methods
employed to integrate the short-term features into a long-term feature include mean
and standard deviation autoregressive model [9] modulation spectrum analysis [24
25 26] and nonlinear time series analysis
12121 Mean and standard deviation
The mean and standard deviation operation is the most used method to integrate
the short-term features Let xi = [xi[0] xi[1] hellip xi[D-1]]T denote the representative
D-dimensional feature vector of the i-th frame The mean and standard deviation is
calculated as follow
summinus
=
=1
0][1][
T
ii dx
Tdμ 10 minuslele Dd
211
0
2 ]])[][(1[][ summinus
=
minus=T
ii ddx
Td μσ 10 minuslele Dd
where T is the number of frames of the input signal This statistical method exhibits
no information about the relationship between features as well as the time-varying
behavior of music signals
12122 Autoregressive model (AR model)
Meng et al [9] used AR model to analyze the time-varying texture of music
signals They proposed the diagonal autoregressive (DAR) and multivariate
autoregressive (MAR) analysis to integrate the short-term features In DAR each
short-term feature is independently modeled by an AR model The extracted feature
9
vector includes the mean and variance of all short-term feature vectors as well as the
coefficients of each AR model In MAR all short-term features are modeled by a
MAR model The difference between MAR model and AR model is that MAR
considers the relationship between features The features used in MAR include the
mean vector the covariance matrix of all shorter-term feature vectors and the
coefficients of the MAR model In addition for a p-order MAR model the feature
dimension is p times D times D where D is the feature dimension of a short-term feature
vector
12123 Modulation spectrum analysis
The idea of modulation spectrum analysis is to model the frequency variability of
signals along the time axis Kingsbury et al [24] first employed modulation
spectrogram for speech recognition It has been shown that the most sensitive
modulation frequency to human audition is about 4 Hz Sukittanon et al [25] used
modulation spectrum analysis for music content identification They showed that
modulation-scale features along with subband normalization are insensitive to
convolutional noise Shi et al [26] used modulation spectrum analysis to model the
long-term characteristics of music signals in order to extract the tempo feature for
music emotion classification
12124 Nonlinear time series analysis
Non-linear analysis of time series offers an alternative way to describe temporal
structure which is complementary to the analysis of linear correlation and spectral
properties Mierswa and Morik [27] used the reconstructed phase space to extract
10
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complexity to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
doesnrsquot consider the class-wise differences
122 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
11
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical framed based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance withwithout decision tree
classifier of Gaussian classifier GMM with three components and LDA In their
experiment the feature vector with GMM classifier and decision tree classifier has
best accuracy 8279
Xu et al [29] applied SVM to discriminate between pure music and vocal one
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] use some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 930 for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
12
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can up to 8860 when the frame length is 30s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91 (artificial and natural
sounds) for the second level is 99 (instrumental and automobile human and
nonhuman) and 95 for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
13
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband the WPDT decomposes both
bands at each level
Bergatra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
14
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
]1[ˆ][][ˆ minustimesminus= nsansns (1)
where s[n] is the current sample and s[nminus1] is the previous sample a typical
value for is 095 a
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
][ ][ˆ][~ nwnsns ii = 10 minuslele Nn (2)
where the Hamming window function w[n] is defined as
)1
2cos( 460540][minus
minus=N
nnw π 10 minuslele Nn (3)
15
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
][~][1
0
2
summinus
=
minus=
N
n
nNkj
ii enskXπ
10 minuslele Nk (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (5)
Where B is the total number of filters (B is 25 in the study) Ibl and Ibh
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter Ai[k] is the squared amplitude of Xi[k] that is
|][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (6)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
0 ))50(cos( ))(1(log)(1
010 Llb
BlbElMFCC
B
bi ltle++= sum
minus
=
π (7)
where L is the length of MFCC feature vector (L is 20 in the study)
16
Therefore the MFCC feature vector can be represented as follows
xMFCC = [MFCC(0) MFCC(1) hellip MFCC(L-1)]T (8)
Fig 21 The flowchart for computing MFCC
Pre-emphasis
Input Signal
Framing
Windowing
FFT
Mel-scale band-pass filtering
DCT
MFCC
17
Table 21 The range of each triangular band-pass filter
Filter number Frequency interval (Hz) 0 (0 200] 1 (100 300] 2 (200 400] 3 (300 500] 4 (400 600] 5 (500 700] 6 (600 800] 7 (700 900] 8 (800 1000] 9 (900 1149] 10 (1000 1320] 11 (1149 1516] 12 (1320 1741] 13 (1516 2000] 14 (1741 2297] 15 (2000 2639] 16 (2297 3031] 17 (2639 3482] 18 (3031 4000] 19 (3482 4595] 20 (4000 5278] 20 (4595 6063] 22 (5278 6964] 23 (6063 8000] 24 (6964 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
18
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (9)
where B is the number of subbands Ibl and Ibh denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter
Ai[k] is the squared amplitude of Xi[k] that is |][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
19
)1log()(1
sum=
=bN
iib
b
MN
bPeakα
α (11)
)1log()(1
1sum=
+minus=b
b
N
iiNb
b
MN
bValleyα
α (12)
where α is a neighborhood factor (α is 02 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
)( )()( bValleybPeakbSC minus= (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0) hellip Valley(B-1) SC(0) hellip SC(B-1)]T (14)
Input Signal
Framing
Octave scale filtering
PeakValley Selection
Spectral Contrast
OSC
FFT
Fig 22 The flowchart for computing OSC
20
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 441 kHz)
Filter number Frequency interval (Hz)0 [0 0] 1 (0 100] 2 (100 200] 3 (200 400] 4 (400 800] 5 (800 1600] 6 (1600 3200] 7 (3200 6400] 8 (6400 12800] 9 (12800 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum notated X(k) 1 le k le N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
20|)(|2
2 0|)(|1
)(2
2
⎪⎪⎩
⎪⎪⎨
⎧
ltltsdot
=sdot=
NkkXEN
NkkXENkP
w
w (15)
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a
spectrum of 8 octave interval (see Fig24) The NASE scale filtering
operation can be described as follows(see Table 23)
)()(
sum=
=hb
lb
I
Ikii kPbASE 120 0 minusleleltle NkBb
(17)
where B is the number of logarithmic subbands within the frequency range
[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of
the frequency subbands ranging from 116 of an octave to 8 octaves(B=16
r=12 in the study)
(18) 34 octaves 2 leleminus= jr j
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
22
spectrum coefficients within this subband
(20) 10 )()(
+lele= sum=
BbkPbASEhb
lb
I
Ik
Each ASE coefficient is then converted to the decibel scale
10 ))((log 10)( 10 +lele= BbbASEbASEdB (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
10 )()( +lele= BbR
bASEbNASE dB (22)
where the RMS-norm gain value R is defined as
))((1
0
2sum+
=
=B
bdB bASER (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R NASE(0) NASE(1) hellip NASE(B+1)]T (24)
23
Framing
Input Signal
Windowing
FFT
Normalized Audio Spectral Envelope
NASE
Subband Decomposition
Fig 23 The flowchart for computing NASE
625 125 250 500 1K 2K 4K 8K 16K
884 1768 3536 7071 14142 28284 56569 113137
1 coeff 16 coeffs 1 coeff
loEdge hiEdge
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 12
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
$\mathbf{f}_{row}^{OSC} = [u_{MSC-row}^{OSC}(0),\ \sigma_{MSC-row}^{OSC}(0),\ u_{MSV-row}^{OSC}(0),\ \sigma_{MSV-row}^{OSC}(0),\ \ldots,\ u_{MSC-row}^{OSC}(D-1),\ \sigma_{MSC-row}^{OSC}(D-1),\ u_{MSV-row}^{OSC}(D-1),\ \sigma_{MSV-row}^{OSC}(D-1)]^{T}$ (55)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows

$u_{MSC-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j,d)$ (56)

$\sigma_{MSC-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j,d) - u_{MSC-col}^{OSC}(j)\right)^{2}\right)^{1/2}$ (57)

$u_{MSV-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j,d)$ (58)

$\sigma_{MSV-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j,d) - u_{MSV-col}^{OSC}(j)\right)^{2}\right)^{1/2}$ (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$\mathbf{f}_{col}^{OSC} = [u_{MSC-col}^{OSC}(0),\ \sigma_{MSC-col}^{OSC}(0),\ u_{MSV-col}^{OSC}(0),\ \sigma_{MSV-col}^{OSC}(0),\ \ldots,\ u_{MSC-col}^{OSC}(J-1),\ \sigma_{MSC-col}^{OSC}(J-1),\ u_{MSV-col}^{OSC}(J-1),\ \sigma_{MSV-col}^{OSC}(J-1)]^{T}$ (60)

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained

$\mathbf{f}^{OSC} = [(\mathbf{f}_{row}^{OSC})^{T}\ (\mathbf{f}_{col}^{OSC})^{T}]^{T}$ (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows
$u_{MSC-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j,d)$ (62)

$\sigma_{MSC-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j,d) - u_{MSC-row}^{NASE}(d)\right)^{2}\right)^{1/2}$ (63)

$u_{MSV-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j,d)$ (64)

$\sigma_{MSV-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j,d) - u_{MSV-row}^{NASE}(d)\right)^{2}\right)^{1/2}$ (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
$\mathbf{f}_{row}^{NASE} = [u_{MSC-row}^{NASE}(0),\ \sigma_{MSC-row}^{NASE}(0),\ u_{MSV-row}^{NASE}(0),\ \sigma_{MSV-row}^{NASE}(0),\ \ldots,\ u_{MSC-row}^{NASE}(D-1),\ \sigma_{MSC-row}^{NASE}(D-1),\ u_{MSV-row}^{NASE}(D-1),\ \sigma_{MSV-row}^{NASE}(D-1)]^{T}$ (66)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows

$u_{MSC-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j,d)$ (67)

$\sigma_{MSC-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j,d) - u_{MSC-col}^{NASE}(j)\right)^{2}\right)^{1/2}$ (68)

$u_{MSV-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j,d)$ (69)

$\sigma_{MSV-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j,d) - u_{MSV-col}^{NASE}(j)\right)^{2}\right)^{1/2}$ (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$\mathbf{f}_{col}^{NASE} = [u_{MSC-col}^{NASE}(0),\ \sigma_{MSC-col}^{NASE}(0),\ u_{MSV-col}^{NASE}(0),\ \sigma_{MSV-col}^{NASE}(0),\ \ldots,\ u_{MSC-col}^{NASE}(J-1),\ \sigma_{MSC-col}^{NASE}(J-1),\ u_{MSV-col}^{NASE}(J-1),\ \sigma_{MSV-col}^{NASE}(J-1)]^{T}$ (71)

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained

$\mathbf{f}^{NASE} = [(\mathbf{f}_{row}^{NASE})^{T}\ (\mathbf{f}_{col}^{NASE})^{T}]^{T}$ (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based modulation spectral feature values (mean and standard deviation taken along each row of the MSC/MSV matrices, i.e., across the modulation frequency dimension within a texture window)

Fig 29 The column-based modulation spectral feature values (mean and standard deviation taken along each column of the MSC/MSV matrices, i.e., across the feature dimension within a texture window)
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n}$ (73)
where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector $\hat{\mathbf{f}}_c$
$\hat{f}_c(m) = \frac{f_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C$ (74)
where C is the number of classes, $f_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals
$f_{\max}(m) = \max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m)$ (75)
where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
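The linear normalization of Eqs. (73)-(75) can be sketched with NumPy as follows; this is an illustrative snippet only, and the small eps guard against constant-valued dimensions is an added practical detail that is not part of the thesis.

```python
import numpy as np

def fit_minmax(train_features):
    """Per-dimension minimum and maximum over all training vectors (Eq. 75)."""
    return train_features.min(axis=0), train_features.max(axis=0)

def normalize(vec, f_min, f_max, eps=1e-12):
    """Linearly map each feature value into [0, 1] (Eq. 74)."""
    return (vec - f_min) / (f_max - f_min + eps)  # eps guards constant dimensions

# Illustrative usage: 729 training tracks, 112-dimensional SMMFCC vectors.
train = np.random.rand(729, 112)
f_min, f_max = fit_minmax(train)
normalized_train = normalize(train, f_min, f_max)
```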
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^{T}$ (76)
where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
$\mathbf{S}_B = \sum_{c=1}^{C} N_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^{T}$ (77)
where $\bar{\mathbf{x}}$ is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
$J_F(\mathbf{A}) = \mathrm{tr}\left((\mathbf{A}^{T}\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^{T}\mathbf{S}_B\mathbf{A})\right)$ (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by $\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}$

$\mathbf{x}_w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{x}$ (79)
It can be shown that the whitened within-class scatter matrix $\mathbf{S}_{W_w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_W(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors will become an identity matrix I. Thus the whitened between-class scatter matrix $\mathbf{S}_{B_w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_B(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of $\mathbf{S}_{B_w}$. Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as
$\mathbf{A}_{WLDA} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi}$ (80)
$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$\mathbf{y} = \mathbf{A}_{WLDA}^{T}\mathbf{x}$ (81)
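The whitened LDA procedure of Eqs. (76)-(81) can be summarized by the following NumPy sketch. It assumes the within-class scatter matrix is nonsingular (a small eigenvalue floor is added as a practical guard, which is not part of the thesis), and the function and variable names are illustrative only.

```python
import numpy as np

def whitened_lda(X, labels, n_classes):
    """Sketch of the whitened LDA transform (Eqs. 76-81).

    X      : (N, H) matrix of normalized training feature vectors
    labels : (N,) integer class labels in [0, n_classes)
    Returns the (H, n_classes - 1) transformation matrix A_WLDA.
    """
    overall_mean = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(n_classes):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                  # Eq. 76
        diff = (mc - overall_mean)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)            # Eq. 77

    # Whitening: Sw Phi = Phi Lambda, W = Phi Lambda^(-1/2)   (Eq. 79)
    eigval, Phi = np.linalg.eigh(Sw)
    eigval = np.maximum(eigval, 1e-12)   # guard against a singular Sw (added)
    W = Phi @ np.diag(1.0 / np.sqrt(eigval))
    Sb_w = W.T @ Sb @ W                                # whitened Sb

    # Eigenvectors of the whitened Sb; keep the (C-1) largest (Psi)
    eigval_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(eigval_b)[::-1][:n_classes - 1]
    return W @ Psi[:, order]                           # Eq. 80

# y = A_WLDA.T @ x then maps an H-dimensional vector to C-1 dimensions (Eq. 81).
```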
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n}$ (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
$s = \arg\min_{1\le c\le C} d(\mathbf{y}, \bar{\mathbf{y}}_c)$ (83)
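A minimal sketch of this nearest centroid decision rule (Eqs. 82-83) is given below, assuming the class centroids have already been computed from the whitened LDA transformed training vectors; the names used here are illustrative only.

```python
import numpy as np

def classify(y, centroids):
    """Nearest-centroid genre classification (Eqs. 82-83).

    y         : (h,) whitened-LDA-transformed feature vector of a test track
    centroids : (C, h) matrix whose c-th row is the class centroid (Eq. 82)
    Returns the index of the predicted genre.
    """
    distances = np.linalg.norm(centroids - y, axis=1)   # Euclidean distances
    return int(np.argmin(distances))                     # Eq. 83

# Each centroid is the mean of the transformed training vectors of one genre,
# e.g. centroids[c] = Y_train[labels == c].mean(axis=0).
```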
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6) Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop and World In summary the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop and 122/122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
$CA = \sum_{1\le c\le C} P_c \cdot CA_c$ (84)
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
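As a small illustration, the weighted overall accuracy of Eq. (84) can be computed as follows; the function name is illustrative and the class probabilities are estimated from the class counts of the test set.

```python
import numpy as np

def overall_accuracy(per_class_accuracy, class_counts):
    """Weighted overall classification accuracy of Eq. (84)."""
    p = np.asarray(class_counts) / np.sum(class_counts)          # P_c
    return float(np.sum(p * np.asarray(per_class_accuracy)))     # sum P_c * CA_c

# e.g. the ISMIR2004 test split has 320, 114, 26, 45, 102, 122 tracks per genre.
```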
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table SMMFCC1, SMOSC1 and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

Feature Set CA (%)
SMMFCC1 77.50
SMOSC1 79.15
SMASE1 77.78
SMMFCC1+SMOSC1+SMASE1 84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table SMMFCC2, SMOSC2 and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector again gets the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

Feature Set CA (%)
SMMFCC2 70.64
SMOSC2 68.59
SMASE2 71.74
SMMFCC2+SMOSC2+SMASE2 78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC, OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set CA (%)
SMMFCC3 80.38
SMOSC3 81.34
SMASE3 81.21
SMMFCC3+SMOSC3+SMASE3 85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs achieves better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy of the MSC&MSV features and the modulation spectral energy (MSE) for each feature set

Feature Set MSCs & MSVs MSE
SMMFCC1 77.50 72.02
SMMFCC2 70.64 69.82
SMMFCC3 80.38 79.15
SMOSC1 79.15 77.50
SMOSC2 68.59 70.51
SMOSC3 81.34 80.11
SMASE1 77.78 76.41
SMASE2 71.74 71.06
SMASE3 81.21 79.15
SMMFCC1+SMOSC1+SMASE1 84.64 85.08
SMMFCC2+SMOSC2+SMASE2 78.60 79.01
SMMFCC3+SMOSC3+SMASE3 85.32 85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119ndash139
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{OSC} = [u_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots, u_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), u_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T   (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T   (61)

In summary, the row-based modulation spectral feature vector is of size 4D = 4×20 = 80 and the column-based one is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:
u_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - u_{MSC-row}^{NASE}(d) \right)^2 \right)^{1/2}   (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - u_{MSV-row}^{NASE}(d) \right)^2 \right)^{1/2}   (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
f_{row}^{NASE} = [u_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, u_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), u_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:
u_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - u_{MSC-col}^{NASE}(j) \right)^2 \right)^{1/2}   (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - u_{MSV-col}^{NASE}(j) \right)^2 \right)^{1/2}   (70)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{NASE} = [u_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, u_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), u_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T   (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T   (72)

In summary, the row-based modulation spectral feature vector is of size 4D = 4×19 = 76 and the column-based one is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
[Fig 28 The row-based modulation spectral features: each row of the MSC/MSV matrix (one feature dimension followed along the modulation-frequency axis within a texture window) is summarized by a mean u_row and a standard deviation σ_row]

[Fig 29 The column-based modulation spectral features: each column of the MSC/MSV matrix (one modulation subband followed along the feature-dimension axis within a texture window) is summarized by a mean u_col and a standard deviation σ_col]
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:
\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C   (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
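A minimal sketch of this linear normalization, assuming NumPy and that all training feature vectors are stacked into one matrix (the function names are illustrative, not from the thesis):

import numpy as np

def fit_minmax(train_vectors):
    """train_vectors: (N, M) array of all training feature vectors.
    Returns the per-dimension minimum and maximum of Eq. (75)."""
    return train_vectors.min(axis=0), train_vectors.max(axis=0)

def normalize(f, f_min, f_max):
    """Map each feature value into [0, 1] as in Eq. (74).
    Assumes f_max(m) > f_min(m) for every dimension m."""
    return (f - f_min) / (f_max - f_min)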
22 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right)   (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening-transformed by ΦΛ^{-1/2}:

w = (\Phi \Lambda^{-1/2})^T x   (79)
It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi   (80)
A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x   (81)
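The whitened LDA transform of Eqs. (76)-(81) can be sketched as follows (an assumed NumPy implementation for illustration; it presumes S_W is non-singular and is not the author's original code):

import numpy as np

def whitened_lda(X, labels):
    """X: (N, H) training vectors, labels: (N,) class labels.
    Returns the H x (C-1) matrix A_WLDA of Eq. (80); project with y = A.T @ x (Eq. 81)."""
    classes = np.unique(labels)
    H = X.shape[1]
    x_bar = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                      # Eq. (76)
        d = (mc - x_bar)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)                      # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                          # Sw = Phi Lambda Phi^T
    W = Phi @ np.diag(1.0 / np.sqrt(lam))                  # whitening matrix Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                                    # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    keep = np.argsort(lam_b)[::-1][: len(classes) - 1]     # keep the (C-1) largest eigenvalues
    return W @ Psi[:, keep]                                # Eq. (80)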
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_WLDA. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)
where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
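A small illustrative sketch of the nearest centroid rule of Eqs. (82)-(83), again assuming NumPy (the helper names are hypothetical, not from the thesis):

import numpy as np

def genre_centroids(Y, labels, classes):
    """Y: (N, h) whitened-LDA-transformed training vectors. Eq. (82)."""
    return np.array([Y[labels == c].mean(axis=0) for c in classes])

def classify(y, centroids, classes):
    """Assign y to the genre whose centroid has minimum Euclidean distance, Eq. (83)."""
    return classes[int(np.argmin(np.linalg.norm(centroids - y, axis=1)))]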
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of JazzBlue, 45/45 tracks of MetalPunk, 101/102 tracks of RockPop, and 122/122 tracks of the World music genre.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
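As a quick check of Eq. (84), the per-genre test-set sizes together with the correctly classified counts read from the diagonal of Table 36(d) reproduce the reported overall accuracy (a small illustrative Python snippet):

# Per-genre test-set sizes and the correctly classified counts on the diagonal
# of Table 36(d) (the combined SMMFCC3+SMOSC3+SMASE3 system).
n_test  = {"Classical": 320, "Electronic": 114, "JazzBlue": 26,
           "MetalPunk": 45, "RockPop": 102, "World": 122}
correct = {"Classical": 300, "Electronic": 95, "JazzBlue": 20,
           "MetalPunk": 35, "RockPop": 79, "World": 93}

total = sum(n_test.values())   # 729 test tracks
# Eq. (84): CA = sum_c P_c * CA_c, with P_c = n_c / total and CA_c = correct_c / n_c
ca = sum((n_test[g] / total) * (correct[g] / n_test[g]) for g in n_test)
print(f"{100 * ca:.2f}%")      # 85.32%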
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA, in %) for the row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                           77.50
SMOSC1                            79.15
SMASE1                            77.78
SMMFCC1+SMOSC1+SMASE1             84.64
Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        275        0        2        0        1      19
Electronic       0       91        0        1        7       6
Jazz             6        0       18        0        0       4
MetalPunk        2        3        0       36       20       4
PopRock          4       12        5        8       70      14
World           33        8        1        0        4      75
Total          320      114       26       45      102     122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.94      0.00     7.69     0.00     0.98    15.57
Electronic    0.00     79.82     0.00     2.22     6.86     4.92
Jazz          1.88      0.00    69.23     0.00     0.00     3.28
MetalPunk     0.63      2.63     0.00    80.00    19.61     3.28
PopRock       1.25     10.53    19.23    17.78    68.63    11.48
World        10.31      7.02     3.85     0.00     3.92    61.48

(b) SMOSC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        292        1        1        0        2      10
Electronic       1       89        1        2       11      11
Jazz             4        0       19        1        1       6
MetalPunk        0        5        0       32       21       3
PopRock          0       13        3       10       61       8
World           23        6        2        0        6      84
Total          320      114       26       45      102     122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      91.25      0.88     3.85     0.00     1.96     8.20
Electronic    0.31     78.07     3.85     4.44    10.78     9.02
Jazz          1.25      0.00    73.08     2.22     0.98     4.92
MetalPunk     0.00      4.39     0.00    71.11    20.59     2.46
PopRock       0.00     11.40    11.54    22.22    59.80     6.56
World         7.19      5.26     7.69     0.00     5.88    68.85

(c) SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        286        3        1        0        3      18
Electronic       0       87        1        1        9       5
Jazz             5        4       17        0        0       9
MetalPunk        0        4        1       36       18       4
PopRock          1       10        3        7       68      13
World           28        6        3        1        4      73
Total          320      114       26       45      102     122

(c) SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      89.38      2.63     3.85     0.00     2.94    14.75
Electronic    0.00     76.32     3.85     2.22     8.82     4.10
Jazz          1.56      3.51    65.38     0.00     0.00     7.38
MetalPunk     0.00      3.51     3.85    80.00    17.65     3.28
PopRock       0.31      8.77    11.54    15.56    66.67    10.66
World         8.75      5.26    11.54     2.22     3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        1        0        0       9
Electronic       0       96        1        1        9       9
Jazz             2        1       21        0        0       1
MetalPunk        0        1        0       34        8       1
PopRock          1        9        2        9       80      16
World           17        7        1        1        5      86
Total          320      114       26       45      102     122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     3.85     0.00     0.00     7.38
Electronic    0.00     84.21     3.85     2.22     8.82     7.38
Jazz          0.63      0.88    80.77     0.00     0.00     0.82
MetalPunk     0.00      0.88     0.00    75.56     7.84     0.82
PopRock       0.31      7.89     7.69    20.00    78.43    13.11
World         5.31      6.14     3.85     2.22     4.90    70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, the combined feature vector gets the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA, in %) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                           70.64
SMOSC2                            68.59
SMASE2                            71.74
SMMFCC2+SMOSC2+SMASE2             78.60
Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1        1        0        6      22
Electronic       0       84        0        2        8       4
Jazz            13        1       19        1        2      19
MetalPunk        2        7        0       39       30       4
PopRock          0       11        3        3       47      19
World           33       10        3        0        9      54
Total          320      114       26       45      102     122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.00      0.88     3.85     0.00     5.88    18.03
Electronic    0.00     73.68     0.00     4.44     7.84     3.28
Jazz          4.06      0.88    73.08     2.22     1.96    15.57
MetalPunk     0.63      6.14     0.00    86.67    29.41     3.28
PopRock       0.00      9.65    11.54     6.67    46.08    15.57
World        10.31      8.77    11.54     0.00     8.82    44.26

(b) SMOSC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2        0        0        3      33
Electronic       0       83        0        1        9       6
Jazz            17        1       20        0        6      20
MetalPunk        1        5        0       33       21       2
PopRock          0       17        4       10       51      10
World           40        6        2        1       12      51
Total          320      114       26       45      102     122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      81.88      1.75     0.00     0.00     2.94    27.05
Electronic    0.00     72.81     0.00     2.22     8.82     4.92
Jazz          5.31      0.88    76.92     0.00     5.88    16.39
MetalPunk     0.31      4.39     0.00    73.33    20.59     1.64
PopRock       0.00     14.91    15.38    22.22    50.00     8.20
World        12.50      5.26     7.69     2.22    11.76    41.80

(c) SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0        0        0        2      29
Electronic       0       83        0        1        5       2
Jazz             9        3       17        1        2      15
MetalPunk        1        5        1       35       24       7
PopRock          2       13        1        8       57      15
World           31       10        7        0       12      54
Total          320      114       26       45      102     122

(c) SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      86.56      0.00     0.00     0.00     1.96    23.77
Electronic    0.00     72.81     0.00     2.22     4.90     1.64
Jazz          2.81      2.63    65.38     2.22     1.96    12.30
MetalPunk     0.31      4.39     3.85    77.78    23.53     5.74
PopRock       0.63     11.40     3.85    17.78    55.88    12.30
World         9.69      8.77    26.92     0.00    11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5        0        0        3      18
Electronic       0       89        0        2        4       4
Jazz             2        3       19        0        1      10
MetalPunk        2        2        0       38       21       2
PopRock          0       12        5        4       61      11
World           27        3        2        1       12      77
Total          320      114       26       45      102     122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      90.31      4.39     0.00     0.00     2.94    14.75
Electronic    0.00     78.07     0.00     4.44     3.92     3.28
Jazz          0.63      2.63    73.08     0.00     0.98     8.20
MetalPunk     0.63      1.75     0.00    84.44    20.59     1.64
PopRock       0.00     10.53    19.23     8.89    59.80     9.02
World         8.44      2.63     7.69     2.22    11.76    63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA, in %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                           80.38
SMOSC3                            81.34
SMASE3                            81.21
SMMFCC3+SMOSC3+SMASE3             85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        1        0        3      19
Electronic       0       86        0        1        7       5
Jazz             2        0       18        0        0       3
MetalPunk        1        4        0       35       18       2
PopRock          1       16        4        8       67      13
World           16        6        3        1        7      80
Total          320      114       26       45      102     122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     3.85     0.00     2.94    15.57
Electronic    0.00     75.44     0.00     2.22     6.86     4.10
Jazz          0.63      0.00    69.23     0.00     0.00     2.46
MetalPunk     0.31      3.51     0.00    77.78    17.65     1.64
PopRock       0.31     14.04    15.38    17.78    65.69    10.66
World         5.00      5.26    11.54     2.22     6.86    65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        0        0        1      13
Electronic       0       90        1        2        9       6
Jazz             0        0       21        0        0       4
MetalPunk        0        2        0       31       21       2
PopRock          0       11        3       10       64      10
World           20       11        1        2        7      87
Total          320      114       26       45      102     122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     0.00     0.00     0.98    10.66
Electronic    0.00     78.95     3.85     4.44     8.82     4.92
Jazz          0.00      0.00    80.77     0.00     0.00     3.28
MetalPunk     0.00      1.75     0.00    68.89    20.59     1.64
PopRock       0.00      9.65    11.54    22.22    62.75     8.20
World         6.25      9.65     3.85     4.44     6.86    71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2        1        0        0      17
Electronic       1       91        0        1        4       3
Jazz             0        2       19        0        0       5
MetalPunk        0        2        1       34       20       8
PopRock          2       13        4        8       71       8
World           21        4        1        2        7      81
Total          320      114       26       45      102     122

(c) SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      92.50      1.75     3.85     0.00     0.00    13.93
Electronic    0.31     79.82     0.00     2.22     3.92     2.46
Jazz          0.00      1.75    73.08     0.00     0.00     4.10
MetalPunk     0.00      1.75     3.85    75.56    19.61     6.56
PopRock       0.63     11.40    15.38    17.78    69.61     6.56
World         6.56      3.51     3.85     4.44     6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        0        0        0       8
Electronic       2       95        0        2        7       9
Jazz             1        1       20        0        0       0
MetalPunk        0        0        0       35       10       1
PopRock          1       10        3        7       79      11
World           16        6        3        1        6      93
Total          320      114       26       45      102     122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     0.00     0.00     0.00     6.56
Electronic    0.63     83.33     0.00     4.44     6.86     7.38
Jazz          0.31      0.88    76.92     0.00     0.00     0.00
MetalPunk     0.00      0.00     0.00    77.78     9.80     0.82
PopRock       0.31      8.77    11.54    15.56    77.45     9.02
World         5.00      5.26    11.54     2.22     5.88    76.23
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs yields
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) for each feature set

Feature Set                    MSCs & MSVs      MSE
SMMFCC1                            77.50       72.02
SMMFCC2                            70.64       69.82
SMMFCC3                            80.38       79.15
SMOSC1                             79.15       77.50
SMOSC2                             68.59       70.51
SMOSC3                             81.34       80.11
SMASE1                             77.78       76.41
SMASE2                             71.74       71.06
SMASE3                             81.21       79.15
SMMFCC1+SMOSC1+SMASE1              84.64       85.08
SMMFCC2+SMOSC2+SMASE2              78.60       79.01
SMMFCC3+SMOSC3+SMASE3              85.32       85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch, "A hierarchical approach to automatic musical genre classification," in Proc of the 6th Int Conf on Digital Audio Effects, pp 8-11, September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara, "Music genre classification with taxonomy," in Proc of IEEE Int Conf on Acoustics Speech and Signal Processing, Vol 5, pp 197-200, March 2005
[16] J J Aucouturier and F Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol 32, No 1, pp 83-93, 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley, "Beat tracking with a two state model," in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis, A Ermolinskyi and P Cook, "Pitch Histogram in Audio and Symbolic Music Information Retrieval," in Proc IRCAM, 2002
[21] T Tolonen and M Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol 8, No 6, pp 708-716, November 2000
[22] R Meddis and L O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol 102, No 3, pp 1811-1820, September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury, N Morgan and S Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun, Vol 25, No 1, pp 117-132, 1998
[25] S Sukittanon, L E Atlas and J W Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol 52, No 10, pp 3023-3035, October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy, S Krishnan and R K Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, Vol 15, Issue 4, pp 1236-1246, May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra, N Casagrande, D Erhan, D Eck, B Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139
10
features directly from the audio data The mean and standard deviations of the
distances and angles in the phase space with an embedding dimension of two and unit
time lag were used
122 Linear discriminant analysis (LDA)
LDA [28] aims at improving the classification accuracy at a lower dimensional
feature vector space LDA deals with discrimination between classes rather than
representations of various classes The goal of LDA is to minimize the within-class
distance while maximize the between-class distance In LDA an optimal
transformation matrix from an n-dimensional feature space to d-dimensional space is
determined where d le n The transformation should enhance the separability among
different classes The optimal transformation matrix can be exploited to map each
n-dimensional feature vector into a d-dimensional vector The detailed steps will be
described in Chapter 2
In LDA each class is generally modeled by a single Gaussian distribution In fact the
music signal is too complexity to be modeled by a single Gaussian distribution In
addition the same transformation matrix of LDA is used for all the classes which
doesnrsquot consider the class-wise differences
122 Feature Classifier
Tzanetakis and Cook [1] combined timbral features rhythmic features and pitch
features with GMM classifier to their music genre classification system The
hierarchical genres adopted in their music classification system are Classical Country
Disco Hip-Hop Jazz Rock Blues Reggae Pop and Metal In Classical the
11
sub-genres contain Choir Orchestra Piano and String Quarter In Jazz the
sub-genres contain BigBand Cool Fusion Piano Quarter and Swing The
experiment result shows that GMM with three components achieves the best
classification accuracy
West and Cox [4] constructed a hierarchical framed based music genre
classification system In their classification system a majority vote is taken to decide
the final classification The genres adopted in their music classification system are
Rock Classical Heavy Metal Drum Bass Reggae and Jungle They take MFCC
and OSC as features and compare the performance withwithout decision tree
classifier of Gaussian classifier GMM with three components and LDA In their
experiment the feature vector with GMM classifier and decision tree classifier has
best accuracy 8279
Xu et al [29] applied SVM to discriminate between pure music and vocal one
The SVM learning algorithm is applied to obtain the classification parameters
according to the calculated features It is demonstrated that SVM achieves better
performance than traditional Euclidean distance methods and hidden Markov model
(HMM) methods
Esmaili et al [30] use some low-level features (MFCC entropy centroid
bandwidth etc) and LDA for music genre classification In their system the
classification accuracy is 930 for the classification of five music genres Rock
Classical Folk Jazz and Pop
Bagci and Erzin [8] constructed a novel frame-based music genre classification
system In their classification system some invalid frames are first detected and
discarded for classification purpose To determine whether a frame is valid or not a
GMM model is constructed for each music genre These GMM models are then used
12
to sift the frames which are unable to be correctly classified and each GMM model of
a music genre is updated for each correctly classified frame Moreover a GMM
model is employed to represent the invalid frames In their experiment the feature
vector includes 13 MFCC 4 spectral shape features (spectral centroid spectral
roll-off spectral flux and zero-crossing rate) as well as the first- and second-order
derivative of these timbral features Their musical genre dataset includes ten genre
types Blues Classical Country Disco Hip-hop Jazz Metal Pop Reggae and Rock
The classification accuracy can up to 8860 when the frame length is 30s and each
GMM is modeled by 48 Gaussian distributions
Umapathy et al [31] used local discriminant bases (LDB) technique to measure
the dissimilarity of the LDB nodes of any two classes and extract features from these
high-dissimilarity LDB nodes First they use the wavelet packet tree decomposition
to construct a five-level tree for a music signal Then two novel features the energy
distribution over frequencies (D1) and nonstationarity index (D2) are used to measure
the dissimilarity of the LDB nodes of any two classes In their classification system
the feature dimension is 30 including the energies and variances of the basis vector
coefficients of the first 15 high dissimilarity nodes The experiment results show that
when the LDB feature vector is combined with MFCC and by using LDA analysis
the average classification accuracy for the first level is 91 (artificial and natural
sounds) for the second level is 99 (instrumental and automobile human and
nonhuman) and 95 for the third level (drums flute and piano aircraft and
helicopter male and female speech animals birds and insects)
Grimaldi et al [11 32] used a set of features based on discrete wavelet packet
transform (DWPT) to represent a music track Discrete wavelet transform (DWT) is a
well-known signal analysis methodology able to approximate a real signal at different
13
scales both in time and frequency domain Taking into account the non-stationary
nature of the input signal the DWT provides an approximation with excellent time
and frequency resolution WPT is a variant of DWT which is achieved by recursively
convolving the input signal with a pair of low pass and high pass filters Unlike DWT
that recursively decomposes only the low-pass subband the WPDT decomposes both
bands at each level
Bergatra et al [33] used AdaBoost for music classification AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12 A detailed description of each module
will be described below
14
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
]1[ˆ][][ˆ minustimesminus= nsansns (1)
where s[n] is the current sample and s[nminus1] is the previous sample a typical
value for is 095 a
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
][ ][ˆ][~ nwnsns ii = 10 minuslele Nn (2)
where the Hamming window function w[n] is defined as
)1
2cos( 460540][minus
minus=N
nnw π 10 minuslele Nn (3)
15
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
][~][1
0
2
summinus
=
minus=
N
n
nNkj
ii enskXπ
10 minuslele Nk (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (5)
Where B is the total number of filters (B is 25 in the study) Ibl and Ibh
denote respectively the low-frequency index and high-frequency index of the
b-th band-pass filter Ai[k] is the squared amplitude of Xi[k] that is
|][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (6)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
0 ))50(cos( ))(1(log)(1
010 Llb
BlbElMFCC
B
bi ltle++= sum
minus
=
π (7)
where L is the length of MFCC feature vector (L is 20 in the study)
16
Therefore the MFCC feature vector can be represented as follows
xMFCC = [MFCC(0) MFCC(1) hellip MFCC(L-1)]T (8)
Fig 21 The flowchart for computing MFCC
Pre-emphasis
Input Signal
Framing
Windowing
FFT
Mel-scale band-pass filtering
DCT
MFCC
17
Table 21 The range of each triangular band-pass filter
Filter number Frequency interval (Hz) 0 (0 200] 1 (100 300] 2 (200 400] 3 (300 500] 4 (400 600] 5 (500 700] 6 (600 800] 7 (700 900] 8 (800 1000] 9 (900 1149] 10 (1000 1320] 11 (1149 1516] 12 (1320 1741] 13 (1516 2000] 14 (1741 2297] 15 (2000 2639] 16 (2297 3031] 17 (2639 3482] 18 (3031 4000] 19 (3482 4595] 20 (4000 5278] 20 (4595 6063] 22 (5278 6964] 23 (6063 8000] 24 (6964 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
18
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (9)
where B is the number of subbands Ibl and Ibh denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter
Ai[k] is the squared amplitude of Xi[k] that is |][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
19
)1log()(1
sum=
=bN
iib
b
MN
bPeakα
α (11)
)1log()(1
1sum=
+minus=b
b
N
iiNb
b
MN
bValleyα
α (12)
where α is a neighborhood factor (α is 02 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
)( )()( bValleybPeakbSC minus= (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0) hellip Valley(B-1) SC(0) hellip SC(B-1)]T (14)
Input Signal
Framing
Octave scale filtering
PeakValley Selection
Spectral Contrast
OSC
FFT
Fig 22 The flowchart for computing OSC
20
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 441 kHz)
Filter number Frequency interval (Hz)0 [0 0] 1 (0 100] 2 (100 200] 3 (200 400] 4 (400 800] 5 (800 1600] 6 (1600 3200] 7 (3200 6400] 8 (6400 12800] 9 (12800 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum notated X(k) 1 le k le N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
20|)(|2
2 0|)(|1
)(2
2
⎪⎪⎩
⎪⎪⎨
⎧
ltltsdot
=sdot=
NkkXEN
NkkXENkP
w
w (15)
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a
spectrum of 8 octave interval (see Fig24) The NASE scale filtering
operation can be described as follows(see Table 23)
)()(
sum=
=hb
lb
I
Ikii kPbASE 120 0 minusleleltle NkBb
(17)
where B is the number of logarithmic subbands within the frequency range
[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of
the frequency subbands ranging from 116 of an octave to 8 octaves(B=16
r=12 in the study)
(18) 34 octaves 2 leleminus= jr j
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
22
spectrum coefficients within this subband
(20) 10 )()(
+lele= sum=
BbkPbASEhb
lb
I
Ik
Each ASE coefficient is then converted to the decibel scale
10 ))((log 10)( 10 +lele= BbbASEbASEdB (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
10 )()( +lele= BbR
bASEbNASE dB (22)
where the RMS-norm gain value R is defined as
))((1
0
2sum+
=
=B
bdB bASER (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R NASE(0) NASE(1) hellip NASE(B+1)]T (24)
23
Framing
Input Signal
Windowing
FFT
Normalized Audio Spectral Envelope
NASE
Subband Decomposition
Fig 23 The flowchart for computing NASE
625 125 250 500 1K 2K 4K 8K 16K
884 1768 3536 7071 14142 28284 56569 113137
1 coeff 16 coeffs 1 coeff
loEdge hiEdge
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 12
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2.1.5.3 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d) \right)^{2} \right)^{1/2}   (63)

\mu_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d) \right)^{2} \right)^{1/2}   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), \mu_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, \mu_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), \mu_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^{T}   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j) \right)^{2} \right)^{1/2}   (68)

\mu_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j) \right)^{2} \right)^{1/2}   (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), \mu_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, \mu_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), \mu_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^{T}   (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^{T}, (f_{col}^{NASE})^{T}]^{T}   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors therefore results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
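If the three aggregated feature sets are concatenated, the combined representation has 112 (SMMFCC) + 112 (SMOSC) + 108 (SMASE) = 332 feature values per music track; this total follows directly from the dimensions given above.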
[Fig. 2.8 The row-based modulation spectral features: for each feature dimension d of the texture window, the mean (μ_row-d) and standard deviation (σ_row-d) of MSC(j, d) and MSV(j, d) are taken across the modulation-frequency subbands j.]

[Fig. 2.9 The column-based modulation spectral features: for each modulation-frequency subband j, the mean (μ_col-j) and standard deviation (σ_col-j) of MSC(j, d) and MSV(j, d) are taken across the feature dimensions d.]
2.1.6 Feature Vector Normalization
In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 ≤ c ≤ C   (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 ≤ c ≤ C,\, 1 ≤ j ≤ N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 ≤ c ≤ C,\, 1 ≤ j ≤ N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
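A minimal sketch of this min-max normalization (illustrative only; the function names are not from the thesis), assuming the training feature vectors are stacked row-wise in a NumPy array:

    import numpy as np

    def fit_min_max(train_vectors):
        # Eq. (75): per-dimension extrema over all training music pieces.
        return train_vectors.min(axis=0), train_vectors.max(axis=0)

    def linear_normalize(f, f_min, f_max, eps=1e-12):
        # Eq. (74); eps (an addition, not in the thesis) avoids division by zero
        # for feature dimensions that are constant over the training set.
        return (f - f_min) / (f_max - f_min + eps)

    # Usage: f_min, f_max = fit_min_max(X_train); f_hat = linear_normalize(f_c, f_min, f_max)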
2.2 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among the various music classes.
Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^{T}   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^{T}   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = \mathrm{tr}\left( (A^{T} S_W A)^{-1} (A^{T} S_B A) \right)   (78)
From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening-transformed by ΦΛ^{-1/2}:

x_w = (ΦΛ^{-1/2})^{T} x   (79)

It can be shown that the whitened within-class scatter matrix S_W^{w} = (ΦΛ^{-1/2})^{T} S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^{w} = (ΦΛ^{-1/2})^{T} S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^{w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_{WLDA} = ΦΛ^{-1/2} Ψ   (80)

A_WLDA is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^{T} x   (81)
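The whitened LDA transform can be summarized by the following sketch (NumPy-based, assuming a non-singular S_W; it illustrates Eqs. (76)-(81) and is not the author's implementation):

    import numpy as np

    def whitened_lda(X, y, h):
        # X: (N, H) training matrix, y: (N,) integer class labels, h <= C-1.
        classes = np.unique(y)
        H = X.shape[1]
        mean_all = X.mean(axis=0)
        Sw = np.zeros((H, H))
        Sb = np.zeros((H, H))
        for c in classes:
            Xc = X[y == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)              # Eq. (76)
            d = (mc - mean_all)[:, None]
            Sb += Xc.shape[0] * (d @ d.T)              # Eq. (77)
        lam, Phi = np.linalg.eigh(Sw)                  # S_W Phi = Phi Lambda
        W = Phi @ np.diag(1.0 / np.sqrt(lam))          # whitening matrix Phi Lambda^(-1/2)
        Sb_w = W.T @ Sb @ W                            # whitened between-class scatter
        lam_b, Psi = np.linalg.eigh(Sb_w)
        order = np.argsort(lam_b)[::-1][:h]            # h largest eigenvalues
        return W @ Psi[:, order]                       # A_WLDA, Eq. (80)

    # Usage: A = whitened_lda(X_train, labels, h); y_vec = A.T @ x   (Eq. 81)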
2.3 Music Genre Classification Phase
In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)
where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 ≤ c ≤ C} d(y, \bar{y}_c)   (83)
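For illustration, the nearest-centroid decision of Eqs. (82)-(83) can be written as follows (a sketch with hypothetical helper names, not the author's code):

    import numpy as np

    def genre_centroids(Y, labels):
        # Eq. (82): mean transformed vector per genre c.
        return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

    def classify(y_vec, centroids):
        # Eq. (83): pick the genre whose centroid is closest in Euclidean distance.
        return min(centroids, key=lambda c: np.linalg.norm(y_vec - centroids[c]))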
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, of which 729 are used for training and the other 729 for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.
Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 ≤ c ≤ C} P_c \cdot CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
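As a worked check of Eq. (84), the per-genre test-set proportions together with the per-genre accuracies on the diagonal of the combined-feature confusion matrix in Table 3.6(d) reproduce the overall accuracy of the best configuration reported in Section 3.3:

    import numpy as np

    n_test = np.array([320, 114, 26, 45, 102, 122])    # tracks per genre in the test set
    p = n_test / n_test.sum()                           # P_c
    ca = np.array([93.75, 83.33, 76.92, 77.78, 77.45, 76.23])  # CA_c (%), Table 3.6(d)
    print(round(float(p @ ca), 2))                      # -> 85.32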
3.1 Comparison of row-based modulation spectral feature vectors
Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.
Table 3.1 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector
Feature Set   CA (%)
SMMFCC1   77.50
SMOSC1   79.15
SMASE1   77.78
SMMFCC1+SMOSC1+SMASE1   84.64
Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors (each column corresponds to the actual genre, each row to the classified genre): (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19
Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4
MetalPunk 2 3 0 36 20 4
PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75
Total 320 114 26 45 102 122

(a) SMMFCC1, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.94 0.00 7.69 0.00 0.98 15.57
Electronic 0.00 79.82 0.00 2.22 6.86 4.92
Jazz 1.88 0.00 69.23 0.00 0.00 3.28
MetalPunk 0.63 2.63 0.00 80.00 19.61 3.28
PopRock 1.25 10.53 19.23 17.78 68.63 11.48
World 10.31 7.02 3.85 0.00 3.92 61.48

(b) SMOSC1, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10
Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6
MetalPunk 0 5 0 32 21 3
PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84
Total 320 114 26 45 102 122

(b) SMOSC1, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 91.25 0.88 3.85 0.00 1.96 8.20
Electronic 0.31 78.07 3.85 4.44 10.78 9.02
Jazz 1.25 0.00 73.08 2.22 0.98 4.92
MetalPunk 0.00 4.39 0.00 71.11 20.59 2.46
PopRock 0.00 11.40 11.54 22.22 59.80 6.56
World 7.19 5.26 7.69 0.00 5.88 68.85

(c) SMASE1, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18
Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9
MetalPunk 0 4 1 36 18 4
PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73
Total 320 114 26 45 102 122

(c) SMASE1, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 89.38 2.63 3.85 0.00 2.94 14.75
Electronic 0.00 76.32 3.85 2.22 8.82 4.10
Jazz 1.56 3.51 65.38 0.00 0.00 7.38
MetalPunk 0.00 3.51 3.85 80.00 17.65 3.28
PopRock 0.31 8.77 11.54 15.56 66.67 10.66
World 8.75 5.26 11.54 2.22 3.92 59.84

(d) SMMFCC1+SMOSC1+SMASE1, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9
Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1
MetalPunk 0 1 0 34 8 1
PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86
Total 320 114 26 45 102 122

(d) SMMFCC1+SMOSC1+SMASE1, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 3.85 0.00 0.00 7.38
Electronic 0.00 84.21 3.85 2.22 8.82 7.38
Jazz 0.63 0.88 80.77 0.00 0.00 0.82
MetalPunk 0.00 0.88 0.00 75.56 7.84 0.82
PopRock 0.31 7.89 7.69 20.00 78.43 13.11
World 5.31 6.14 3.85 2.22 4.90 70.49
3.2 Comparison of column-based modulation spectral feature vectors
Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector gives the best performance. Table 3.4 shows the corresponding confusion matrices.
Table 3.3 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector
Feature Set   CA (%)
SMMFCC2   70.64
SMOSC2   68.59
SMASE2   71.74
SMMFCC2+SMOSC2+SMASE2   78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22
Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19
MetalPunk 2 7 0 39 30 4
PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54
Total 320 114 26 45 102 122

(a) SMMFCC2, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26

(b) SMOSC2, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33
Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20
MetalPunk 1 5 0 33 21 2
PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51
Total 320 114 26 45 102 122

(b) SMOSC2, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80

(c) SMASE2, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122

(c) SMASE2, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26

(d) SMMFCC2+SMOSC2+SMASE2, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) SMMFCC2+SMOSC2+SMASE2, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11
3.3 Combination of row-based and column-based modulation spectral feature vectors
Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.
Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set   CA (%)
SMMFCC3   80.38
SMOSC3   81.34
SMASE3   81.21
SMMFCC3+SMOSC3+SMASE3   85.32
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) SMMFCC3, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57

(b) SMOSC3, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) SMOSC3, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) SMASE3, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) SMASE3, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39

(d) SMMFCC3+SMOSC3+SMASE3, number of tracks
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) SMMFCC3+SMOSC3+SMASE3, classification accuracy (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 3.7 Comparison of the averaged classification accuracy (%) when MSCs & MSVs or the modulation subband energy (MSE) are used as the feature values
Feature Set   MSCs & MSVs   MSE
SMMFCC1   77.50   72.02
SMMFCC2   70.64   69.82
SMMFCC3   80.38   79.15
SMOSC1   79.15   77.50
SMOSC2   68.59   70.51
SMOSC3   81.34   80.11
SMASE1   77.78   76.41
SMASE2   71.74   71.06
SMASE3   81.21   79.15
SMMFCC1+SMOSC1+SMASE1   84.64   85.08
SMMFCC2+SMOSC2+SMASE2   78.60   79.01
SMMFCC3+SMOSC3+SMASE3   85.32   85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre: a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using the modulation spectrogram" Speech Commun Vol 25 No 1 pp 117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for content identification" IEEE Transactions on Signal Processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and classification using local discriminant bases" IEEE Transactions on Audio Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of online learning and an application to boosting' Journal of Computer and System Sciences 55(1) 119-139
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let S_W and S_B denote the within-class scatter matrix and the between-class scatter
matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class
c, C is the total number of music classes, and N_c is the number of training vectors
labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{-1} (A^T S_B A) )   (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the
corresponding eigenvalues, so that S_W Φ = ΦΛ. Each training vector x is then
whitening transformed by ΦΛ^{-1/2}:

x_w = (Φ Λ^{-1/2})^T x   (79)

It can be shown that the whitened within-class scatter matrix
S_W^w = (Φ Λ^{-1/2})^T S_W (Φ Λ^{-1/2}), derived from all the whitened training vectors,
becomes an identity matrix I. Thus the whitened between-class scatter matrix
S_B^w = (Φ Λ^{-1/2})^T S_B (Φ Λ^{-1/2}) contains all the discriminative information. A
transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w.
Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors
corresponding to the (C-1) largest eigenvalues form the column vectors of the
transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix
A_WLDA is defined as

A_WLDA = Φ Λ^{-1/2} Ψ   (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower
h-dimensional vector. Let x denote an H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

y = A_WLDA^T x   (81)
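The whitened LDA transform can be summarized by the following illustrative Python/NumPy sketch (not the author's code); X holds the training feature vectors row-wise, labels holds their genre indices, and the small eps term is an assumed guard against near-zero eigenvalues.

import numpy as np

def whitened_lda(X, labels, eps=1e-10):
    classes = np.unique(labels)
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                  # Eq (76)
        d = (mc - mean_all)[:, None]
        Sb += Xc.shape[0] * (d @ d.T)                  # Eq (77)
    # whitening: Sw = Phi Lambda Phi^T, W = Phi Lambda^(-1/2)
    lam, Phi = np.linalg.eigh(Sw)
    W = Phi / np.sqrt(lam + eps)
    Sb_w = W.T @ Sb @ W                                # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    idx = np.argsort(lam_b)[::-1][:len(classes) - 1]   # (C-1) largest eigenvalues
    return W @ Psi[:, idx]                             # A_WLDA, Eq (80)

# projecting a feature vector x: y = A.T @ x, as in Eq (81)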
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA
transformed feature vector. In this study the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = (1/N_c) \sum_{n=1}^{N_c} y_{c,n}   (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = arg min_{1≤c≤C} d(y, \bar{y}_c)   (83)
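A small illustrative sketch (Python/NumPy, hypothetical names) of this nearest centroid rule:

import numpy as np

def genre_centroids(Y, labels):
    # Eq (82): one centroid per genre in the whitened LDA space
    return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(y, centroids):
    # Eq (83): choose the genre whose centroid is nearest in Euclidean distance
    return min(centroids, key=lambda c: np.linalg.norm(y - centroids[c]))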
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study each MP3 audio file is first converted into raw digital audio before
classification. These music tracks are classified into six classes (that is, C = 6):
Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop, and 122/122 tracks of the World music genre.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1≤c≤C} P_c · CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the
classification accuracy for the c-th music genre.
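For example, the overall accuracy can be computed from the per-genre accuracies and the test-set class sizes as in the following illustrative sketch (hypothetical names); with the diagonal of the count matrix in Table 32(d) below and the class sizes listed above, this weighting reproduces the 84.64% reported in Table 31 ((300+96+21+34+80+86)/729 = 617/729 ≈ 84.64%).

import numpy as np

def overall_accuracy(per_class_accuracy, class_counts):
    # Eq (84): CA = sum_c P_c * CA_c, with P_c the share of genre c in the test set
    counts = np.asarray(class_counts, dtype=float)
    return float(np.dot(counts / counts.sum(), per_class_accuracy))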
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and that the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector

Feature Set  CA (%)
SMMFCC1  77.50
SMOSC1  79.15
SMASE1  77.78
SMMFCC1+SMOSC1+SMASE1  84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) SMMFCC1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19
Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4
MetalPunk 2 3 0 36 20 4
PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75
Total 320 114 26 45 102 122

(a) SMMFCC1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.94 0.00 7.69 0.00 0.98 15.57
Electronic 0.00 79.82 0.00 2.22 6.86 4.92
Jazz 1.88 0.00 69.23 0.00 0.00 3.28
MetalPunk 0.63 2.63 0.00 80.00 19.61 3.28
PopRock 1.25 10.53 19.23 17.78 68.63 11.48
World 10.31 7.02 3.85 0.00 3.92 61.48

(b) SMOSC1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10
Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6
MetalPunk 0 5 0 32 21 3
PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84
Total 320 114 26 45 102 122

(b) SMOSC1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 91.25 0.88 3.85 0.00 1.96 8.20
Electronic 0.31 78.07 3.85 4.44 10.78 9.02
Jazz 1.25 0.00 73.08 2.22 0.98 4.92
MetalPunk 0.00 4.39 0.00 71.11 20.59 2.46
PopRock 0.00 11.40 11.54 22.22 59.80 6.56
World 7.19 5.26 7.69 0.00 5.88 68.85
(c) SMASE1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18
Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9
MetalPunk 0 4 1 36 18 4
PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73
Total 320 114 26 45 102 122

(c) SMASE1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 89.38 2.63 3.85 0.00 2.94 14.75
Electronic 0.00 76.32 3.85 2.22 8.82 4.10
Jazz 1.56 3.51 65.38 0.00 0.00 7.38
MetalPunk 0.00 3.51 3.85 80.00 17.65 3.28
PopRock 0.31 8.77 11.54 15.56 66.67 10.66
World 8.75 5.26 11.54 2.22 3.92 59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9
Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1
MetalPunk 0 1 0 34 8 1
PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86
Total 320 114 26 45 102 122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 3.85 0.00 0.00 7.38
Electronic 0.00 84.21 3.85 2.22 8.82 7.38
Jazz 0.63 0.88 80.77 0.00 0.00 0.82
MetalPunk 0.00 0.88 0.00 75.56 7.84 0.82
PopRock 0.31 7.89 7.69 20.00 78.43 13.11
World 5.31 6.14 3.85 2.22 4.90 70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which is different from the row-based case. As in the row-based case, the combined
feature vector again achieves the best performance. Table 34 shows the corresponding
confusion matrices.
Table 33 Averaged classification accuracy (CA) for each column-based modulation spectral feature vector

Feature Set  CA (%)
SMMFCC2  70.64
SMOSC2  68.59
SMASE2  71.74
SMMFCC2+SMOSC2+SMASE2  78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) SMMFCC2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22
Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19
MetalPunk 2 7 0 39 30 4
PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54
Total 320 114 26 45 102 122
(a) SMMFCC2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26

(b) SMOSC2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33
Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20
MetalPunk 1 5 0 33 21 2
PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51
Total 320 114 26 45 102 122

(b) SMOSC2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80

(c) SMASE2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122
(c) SMASE2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set  CA (%)
SMMFCC3  80.38
SMOSC3  81.34
SMASE3  81.21
SMMFCC3+SMOSC3+SMASE3  85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) SMMFCC3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57
(b) SMOSC3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) SMOSC3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) SMASE3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) SMASE3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39
(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined. In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
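To make the contrast between the two feature definitions explicit, here is a brief illustrative Python/NumPy sketch (hypothetical names): given the averaged modulation spectrum of one feature value and the modulation subband boundaries of Table 24, the conventional approach keeps the subband energy, while the proposed approach keeps the peak-valley contrast (MSC) and the valley (MSV).

import numpy as np

def subband_descriptors(mod_spectrum, bands):
    # mod_spectrum: averaged modulation spectrum of one feature value (cf. Eq 36)
    # bands: list of (low, high) modulation frequency bin ranges (Table 24)
    energy, msc, msv = [], [], []
    for lo, hi in bands:
        sub = mod_spectrum[lo:hi]
        energy.append(sub.sum())              # conventional: modulation subband energy
        peak, valley = sub.max(), sub.min()   # modulation spectral peak and valley
        msc.append(peak - valley)             # modulation spectral contrast (cf. Eq 39)
        msv.append(valley)
    return np.array(energy), np.array(msc), np.array(msv)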
Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the energy feature for each feature set

Feature Set  MSCs & MSVs  MSE
SMMFCC1  77.50  72.02
SMMFCC2  70.64  69.82
SMMFCC3  80.38  79.15
SMOSC1  79.15  77.50
SMOSC2  68.59  70.51
SMOSC3  81.34  80.11
SMASE1  77.78  76.41
SMASE2  71.74  71.06
SMASE3  81.21  79.15
SMMFCC1+SMOSC1+SMASE1  84.64  85.08
SMMFCC2+SMOSC2+SMASE2  78.60  79.01
SMMFCC3+SMOSC3+SMASE3  85.32  85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis, P Cook, Musical genre classification of audio signals, IEEE Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li, M Ogihara, Q Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf on Research and Development in Information Retrieval, 2003, pp 282-289
[3] D N Jiang, L Lu, H J Zhang, J H Tao, L H Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, Vol 1, 2002, pp 113-116
[4] K West, S Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of the International Conference on Music Information Retrieval, 2004
[5] K Umapathy, S Krishnan, S Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans on Multimedia 7 (2) (2005) 308-315
[6] M F McKinney, J Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp 151-158
[7] J J Aucouturier, F Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93
[8] U Bağci, E Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng, P Ahrendt, J Larsen, L K Hansen, Temporal feature integration for music genre classification, IEEE Trans on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664
[10] T Lidy, A Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp 34-41
[11] M Grimaldi, P Cunningham, A Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp 102-108
[12] J J Aucouturier, F Pachet, M Sandler, The way it sounds: timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia, Vol 7, Issue 6, pp 1028-1035, Dec 2005
[13] J Jose Burred, A Lerch, A hierarchical approach to automatic musical genre classification, in Proc of the 6th Int Conf on Digital Audio Effects, pp 8-11, September 2003
[14] J G A Barbedo, A Lopes, Research article: automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, Vol 2007, pp 1-12, June 2006
[15] T Li, M Ogihara, Music genre classification with taxonomy, in Proc of IEEE Int Conf on Acoustics, Speech and Signal Processing, Vol 5, pp 197-200, March 2005
[16] J J Aucouturier, F Pachet, Representing musical genre: a state of the art, Journal of New Music Research, Vol 32, No 1, pp 83-93, 2003
[17] H G Kim, N Moreau, T Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans on Circuits and Systems for Video Technology 14 (5) (2004) 716-725
[18] M E P Davies, M D Plumbley, Beat tracking with a two state model, in Proc Int Conf on Acoustics, Speech and Signal Processing (ICASSP), 2005
[19] W A Sethares, R D Robin, J C Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans on Speech and Audio Processing 13 (12) (2005) 275-285
[20] G Tzanetakis, A Ermolinskyi, P Cook, Pitch histogram in audio and symbolic music information retrieval, in Proc IRCAM, 2002
[21] T Tolonen, M Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing, Vol 8, No 6, pp 708-716, November 2000
[22] R Meddis, L O'Mard, A unitary model of pitch perception, Acoustical Society of America, Vol 102, No 3, pp 1811-1820, September 1997
[23] N Scaringella, G Zoia, D Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine, Vol 23, Issue 2, pp 133-141, Mar 2006
[24] B Kingsbury, N Morgan, S Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication, Vol 25, No 1, pp 117-132, 1998
[25] S Sukittanon, L E Atlas, J W Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing, Vol 52, No 10, pp 3023-3035, October 2004
[26] Y Y Shi, X Zhu, H G Kim, K W Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp 1085-1088, July 2006
[27] H G Kim, N Moreau, T Sikora, MPEG-7 Audio and Beyond: audio content indexing and retrieval, Wiley, 2005
[28] R Duda, P Hart, D Stork, Pattern Classification, New York: Wiley, 2000
[29] C Xu, N C Maddage, X Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing, Vol 13, No 3, pp 441-450, May 2005
[30] S Esmaili, S Krishnan, K Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol 5, pp V-665-8, May 2004
[31] K Umapathy, S Krishnan, R K Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech and Language Processing, Vol 15, Issue 4, pp 1236-1246, May 2007
[32] M Grimaldi, P Cunningham, A Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003
[33] J Bergstra, N Casagrande, D Erhan, D Eck, B Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund, R E Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based modulation
spectral feature vector. In this table, SMMFCC1, SMOSC1 and SMASE1 denote
respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices.
Table 31 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector

Feature Set                             CA (%)
SMMFCC1                                 77.50
SMOSC1                                  79.15
SMASE1                                  77.78
SMMFCC1+SMOSC1+SMASE1                   84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector. In this table, SMMFCC2, SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 33 we can see
that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2,
which differs from the row-based case. As with the row-based features, the combined
feature vector again performs the best. Table 34 shows the corresponding confusion
matrices.
Table 33 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector

Feature Set                             CA (%)
SMMFCC2                                 70.64
SMOSC2                                  68.59
SMASE2                                  71.74
SMMFCC2+SMOSC2+SMASE2                   78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that
the combined feature vector achieves a better classification performance than each
individual row-based or column-based feature vector. In particular, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                             CA (%)
SMMFCC3                                 80.38
SMOSC3                                  81.34
SMASE3                                  81.21
SMMFCC3+SMOSC3+SMASE3                   85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value. In contrast, we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature values. Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs yields
better performance than the conventional energy-based method when the row-based
and column-based modulation spectral feature vectors are combined. In this table,
SMMFCC1, SMMFCC2 and SMMFCC3 denote respectively the row-based,
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature value

Feature Set                             MSCs & MSVs    MSE
SMMFCC1                                 77.50          72.02
SMMFCC2                                 70.64          69.82
SMMFCC3                                 80.38          79.15
SMOSC1                                  79.15          77.50
SMOSC2                                  68.59          70.51
SMOSC3                                  81.34          80.11
SMASE1                                  77.78          76.41
SMASE2                                  71.74          71.06
SMASE3                                  81.21          79.15
SMMFCC1+SMOSC1+SMASE1                   84.64          85.08
SMMFCC2+SMOSC2+SMASE2                   78.60          79.01
SMMFCC3+SMOSC3+SMASE3                   85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes, was used for performance comparison. When the modulation spectral
features of MFCC, OSC and NASE are combined together, the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music
Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox Features and classifiers for the automatic classification of
musical audio signals Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch A hierarchical approach to automatic musical
genre classification in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara Music genre classification with taxonomy in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet Representing musical genre a state of the art
Journal of New Music Research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley Beat tracking with a two state model in
Proc Int Conf on Acoustics Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook Pitch histogram in audio and
symbolic music information retrieval in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen A computationally efficient multipitch analysis
model IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard A unitary model of pitch perception Journal of the
Acoustical Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg Robust speech recognition using
the modulation spectrogram Speech Communication Vol 25 No 1 pp 117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton Modulation-scale analysis for
content identification IEEE Transactions on Signal Processing Vol 52 No 10
pp 3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao Automatic music classification and
summarization IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao Audio signal feature extraction and
classification using local discriminant bases IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire A decision-theoretic generalization of on-line
learning and an application to boosting Journal of Computer and System
Sciences 55 (1) (1997) 119-139
The discrete wavelet transform (DWT) decomposes a signal into different scales both in
the time and frequency domains. Taking into account the non-stationary nature of the
input signal, the DWT provides an approximation with excellent time and frequency
resolution. The wavelet packet transform (WPT) is a variant of the DWT which is achieved
by recursively convolving the input signal with a pair of low-pass and high-pass filters.
Unlike the DWT, which recursively decomposes only the low-pass subband, the WPT
decomposes both subbands at each level.
Bergstra et al [33] used AdaBoost for music classification. AdaBoost is an
ensemble (or meta-learning) method that constructs a classifier in an iterative fashion
[34] It was originally designed for binary classification and was later extended to
multiclass classification using several different strategies
13 Outline of Thesis
In Chapter 2 the proposed method for music genre classification will be
introduced In Chapter 3 some experiments will be presented to show the
effectiveness of the proposed method Finally conclusion will be given in Chapter 4
Chapter 2
The Proposed Music Genre Classification Method
The proposed music genre classification system consists of two phases the
training phase and the classification phase The training phase is composed of two
main modules feature extraction and linear discriminant analysis (LDA) The
classification phase consists of three modules feature extraction LDA transformation
and classification The block diagram of the proposed music genre classification
system is the same as that shown in Fig 12. Each module is described in detail below.
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
ŝ[n] = s[n] − a · s[n−1]                                                  (1)

where s[n] is the current sample and s[n−1] is the previous sample; a typical value for a is 0.95.
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples). Each pair of consecutive frames is overlapped by M samples.
Step 3 Windowing
Each frame is multiplied by a Hamming window
s̃_i[n] = ŝ_i[n] · w[n],  0 ≤ n ≤ N−1                                     (2)

where the Hamming window function w[n] is defined as

w[n] = 0.54 − 0.46 · cos(2πn / (N−1)),  0 ≤ n ≤ N−1                       (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = Σ_{n=0}^{N−1} s̃_i[n] · e^{−j2πkn/N},  0 ≤ k ≤ N−1               (4)

where k is the frequency index.
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E_i(b) = Σ_{k=I_{b,l}}^{I_{b,h}} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1      (5)

where B is the total number of filters (B is 25 in this study), I_{b,l} and I_{b,h} denote
respectively the low-frequency index and high-frequency index of the b-th band-pass
filter, and A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|².
I_{b,l} and I_{b,h} are given as

I_{b,l} = f_{b,l} · (N / f_s),  I_{b,h} = f_{b,h} · (N / f_s)             (6)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21.
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC(l) = Σ_{b=0}^{B−1} log10(1 + E_i(b)) · cos(l(b + 0.5)π / B),  0 ≤ l < L      (7)

where L is the length of the MFCC feature vector (L is 20 in this study).
Therefore the MFCC feature vector can be represented as follows
x_MFCC = [MFCC(0), MFCC(1), …, MFCC(L−1)]^T                               (8)
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 200]
1                (100, 300]
2                (200, 400]
3                (300, 500]
4                (400, 600]
5                (500, 700]
6                (600, 800]
7                (700, 900]
8                (800, 1000]
9                (900, 1149]
10               (1000, 1320]
11               (1149, 1516]
12               (1320, 1741]
13               (1516, 2000]
14               (1741, 2297]
15               (2000, 2639]
16               (2297, 3031]
17               (2639, 3482]
18               (3031, 4000]
19               (3482, 4595]
20               (4000, 5278]
21               (4595, 6063]
22               (5278, 6964]
23               (6063, 8000]
24               (6964, 9190]
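A rough NumPy sketch of Steps 1–6 is given below. It follows the text (a = 0.95, frame size N, overlap M, B = 25 bands, L = 20 coefficients) but simplifies the Mel-scale filters to rectangular summation over the band edges of Table 21, as in Eq. (5); the function signature and the band-list format are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def mfcc(signal, bands, fs=44100, N=1024, M=512, L=20, a=0.95):
    # bands: list of (f_low, f_high) pairs in Hz, e.g. [(0, 200), (100, 300), ...]
    # Step 1: pre-emphasis
    s = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Steps 2-3: framing with hop (N - M) samples and Hamming windowing
    hop, window = N - M, np.hamming(N)
    frames = [s[i:i + N] * window for i in range(0, len(s) - N + 1, hop)]
    coeffs = []
    for frame in frames:
        A = np.abs(np.fft.fft(frame, N)) ** 2                 # Step 4: squared FFT magnitude
        # Step 5: band energies summed between the low/high indices of each filter (Eq. (5))
        E = np.array([A[int(lo * N / fs):int(hi * N / fs) + 1].sum() for lo, hi in bands])
        # Step 6: DCT of the log band energies (Eq. (7))
        logE = np.log10(1.0 + E)
        coeffs.append([sum(logE[b] * np.cos(l * (b + 0.5) * np.pi / len(bands))
                           for b in range(len(bands))) for l in range(L)])
    return np.array(coeffs)   # one L-dimensional MFCC vector per frame
```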
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames, and FFT is then applied to obtain the corresponding spectrum of each frame.
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E_i(b) = Σ_{k=I_{b,l}}^{I_{b,h}} A_i[k],  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1      (9)

where B is the number of subbands, I_{b,l} and I_{b,h} denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter, and
A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|².
I_{b,l} and I_{b,h} are given as

I_{b,l} = f_{b,l} · (N / f_s),  I_{b,h} = f_{b,h} · (N / f_s)             (10)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and high
frequency of the b-th band-pass filter.
Step 3 Peak Valley Selection
Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th
subband, where N_b is the number of FFT frequency bins in the b-th subband.
Without loss of generality, let the magnitude spectrum be sorted in a
decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and
spectral valley in the b-th subband are then estimated as follows:

Peak(b) = log( (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,i} )                      (11)

Valley(b) = log( (1 / (αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b−i+1} )              (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral
contrast is given by the difference between the spectral peak and the spectral
valley:

SC(b) = Peak(b) − Valley(b)                                               (13)

The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands. Thus the OSC feature vector of an audio frame can
be represented as follows:

x_OSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T                  (14)
Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number    Frequency interval (Hz)
0                [0, 0]
1                (0, 100]
2                (100, 200]
3                (200, 400]
4                (400, 800]
5                (800, 1600]
6                (1600, 3200]
7                (3200, 6400]
8                (6400, 12800]
9                (12800, 22050)
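A sketch of the per-frame OSC computation of Eqs. (9)–(14) with α = 0.2. The octave band edges follow Table 22 (the degenerate [0, 0] band is omitted), and the function and parameter names are illustrative assumptions.

```python
import numpy as np

def osc_frame(frame, fs=44100, alpha=0.2,
              bands=((0, 100), (100, 200), (200, 400), (400, 800), (800, 1600),
                     (1600, 3200), (3200, 6400), (6400, 12800), (12800, 22050))):
    N = len(frame)
    A = np.abs(np.fft.fft(frame * np.hamming(N))) ** 2
    valleys, contrasts = [], []
    for lo, hi in bands:
        k_lo, k_hi = int(lo * N / fs), int(hi * N / fs)
        mags = np.sort(A[k_lo:k_hi + 1])[::-1]        # magnitudes sorted in decreasing order
        nb = max(1, int(round(alpha * len(mags))))    # neighborhood of alpha * Nb bins
        peak = np.log(np.mean(mags[:nb]))             # Eq. (11)
        valley = np.log(np.mean(mags[-nb:]))          # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)               # Eq. (13)
    return np.array(valleys + contrasts)              # Eq. (14)
```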
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N,
where N is the size of FFT. The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k):

P(k) = (1 / (E_w · N)) · |X(k)|²,  k = 0 or k = N/2
P(k) = (2 / (E_w · N)) · |X(k)|²,  0 < k < N/2                            (15)

where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|²                                             (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of
8 octaves (see Fig 24). The NASE scale filtering operation can be described as
follows (see Table 23):

ASE_i(b) = Σ_{k=I_{b,l}}^{I_{b,h}} P_i(k),  0 ≤ b < B, 0 ≤ k ≤ N/2 − 1    (17)

where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r, and r is the spectral resolution of
the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in this study):

r = 2^j octaves,  −4 ≤ j ≤ 3                                              (18)

I_{b,l} and I_{b,h} are the low-frequency index and high-frequency index of the b-th
band-pass filter, given as

I_{b,l} = f_{b,l} · (N / f_s),  I_{b,h} = f_{b,h} · (N / f_s)             (19)

where f_s is the sampling frequency, and f_{b,l} and f_{b,h} are the low frequency and
high frequency of the b-th band-pass filter.
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_{b,l}}^{I_{b,h}} P(k),  0 ≤ b ≤ B + 1                     (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 · log10(ASE(b)),  0 ≤ b ≤ B + 1                            (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R,  0 ≤ b ≤ B + 1                                   (22)

where the RMS-norm gain value R is defined as

R = sqrt( Σ_{b=0}^{B+1} (ASE_dB(b))² )                                    (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge, a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge, a coefficient representing
power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension
of NASE is B+3. Thus the NASE feature vector of an audio frame can be
represented as follows:

x_NASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T                            (24)
Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: one coefficient below loEdge (62.5 Hz), 16 logarithmically spaced coefficients between loEdge and hiEdge (16 kHz), and one coefficient above hiEdge
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 62]
1                (62, 88]
2                (88, 125]
3                (125, 176]
4                (176, 250]
5                (250, 353]
6                (353, 500]
7                (500, 707]
8                (707, 1000]
9                (1000, 1414]
10               (1414, 2000]
11               (2000, 2828]
12               (2828, 4000]
13               (4000, 5656]
14               (5656, 8000]
15               (8000, 11313]
16               (11313, 16000]
17               (16000, 22050]
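A simplified per-frame NASE sketch following Eqs. (15)–(24); the subband edges are taken from Table 23, a small constant guards the logarithm, and the exact MPEG-7 edge handling is not reproduced, so this is illustrative only.

```python
import numpy as np

def nase_frame(frame, fs=44100,
               edges=(0, 62, 88, 125, 176, 250, 353, 500, 707, 1000, 1414, 2000,
                      2828, 4000, 5656, 8000, 11313, 16000, 22050)):
    N = len(frame)
    w = np.hamming(N)
    X = np.fft.fft(frame * w)
    # Eq. (15): one-sided normalized power spectrum
    P = np.abs(X[:N // 2 + 1]) ** 2 / (np.sum(w ** 2) * N)
    P[1:-1] *= 2.0
    freqs = np.arange(N // 2 + 1) * fs / N
    # Eq. (20): sum the power spectrum inside each subband (B + 2 coefficients)
    ase = np.array([P[(freqs > lo) & (freqs <= hi)].sum()
                    for lo, hi in zip(edges[:-1], edges[1:])])
    ase_db = 10.0 * np.log10(ase + 1e-12)            # Eq. (21)
    R = np.sqrt(np.sum(ase_db ** 2))                 # Eq. (23)
    return np.concatenate(([R], ase_db / R))         # Eqs. (22) and (24)
```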
214 Modulation Spectral Analysis
MFCC, OSC and NASE capture only the short-term frame-based characteristics of
audio signals. In order to capture the time-varying behavior of music signals, we
employ modulation spectral analysis on the MFCC, OSC and NASE trajectories to
observe their variations over time.
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W:

M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{t·(W/2)+n}[l] · e^{−j2πnm/W},  0 ≤ m < W, 0 ≤ l < L    (25)

where M_t(m, l) is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index and l is the MFCC coefficient index. In
this study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows. The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows:

M_MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|,  0 ≤ m < W, 0 ≤ l < L       (26)

where T is the total number of texture windows in the music track.
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP_MFCC(j, l) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M_MFCC(m, l)                 (27)

MSV_MFCC(j, l) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M_MFCC(m, l)                 (28)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands. Therefore the
difference between MSP and MSV reflects the modulation spectral
contrast distribution:

MSC_MFCC(j, l) = MSP_MFCC(j, l) − MSV_MFCC(j, l)                          (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
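The modulation spectral analysis of Eqs. (25)–(29) applies equally to MFCC, OSC and NASE trajectories; the sketch below assumes the short-term features are stacked in a frames-by-dimensions matrix, uses W = 512 with 50% overlap and the subband index ranges of Table 24, and all helper names are illustrative.

```python
import numpy as np

SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_msc_msv(features, W=512):
    # features: array of shape (num_frames, D) holding MFCC/OSC/NASE trajectories
    hop = W // 2                                      # 50% overlap between texture windows
    windows = [features[t:t + W] for t in range(0, len(features) - W + 1, hop)]
    # Eqs. (25)-(26): FFT along time for each feature dimension, magnitude-averaged
    M = np.mean([np.abs(np.fft.fft(win, axis=0)) for win in windows], axis=0)
    msc, msv = [], []
    for lo, hi in SUBBANDS:                           # Eqs. (27)-(29) per modulation subband
        band = M[lo:hi, :]
        msp, valley = band.max(axis=0), band.min(axis=0)
        msc.append(msp - valley)
        msv.append(valley)
    return np.array(msc), np.array(msv)               # each of shape (J, D)
```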
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} OSC_{t·(W/2)+n}[d] · e^{−j2πnm/W},  0 ≤ m < W, 0 ≤ d < D    (30)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index and d is the OSC coefficient index. In this
study W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows. The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows:

M_OSC(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W, 0 ≤ d < D        (31)

where T is the total number of texture windows in the music track.
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP_OSC(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M_OSC(m, d)                   (32)

MSV_OSC(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M_OSC(m, d)                   (33)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands. Therefore the
difference between MSP and MSV reflects the modulation spectral
contrast distribution:

MSC_OSC(j, d) = MSP_OSC(j, d) − MSV_OSC(j, d)                             (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} NASE_{t·(W/2)+n}[d] · e^{−j2πnm/W},  0 ≤ m < W, 0 ≤ d < D    (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index and d is the NASE coefficient index. In
this study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows. The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows:

M_NASE(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W, 0 ≤ d < D       (36)

where T is the total number of texture windows in the music track.
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24).
In this study the number of modulation subbands is 8 (J = 8). The frequency
interval of each modulation subband is shown in Table 24. For each feature
value, the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated:
MSP_NASE(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M_NASE(m, d)                 (37)

MSV_NASE(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M_NASE(m, d)                 (38)

where Φ_{j,l} and Φ_{j,h} are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and the MSVs to the
non-rhythmic components in the modulation subbands. Therefore the
difference between MSP and MSV reflects the modulation spectral
contrast distribution:

MSC_NASE(j, d) = MSP_NASE(j, d) − MSV_NASE(j, d)                          (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT of each feature trajectory within a texture window → averaged modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0, 2)                              [0, 0.33)
1                [2, 4)                              [0.33, 0.66)
2                [4, 8)                              [0.66, 1.32)
3                [8, 16)                             [1.32, 2.64)
4                [16, 32)                            [2.64, 5.28)
5                [32, 64)                            [5.28, 10.56)
6                [64, 128)                           [10.56, 21.12)
7                [128, 256)                          [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband across different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space, the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices are computed as the
feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows:

μ_MSC-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSC_MFCC(j, l)                    (40)

σ_MSC-row^MFCC(l) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC_MFCC(j, l) − μ_MSC-row^MFCC(l))² )    (41)

μ_MSV-row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSV_MFCC(j, l)                    (42)

σ_MSV-row^MFCC(l) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV_MFCC(j, l) − μ_MSV-row^MFCC(l))² )    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as

f_row^MFCC = [μ_MSC-row^MFCC(0), σ_MSC-row^MFCC(0), μ_MSV-row^MFCC(0), σ_MSV-row^MFCC(0), …, μ_MSC-row^MFCC(L−1), σ_MSC-row^MFCC(L−1), μ_MSV-row^MFCC(L−1), σ_MSV-row^MFCC(L−1)]^T    (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

μ_MSC-col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSC_MFCC(j, l)                    (45)

σ_MSC-col^MFCC(j) = sqrt( (1/L) Σ_{l=0}^{L−1} (MSC_MFCC(j, l) − μ_MSC-col^MFCC(j))² )    (46)

μ_MSV-col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSV_MFCC(j, l)                    (47)

σ_MSV-col^MFCC(j) = sqrt( (1/L) Σ_{l=0}^{L−1} (MSV_MFCC(j, l) − μ_MSV-col^MFCC(j))² )    (48)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_col^MFCC = [μ_MSC-col^MFCC(0), σ_MSC-col^MFCC(0), μ_MSV-col^MFCC(0), σ_MSV-col^MFCC(0), …, μ_MSC-col^MFCC(J−1), σ_MSC-col^MFCC(J−1), μ_MSV-col^MFCC(J−1), σ_MSV-col^MFCC(J−1)]^T    (49)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L + 4J) can be obtained:

f_MFCC = [(f_row^MFCC)^T, (f_col^MFCC)^T]^T                               (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4L + 4J. That is, the overall feature dimension of SMMFCC is
80 + 32 = 112.
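The row- and column-based aggregation of Eqs. (40)–(50) then reduces each J-by-D (or J-by-L) MSC/MSV matrix to means and standard deviations; a minimal sketch follows (the ordering of the concatenated values is illustrative, only the 4D + 4J size matters here).

```python
import numpy as np

def aggregate(msc, msv):
    # msc, msv: J x D matrices produced by the modulation spectral analysis
    row = np.concatenate([m.mean(axis=0) for m in (msc, msv)] +
                         [m.std(axis=0) for m in (msc, msv)])     # Eqs. (40)-(44), size 4D
    col = np.concatenate([m.mean(axis=1) for m in (msc, msv)] +
                         [m.std(axis=1) for m in (msc, msv)])     # Eqs. (45)-(49), size 4J
    return np.concatenate([row, col])                             # Eq. (50), size 4D + 4J
```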
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows:

μ_MSC-row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSC_OSC(j, d)                      (51)

σ_MSC-row^OSC(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC_OSC(j, d) − μ_MSC-row^OSC(d))² )    (52)

μ_MSV-row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSV_OSC(j, d)                      (53)

σ_MSV-row^OSC(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV_OSC(j, d) − μ_MSV-row^OSC(d))² )    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_row^OSC = [μ_MSC-row^OSC(0), σ_MSC-row^OSC(0), μ_MSV-row^OSC(0), σ_MSV-row^OSC(0), …, μ_MSC-row^OSC(D−1), σ_MSC-row^OSC(D−1), μ_MSV-row^OSC(D−1), σ_MSV-row^OSC(D−1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

μ_MSC-col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSC_OSC(j, d)                      (56)

σ_MSC-col^OSC(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSC_OSC(j, d) − μ_MSC-col^OSC(j))² )    (57)

μ_MSV-col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSV_OSC(j, d)                      (58)

σ_MSV-col^OSC(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSV_OSC(j, d) − μ_MSV-col^OSC(j))² )    (59)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_col^OSC = [μ_MSC-col^OSC(0), σ_MSC-col^OSC(0), μ_MSV-col^OSC(0), σ_MSV-col^OSC(0), …, μ_MSC-col^OSC(J−1), σ_MSC-col^OSC(J−1), μ_MSV-col^OSC(J−1), σ_MSV-col^OSC(J−1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D + 4J) can be obtained:

f_OSC = [(f_row^OSC)^T, (f_col^OSC)^T]^T                                  (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D + 4J. That is, the overall feature dimension of SMOSC is
80 + 32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows:
μ_MSC-row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSC_NASE(j, d)                    (62)

σ_MSC-row^NASE(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC_NASE(j, d) − μ_MSC-row^NASE(d))² )    (63)

μ_MSV-row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSV_NASE(j, d)                    (64)

σ_MSV-row^NASE(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV_NASE(j, d) − μ_MSV-row^NASE(d))² )    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_row^NASE = [μ_MSC-row^NASE(0), σ_MSC-row^NASE(0), μ_MSV-row^NASE(0), σ_MSV-row^NASE(0), …, μ_MSC-row^NASE(D−1), σ_MSC-row^NASE(D−1), μ_MSV-row^NASE(D−1), σ_MSV-row^NASE(D−1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

μ_MSC-col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSC_NASE(j, d)                    (67)

σ_MSC-col^NASE(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSC_NASE(j, d) − μ_MSC-col^NASE(j))² )    (68)

μ_MSV-col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSV_NASE(j, d)                    (69)

σ_MSV-col^NASE(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSV_NASE(j, d) − μ_MSV-col^NASE(j))² )    (70)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_col^NASE = [μ_MSC-col^NASE(0), σ_MSC-col^NASE(0), μ_MSV-col^NASE(0), σ_MSV-col^NASE(0), …, μ_MSC-col^NASE(J−1), σ_MSC-col^NASE(J−1), μ_MSV-col^NASE(J−1), σ_MSV-col^NASE(J−1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D + 4J) can be obtained:

f_NASE = [(f_row^NASE)^T, (f_col^NASE)^T]^T                               (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a feature
vector of length 4D + 4J. That is, the overall feature dimension of SMASE is
76 + 32 = 108.
Fig 28 The row-based modulation spectral feature values: the mean and standard deviation are computed along each row (fixed feature dimension, varying modulation frequency) of the MSC and MSV matrices

Fig 29 The column-based modulation spectral feature values: the mean and standard deviation are computed along each column (fixed modulation subband, varying feature dimension) of the MSC and MSV matrices
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}                                       (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c
is the number of training music signals belonging to the c-th music genre. Since the
dynamic ranges of different feature values may differ, a linear normalization is
applied to get the normalized feature vector f̂_c:

f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)),  1 ≤ c ≤ C          (74)

where C is the number of classes, f̄_c(m) denotes the m-th feature value of the c-th
representative feature vector, and f_max(m) and f_min(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1 ≤ c ≤ C, 1 ≤ j ≤ N_c} f_{c,j}(m)
f_min(m) = min_{1 ≤ c ≤ C, 1 ≤ j ≤ N_c} f_{c,j}(m)                         (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre.
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
21 Feature Extraction
A novel feature set derived from modulation spectral analysis of the spectral
(OSC and NASE) as well as cepstral (MFCC) feature trajectories of music signals is
proposed for music genre classification
211 Mel-Frequency Cepstral Coefficients (MFCC)
MFCC have been widely used for speech recognition due to their ability to
represent the speech spectrum in a compact form In fact MFCC have been proven to
be very effective in automatic speech recognition and in modeling the subjective
frequency content of audio signals Fig 21 is a flowchart for extracting MFCC from
an input signal The detailed steps will be given below
Step 1 Pre-emphasis
ŝ[n] = s[n] − a · s[n−1]    (1)
where s[n] is the current sample and s[n−1] is the previous sample; a typical value for a is 0.95.
Step 2 Framing
Each music signal is divided into a set of overlapped frames (frame size = N
samples) Each pair of consecutive frames is overlapped M samples
Step 3 Windowing
Each frame is multiplied by a Hamming window
s̃_i[n] = ŝ_i[n] · w[n],  0 ≤ n ≤ N−1    (2)
where the Hamming window function w[n] is defined as
w[n] = 0.54 − 0.46 · cos(2πn / (N−1)),  0 ≤ n ≤ N−1    (3)
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = Σ_{n=0}^{N−1} s̃_i[n] · e^{−j2πnk/N},  0 ≤ k ≤ N−1    (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E_i(b) = Σ_{k=I_{b,l}}^{I_{b,h}} A_i[k],  0 ≤ b < B,  0 ≤ k ≤ N/2 − 1    (5)
where B is the total number of filters (B is 25 in this study), and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|².
I_{b,l} and I_{b,h} are given as
I_{b,l} = (f_{b,l} / f_s) · N,  I_{b,h} = (f_{b,h} / f_s) · N    (6)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC_i(l) = Σ_{b=0}^{B−1} log10(1 + E(b)) · cos((b + 0.5) · lπ / B),  0 ≤ l < L    (7)
where L is the length of MFCC feature vector (L is 20 in the study)
Therefore the MFCC feature vector can be represented as follows
x_MFCC = [MFCC(0), MFCC(1), …, MFCC(L−1)]^T    (8)
Fig 21 The flowchart for computing MFCC (input signal → pre-emphasis → framing → windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter
Filter number  Frequency interval (Hz)
0   (0, 200]
1   (100, 300]
2   (200, 400]
3   (300, 500]
4   (400, 600]
5   (500, 700]
6   (600, 800]
7   (700, 900]
8   (800, 1000]
9   (900, 1149]
10  (1000, 1320]
11  (1149, 1516]
12  (1320, 1741]
13  (1516, 2000]
14  (1741, 2297]
15  (2000, 2639]
16  (2297, 3031]
17  (2639, 3482]
18  (3031, 4000]
19  (3482, 4595]
20  (4000, 5278]
21  (4595, 6063]
22  (5278, 6964]
23  (6063, 8000]
24  (6964, 9190]
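As a concrete illustration of Steps 1-6, the following is a minimal Python/NumPy sketch of the MFCC computation, assuming the signal is a mono NumPy array and the band edges of Table 21 are supplied as (low, high) pairs in Hz. The frame length of 1024 samples and hop of 512 samples are illustrative assumptions rather than values fixed by this thesis.

import numpy as np

def extract_mfcc(signal, fs, band_edges_hz, frame_len=1024, hop=512, a=0.95, L=20):
    s = np.append(signal[0], signal[1:] - a * signal[:-1])          # Step 1: pre-emphasis, eq (1)
    n = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))       # Hamming window, eq (3)
    B = len(band_edges_hz)
    mfccs = []
    for start in range(0, len(s) - frame_len + 1, hop):             # Step 2: framing
        frame = s[start:start + frame_len] * w                      # Step 3: windowing, eq (2)
        A = np.abs(np.fft.fft(frame)) ** 2                          # Step 4: squared FFT magnitude, eq (4)
        E = np.zeros(B)
        for b, (fl, fh) in enumerate(band_edges_hz):                # Step 5: subband energies, eqs (5)-(6)
            il, ih = int(fl * frame_len / fs), int(fh * frame_len / fs)
            E[b] = A[il:ih + 1].sum()
        logE = np.log10(1.0 + E)
        l = np.arange(L)[:, None]
        b = np.arange(B)[None, :]
        c = (logE[None, :] * np.cos((b + 0.5) * l * np.pi / B)).sum(axis=1)   # Step 6: DCT, eq (7)
        mfccs.append(c)
    return np.array(mfccs)                                          # shape (num_frames, L), eq (8)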
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then applied to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E_i(b) = Σ_{k=I_{b,l}}^{I_{b,h}} A_i[k],  0 ≤ b < B,  0 ≤ k ≤ N/2 − 1    (9)
where B is the number of subbands, and I_{b,l} and I_{b,h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter. A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|².
I_{b,l} and I_{b,h} are given as
I_{b,l} = (f_{b,l} / f_s) · N,  I_{b,h} = (f_{b,h} / f_s) · N    (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (M_{b,1}, M_{b,2}, …, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} ≥ M_{b,2} ≥ … ≥ M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows
Peak(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,i} )    (11)
Valley(b) = log( (1/(αN_b)) Σ_{i=1}^{αN_b} M_{b,N_b−i+1} )    (12)
where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley
SC(b) = Peak(b) − Valley(b)    (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
x_OSC = [Valley(0), …, Valley(B−1), SC(0), …, SC(B−1)]^T    (14)
Fig 22 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)
Filter number  Frequency interval (Hz)
0  [0, 0]
1  (0, 100]
2  (100, 200]
3  (200, 400]
4  (400, 800]
5  (800, 1600]
6  (1600, 3200]
7  (3200, 6400]
8  (6400, 12800]
9  (12800, 22050)
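A minimal sketch of the OSC computation for one analysis frame is given below, assuming the octave-scale band edges of Table 22 are supplied as (low, high) pairs in Hz; the small floor inside the logarithm and the rounding of αN_b are assumptions added to keep the code robust, not details specified by the thesis.

import numpy as np

def osc_per_frame(frame, fs, band_edges_hz, alpha=0.2):
    N = len(frame)
    X = np.abs(np.fft.fft(frame))                     # magnitude spectrum of the frame
    valleys, contrasts = [], []
    for fl, fh in band_edges_hz:                      # octave-scale subbands (Table 22)
        il, ih = int(fl * N / fs), int(fh * N / fs)
        M = np.sort(X[il:ih + 1])[::-1]               # magnitudes sorted in decreasing order
        nb = max(1, int(round(alpha * len(M))))       # size of the alpha-neighborhood
        peak = np.log(M[:nb].mean() + 1e-12)          # eq (11)
        valley = np.log(M[-nb:].mean() + 1e-12)       # eq (12)
        valleys.append(valley)
        contrasts.append(peak - valley)               # eq (13)
    return np.array(valleys + contrasts)              # eq (14): [Valley(0..B-1), SC(0..B-1)]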
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follows
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N, where N is the size of FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k)
P(k) = (1 / (N·E_w)) · |X(k)|²,  k = 0
P(k) = (2 / (N·E_w)) · |X(k)|²,  0 < k < N/2    (15)
where E_w is the energy of the Hamming window function w(n) of size N_w
E_w = Σ_{n=0}^{N_w−1} |w(n)|²    (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8-octave interval (see Fig 24). The NASE scale filtering operation can be described as follows (see Table 23)
ASE_i(b) = Σ_{k=I_{b,l}}^{I_{b,h}} P_i(k),  0 ≤ b < B,  0 ≤ k ≤ N/2 − 1    (17)
where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16, r = 1/2 in the study)
r = 2^j octaves,  −4 ≤ j ≤ 3    (18)
I_{b,l} and I_{b,h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as
I_{b,l} = (f_{b,l} / f_s) · N,  I_{b,h} = (f_{b,h} / f_s) · N    (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power spectrum coefficients within this subband
ASE(b) = Σ_{k=I_{b,l}}^{I_{b,h}} P(k),  0 ≤ b ≤ B+1    (20)
Each ASE coefficient is then converted to the decibel scale
ASE_dB(b) = 10 · log10(ASE(b)),  0 ≤ b ≤ B+1    (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R
NASE(b) = ASE_dB(b) / R,  0 ≤ b ≤ B+1    (22)
where the RMS-norm gain value R is defined as
R = sqrt( Σ_{b=0}^{B+1} (ASE_dB(b))² )    (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
x_NASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T    (24)
Fig 23 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (subband edges at 62.5, 125, 250, 500, 1K, 2K, 4K, 8K and 16K Hz; one coefficient below loEdge, 16 coefficients between loEdge and hiEdge, and one coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter
Filter number  Frequency interval (Hz)
0  (0, 62]
1  (62, 88]
2  (88, 125]
3  (125, 176]
4  (176, 250]
5  (250, 353]
6  (353, 500]
7  (500, 707]
8  (707, 1000]
9  (1000, 1414]
10  (1414, 2000]
11  (2000, 2828]
12  (2828, 4000]
13  (4000, 5656]
14  (5656, 8000]
15  (8000, 11313]
16  (11313, 16000]
17  (16000, 22050]
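The NASE computation of eqs (15)-(24) for a single frame can be sketched as follows, assuming the band edges of Table 23 are supplied as (low, high) pairs in Hz; the small floor added before the logarithm is an assumption to avoid log(0).

import numpy as np

def nase_per_frame(frame, fs, band_edges_hz):
    N = len(frame)
    w = np.hamming(N)
    X = np.fft.fft(frame * w)
    Ew = np.sum(w ** 2)                                             # eq (16)
    P = (np.abs(X[:N // 2]) ** 2) / (N * Ew)                        # eq (15), k = 0 term
    P[1:] *= 2.0                                                    # eq (15), 0 < k < N/2
    ase = np.array([P[int(fl * N / fs): int(fh * N / fs) + 1].sum()
                    for fl, fh in band_edges_hz])                   # eq (20)
    ase_db = 10.0 * np.log10(ase + 1e-12)                           # eq (21)
    R = np.sqrt(np.sum(ase_db ** 2))                                # eq (23)
    return np.concatenate(([R], ase_db / R))                        # eqs (22), (24): [R, NASE(0..B+1)]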
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals. In order to capture the time-varying behavior of the music signals, we employ modulation spectral analysis on MFCC, OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W
M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{t·W/2+n}[l] · e^{−j2πmn/W},  0 ≤ m < W,  0 ≤ l < L    (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study, W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|,  0 ≤ m < W,  0 ≤ l < L    (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^MFCC(j, l) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M^MFCC(m, l)    (27)
MSV^MFCC(j, l) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M^MFCC(m, l)    (28)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^MFCC(j, l) = MSP^MFCC(j, l) − MSV^MFCC(j, l)    (29)
As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
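The modulation spectral analysis of Steps 2 and 3 is the same for MFCC, OSC and NASE, so a single generic sketch is given here: it takes the per-frame feature matrix of a whole track and returns the MSC and MSV matrices. The exact placement of texture windows (hop of W/2 and discarding an incomplete final window) is an assumption.

import numpy as np

# modulation subband boundaries in modulation frequency index (Table 24)
SUBBAND_EDGES = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_msc_msv(features, W=512, subband_edges=SUBBAND_EDGES):
    # features: (num_frames, D) matrix of per-frame feature values of one track
    hop = W // 2                                        # 50% overlap between texture windows
    T = (features.shape[0] - W) // hop + 1
    avg = np.zeros((W, features.shape[1]))
    for t in range(T):                                  # eqs (25)-(26): averaged magnitude modulation spectrum
        win = features[t * hop: t * hop + W, :]
        avg += np.abs(np.fft.fft(win, axis=0))
    avg /= T
    msc, msv = [], []
    for lo, hi in subband_edges:                        # eqs (27)-(29) per modulation subband
        band = avg[lo:hi, :]
        msp = band.max(axis=0)                          # modulation spectral peak
        valley = band.min(axis=0)                       # modulation spectral valley
        msv.append(valley)
        msc.append(msp - valley)                        # modulation spectral contrast
    return np.array(msc), np.array(msv)                 # each of shape (J, D)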
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W
M_t(m, d) = Σ_{n=0}^{W−1} OSC_{t·W/2+n}[d] · e^{−j2πmn/W},  0 ≤ m < W,  0 ≤ d < D    (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study, W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^OSC(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W,  0 ≤ d < D    (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^OSC(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M^OSC(m, d)    (32)
MSV^OSC(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M^OSC(m, d)    (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d)    (34)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W
M_t(m, d) = Σ_{n=0}^{W−1} NASE_{t·W/2+n}[d] · e^{−j2πmn/W},  0 ≤ m < W,  0 ≤ d < D    (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study, W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^NASE(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,  0 ≤ m < W,  0 ≤ d < D    (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24).
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
MSP^NASE(j, d) = max_{Φ_{j,l} ≤ m < Φ_{j,h}} M^NASE(m, d)    (37)
MSV^NASE(j, d) = min_{Φ_{j,l} ≤ m < Φ_{j,h}} M^NASE(m, d)    (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d)    (39)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT of each feature trajectory within a texture window → averaged modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number  Modulation frequency index range  Modulation frequency interval (Hz)
0  [0, 2)     [0, 0.33)
1  [2, 4)     [0.33, 0.66)
2  [4, 8)     [0.66, 1.32)
3  [8, 16)    [1.32, 2.64)
4  [16, 32)   [2.64, 5.28)
5  [32, 64)   [5.28, 10.56)
6  [64, 128)  [10.56, 21.12)
7  [128, 256) [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband of different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows
μ^MFCC_MSC-row(l) = (1/J) Σ_{j=0}^{J−1} MSC^MFCC(j, l)    (40)
σ^MFCC_MSC-row(l) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC^MFCC(j, l) − μ^MFCC_MSC-row(l))² )    (41)
μ^MFCC_MSV-row(l) = (1/J) Σ_{j=0}^{J−1} MSV^MFCC(j, l)    (42)
σ^MFCC_MSV-row(l) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV^MFCC(j, l) − μ^MFCC_MSV-row(l))² )    (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as
f^MFCC_row = [μ^MFCC_MSC-row(0), σ^MFCC_MSC-row(0), μ^MFCC_MSV-row(0), σ^MFCC_MSV-row(0), …, μ^MFCC_MSC-row(L−1), σ^MFCC_MSC-row(L−1), μ^MFCC_MSV-row(L−1), σ^MFCC_MSV-row(L−1)]^T    (44)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows
μ^MFCC_MSC-col(j) = (1/L) Σ_{l=0}^{L−1} MSC^MFCC(j, l)    (45)
σ^MFCC_MSC-col(j) = sqrt( (1/L) Σ_{l=0}^{L−1} (MSC^MFCC(j, l) − μ^MFCC_MSC-col(j))² )    (46)
μ^MFCC_MSV-col(j) = (1/L) Σ_{l=0}^{L−1} MSV^MFCC(j, l)    (47)
σ^MFCC_MSV-col(j) = sqrt( (1/L) Σ_{l=0}^{L−1} (MSV^MFCC(j, l) − μ^MFCC_MSV-col(j))² )    (48)
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as
f^MFCC_col = [μ^MFCC_MSC-col(0), σ^MFCC_MSC-col(0), μ^MFCC_MSV-col(0), σ^MFCC_MSV-col(0), …, μ^MFCC_MSC-col(J−1), σ^MFCC_MSC-col(J−1), μ^MFCC_MSV-col(J−1), σ^MFCC_MSV-col(J−1)]^T    (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4L+4J) can be obtained
f^MFCC = [(f^MFCC_row)^T, (f^MFCC_col)^T]^T    (50)
In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
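A short sketch of the statistical aggregation of eqs (40)-(50), applied to the (J × D) MSC and MSV matrices produced by the modulation spectral analysis above; the ordering of the values inside the returned vector is an assumption and does not affect the classifier.

import numpy as np

def aggregate(msc, msv):
    # msc, msv: (J, D) matrices for one feature set (MFCC, OSC or NASE)
    # row-based statistics, eqs (40)-(43): mean/std over the J modulation subbands -> 4D values
    row = np.concatenate([msc.mean(axis=0), msc.std(axis=0),
                          msv.mean(axis=0), msv.std(axis=0)])
    # column-based statistics, eqs (45)-(48): mean/std over the D feature dimensions -> 4J values
    col = np.concatenate([msc.mean(axis=1), msc.std(axis=1),
                          msv.mean(axis=1), msv.std(axis=1)])
    return np.concatenate([row, col])                  # eqs (44), (49), (50): length 4D + 4J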
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows
μ^OSC_MSC-row(d) = (1/J) Σ_{j=0}^{J−1} MSC^OSC(j, d)    (51)
σ^OSC_MSC-row(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC^OSC(j, d) − μ^OSC_MSC-row(d))² )    (52)
μ^OSC_MSV-row(d) = (1/J) Σ_{j=0}^{J−1} MSV^OSC(j, d)    (53)
σ^OSC_MSV-row(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV^OSC(j, d) − μ^OSC_MSV-row(d))² )    (54)
Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as
f^OSC_row = [μ^OSC_MSC-row(0), σ^OSC_MSC-row(0), μ^OSC_MSV-row(0), σ^OSC_MSV-row(0), …, μ^OSC_MSC-row(D−1), σ^OSC_MSC-row(D−1), μ^OSC_MSV-row(D−1), σ^OSC_MSV-row(D−1)]^T    (55)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows
μ^OSC_MSC-col(j) = (1/D) Σ_{d=0}^{D−1} MSC^OSC(j, d)    (56)
σ^OSC_MSC-col(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSC^OSC(j, d) − μ^OSC_MSC-col(j))² )    (57)
μ^OSC_MSV-col(j) = (1/D) Σ_{d=0}^{D−1} MSV^OSC(j, d)    (58)
σ^OSC_MSV-col(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSV^OSC(j, d) − μ^OSC_MSV-col(j))² )    (59)
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as
f^OSC_col = [μ^OSC_MSC-col(0), σ^OSC_MSC-col(0), μ^OSC_MSV-col(0), σ^OSC_MSV-col(0), …, μ^OSC_MSC-col(J−1), σ^OSC_MSC-col(J−1), μ^OSC_MSV-col(J−1), σ^OSC_MSV-col(J−1)]^T    (60)
If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained
f^OSC = [(f^OSC_row)^T, (f^OSC_col)^T]^T    (61)
In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows
μ^NASE_MSC-row(d) = (1/J) Σ_{j=0}^{J−1} MSC^NASE(j, d)    (62)
σ^NASE_MSC-row(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSC^NASE(j, d) − μ^NASE_MSC-row(d))² )    (63)
μ^NASE_MSV-row(d) = (1/J) Σ_{j=0}^{J−1} MSV^NASE(j, d)    (64)
σ^NASE_MSV-row(d) = sqrt( (1/J) Σ_{j=0}^{J−1} (MSV^NASE(j, d) − μ^NASE_MSV-row(d))² )    (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as
f^NASE_row = [μ^NASE_MSC-row(0), σ^NASE_MSC-row(0), μ^NASE_MSV-row(0), σ^NASE_MSV-row(0), …, μ^NASE_MSC-row(D−1), σ^NASE_MSC-row(D−1), μ^NASE_MSV-row(D−1), σ^NASE_MSV-row(D−1)]^T    (66)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows
μ^NASE_MSC-col(j) = (1/D) Σ_{d=0}^{D−1} MSC^NASE(j, d)    (67)
σ^NASE_MSC-col(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSC^NASE(j, d) − μ^NASE_MSC-col(j))² )    (68)
μ^NASE_MSV-col(j) = (1/D) Σ_{d=0}^{D−1} MSV^NASE(j, d)    (69)
σ^NASE_MSV-col(j) = sqrt( (1/D) Σ_{d=0}^{D−1} (MSV^NASE(j, d) − μ^NASE_MSV-col(j))² )    (70)
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as
f^NASE_col = [μ^NASE_MSC-col(0), σ^NASE_MSC-col(0), μ^NASE_MSV-col(0), σ^NASE_MSV-col(0), …, μ^NASE_MSC-col(J−1), σ^NASE_MSC-col(J−1), μ^NASE_MSV-col(J−1), σ^NASE_MSV-col(J−1)]^T    (71)
If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained
f^NASE = [(f^NASE_row)^T, (f^NASE_col)^T]^T    (72)
In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based modulation spectral feature values: the mean and standard deviation are computed along each row (feature dimension) of the MSC and MSV matrices, i.e., across the modulation frequency subbands
Fig 29 The column-based modulation spectral feature values: the mean and standard deviation are computed along each column (modulation subband) of the MSC and MSV matrices, i.e., across the feature dimensions
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
f̄_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}    (73)
where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f̄_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector f̂_c
f̂_c(m) = (f̄_c(m) − f_min(m)) / (f_max(m) − f_min(m)),  1 ≤ c ≤ C    (74)
where C is the number of classes, f̄_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_max(m) and f_min(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals
f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m),  f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)    (75)
where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − x̄_c)(x_{c,n} − x̄_c)^T    (76)
where x_{c,n} is the n-th feature vector labeled as class c, x̄_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by
S_B = Σ_{c=1}^{C} N_c (x̄_c − x̄)(x̄_c − x̄)^T    (77)
where x̄ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter
J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) )    (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ^{−1/2}
x_w = (ΦΛ^{−1/2})^T x    (79)
It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}), derived from all the whitened training vectors, will become an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as
A_WLDA = ΦΛ^{−1/2} Ψ    (80)
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
y = A_WLDA^T x    (81)
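The whitening-plus-LDA procedure of eqs (76)-(81) can be sketched as follows; the small regularization added before inverting the eigenvalues is an assumption for numerical stability, not part of the formulation above.

import numpy as np

def whitened_lda(X, labels):
    # X: (num_samples, H) matrix of normalized training feature vectors; labels: integer class ids
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw, Sb = np.zeros((H, H)), np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                              # eq (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)     # eq (77)
    evals, Phi = np.linalg.eigh(Sw)                                # S_W Phi = Phi Lambda
    Wm = Phi @ np.diag(1.0 / np.sqrt(evals + 1e-12))               # whitening matrix Phi Lambda^(-1/2)
    Sb_w = Wm.T @ Sb @ Wm                                          # whitened between-class scatter
    evals_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals_b)[::-1][:len(classes) - 1]           # keep the (C-1) largest eigenvalues
    return Wm @ Psi[:, order]                                      # eq (80): A_WLDA; project via y = A.T x, eq (81)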
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA transformed feature vector. In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
ȳ_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}    (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, ȳ_c is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = argmin_{1≤c≤C} d(y, ȳ_c)    (83)
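A short sketch of the nearest-centroid decision of eqs (82)-(83), assuming the class centroids have already been computed from the whitened-LDA-transformed training vectors.

import numpy as np

def class_centroids(Y, labels):
    # Y: (num_training_tracks, h) transformed training vectors; labels: integer class ids
    return np.array([Y[labels == c].mean(axis=0) for c in np.unique(labels)])   # eq (82)

def classify(y, centroids):
    d = np.linalg.norm(centroids - y, axis=1)        # Euclidean distance to each genre centroid
    return int(np.argmin(d))                         # eq (83): index of the identified genre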
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of JazzBlue, 45/45 tracks of MetalPunk, 101/102 tracks of RockPop and 122/122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = Σ_{1≤c≤C} P_c · CA_c    (84)
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
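Because the classes are unbalanced, eq (84) weights each per-class accuracy by the class prior. A short sketch, using the test-set distribution given above:

import numpy as np

def overall_accuracy(per_class_accuracy, class_counts):
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)   # P_c, probability of appearance of genre c
    return float(np.sum(p * np.asarray(per_class_accuracy)))           # eq (84)

# For example, with the test-set sizes above and the per-class accuracies of the best
# combined feature set (Table 36(d)), this gives roughly 0.8532 (85.32%):
# overall_accuracy([0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623],
#                  [320, 114, 26, 45, 102, 122])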
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1 and the combined feature vector performs the best Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors
Feature Set  CA (%)
SMMFCC1  77.50
SMOSC1  79.15
SMASE1  77.78
SMMFCC1+SMOSC1+SMASE1  84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which is different from the row-based case As with the row-based features, the combined feature vector again achieves the best performance Table 34 shows the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors
Feature Set  CA (%)
SMMFCC2  70.64
SMOSC2  68.59
SMASE2  71.74
SMMFCC2+SMOSC2+SMASE2  78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets better classification performance than each
individual row-based or column-based feature vector Especially the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32% Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set  CA (%)
SMMFCC3  80.38
SMOSC3  81.34
SMASE3  81.21
SMMFCC3+SMOSC3+SMASE3  85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature value
Feature Set  MSCs & MSVs  MSE
SMMFCC1  77.50  72.02
SMMFCC2  70.64  69.82
SMMFCC3  80.38  79.15
SMOSC1  79.15  77.50
SMOSC2  68.59  70.51
SMOSC3  81.34  80.11
SMASE1  77.78  76.41
SMASE2  71.74  71.06
SMASE3  81.21  79.15
SMMFCC1+SMOSC1+SMASE1  84.64  85.08
SMMFCC2+SMOSC2+SMASE2  78.60  79.01
SMMFCC3+SMOSC3+SMASE3  85.32  85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectral/cepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New York: Wiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
Step 4 Spectral Analysis
Take the discrete Fourier transform of each frame using FFT
X_i[k] = \sum_{n=0}^{N-1} \tilde{s}_i[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1   (4)
where k is the frequency index
Step 5 Mel-scale Band-Pass Filtering
The spectrum is then decomposed into a number of subbands by using a set
of Mel-scale band-pass filters
E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N/2 - 1   (5)
where B is the total number of filters (B is 25 in the study), I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter, and A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2.
I_{b_l} and I_{b_h} are given as

I_{b_l} = \frac{f_{b_l}}{f_s / N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s / N}   (6)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter as shown in Table 21
Step 6 Discrete cosine transform (DCT)
MFCC can be obtained by applying DCT on the logarithm of E(b)
MFCC_i(l) = \sum_{b=0}^{B-1} \log_{10}\big(E_i(b) + 1\big)\, \cos\!\Big(\frac{\pi l}{B}\,(b + 0.5)\Big), \quad 0 \le l < L   (7)
where L is the length of MFCC feature vector (L is 20 in the study)
Therefore the MFCC feature vector can be represented as follows
x^{MFCC} = [MFCC(0), MFCC(1), \ldots, MFCC(L-1)]^T   (8)
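To make Steps 4-6 concrete, a minimal Python/NumPy sketch of the per-frame computation is given below. It is only an illustration of Eqs. (4)-(8), not the exact implementation used in this study: the band edges f_lo/f_hi are assumed to come from Table 21, the band energies are summed rectangularly as Eq. (5) literally states, and pre-emphasis, framing and windowing are assumed to have been applied already.

```python
import numpy as np

def mfcc_frame(frame, f_lo, f_hi, fs, L=20):
    """Sketch of Eqs. (4)-(8): FFT -> mel-band energies -> log10 -> DCT."""
    N = len(frame)
    X = np.fft.fft(frame, N)                      # Eq. (4): spectrum of the frame
    A = np.abs(X) ** 2                            # squared amplitude A[k] = |X[k]|^2
    B = len(f_lo)                                 # number of mel bands (25 in the study)
    E = np.zeros(B)
    for b in range(B):
        k_lo = int(f_lo[b] / (fs / N))            # Eq. (6): low-frequency index
        k_hi = int(f_hi[b] / (fs / N))            # Eq. (6): high-frequency index
        E[b] = A[k_lo:k_hi + 1].sum()             # Eq. (5): band energy
    l = np.arange(L)[:, None]
    b = np.arange(B)[None, :]
    basis = np.cos(np.pi * l / B * (b + 0.5))     # Eq. (7): cosine (DCT) basis
    return basis @ np.log10(E + 1.0)              # MFCC(0), ..., MFCC(L-1)
```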
Fig 21 The flowchart for computing MFCC (Input Signal → Pre-emphasis → Framing → Windowing → FFT → Mel-scale band-pass filtering → DCT → MFCC)
Table 21 The range of each triangular band-pass filter
Filter number   Frequency interval (Hz)
0    (0, 200]
1    (100, 300]
2    (200, 400]
3    (300, 500]
4    (400, 600]
5    (500, 700]
6    (600, 800]
7    (700, 900]
8    (800, 1000]
9    (900, 1149]
10   (1000, 1320]
11   (1149, 1516]
12   (1320, 1741]
13   (1516, 2000]
14   (1741, 2297]
15   (2000, 2639]
16   (2297, 3031]
17   (2639, 3482]
18   (3031, 4000]
19   (3482, 4595]
20   (4000, 5278]
21   (4595, 6063]
22   (5278, 6964]
23   (6063, 8000]
24   (6964, 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames, and FFT is then applied to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
E_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} A_i[k], \quad 0 \le b < B,\ 0 \le k \le N/2 - 1   (9)
where B is the number of subbands, I_{b_l} and I_{b_h} denote respectively the low-frequency index and high-frequency index of the b-th band-pass filter, and A_i[k] is the squared amplitude of X_i[k], that is, A_i[k] = |X_i[k]|^2. I_{b_l} and I_{b_h} are given as

I_{b_l} = \frac{f_{b_l}}{f_s / N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s / N}   (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak/Valley Selection
Let (M_{b,1}, M_{b,2}, \ldots, M_{b,N_b}) denote the magnitude spectrum within the b-th subband, where N_b is the number of FFT frequency bins in the b-th subband. Without loss of generality, let the magnitude spectrum be sorted in decreasing order, that is, M_{b,1} \ge M_{b,2} \ge \ldots \ge M_{b,N_b}. The spectral peak and spectral valley in the b-th subband are then estimated as follows:
Peak(b) = \log\!\Big(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i}\Big)   (11)

Valley(b) = \log\!\Big(\frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,N_b - i + 1}\Big)   (12)

where α is a neighborhood factor (α is 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)   (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
x^{OSC} = [Valley(0), \ldots, Valley(B-1), SC(0), \ldots, SC(B-1)]^T   (14)
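A minimal sketch of the per-frame OSC computation (Eqs. (9)-(14)) is given below for illustration; band_edges is an assumed list of (low, high) frequencies taken from Table 22, and the neighborhood factor follows the 0.2 used in the study. The small epsilon guarding the logarithm is an added safeguard, not part of the original formulation.

```python
import numpy as np

def osc_frame(frame, band_edges, fs, alpha=0.2):
    """Sketch of Eqs. (9)-(14): per-subband spectral peak, valley and contrast."""
    N = len(frame)
    mag = np.abs(np.fft.fft(frame, N))
    valleys, contrasts = [], []
    for f_lo, f_hi in band_edges:                  # octave-scale bands (Table 22)
        k_lo = int(f_lo / (fs / N))
        k_hi = max(int(f_hi / (fs / N)), k_lo + 1)
        M = np.sort(mag[k_lo:k_hi])[::-1]          # magnitudes sorted in decreasing order
        nb = max(1, int(round(alpha * len(M))))    # alpha * Nb strongest / weakest bins
        peak = np.log(M[:nb].mean() + 1e-12)       # Eq. (11)
        valley = np.log(M[-nb:].mean() + 1e-12)    # Eq. (12)
        valleys.append(valley)
        contrasts.append(peak - valley)            # Eq. (13)
    return np.array(valleys + contrasts)           # Eq. (14): [Valley(0..B-1), SC(0..B-1)]
```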
Fig 22 The flowchart for computing OSC (Input Signal → Framing → FFT → Octave-scale filtering → Peak/Valley selection → Spectral contrast → OSC)
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)
Filter number   Frequency interval (Hz)
0   [0, 0]
1   (0, 100]
2   (100, 200]
3   (200, 400]
4   (400, 800]
5   (800, 1600]
6   (1600, 3200]
7   (3200, 6400]
8   (6400, 12800]
9   (12800, 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follows
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, denoted X(k), 1 \le k \le N,
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
P(k) = \begin{cases} \dfrac{1}{N E_w}\,|X(k)|^2, & k = 0,\ k = N/2 \\ \dfrac{2}{N E_w}\,|X(k)|^2, & 0 < k < N/2 \end{cases}   (15)
where Ew is the energy of the Hamming window function w(n) of size Nw
E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2   (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a spectrum of 8 octaves (see Fig 24). The NASE scale filtering operation can be described as follows (see Table 23):
ASE_i(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P_i(k), \quad 0 \le b < B,\ 0 \le k \le N/2 - 1   (17)
where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in the study):
r = 2^j \text{ octaves}, \quad -4 \le j \le 3   (18)
I_{b_l} and I_{b_h} are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_{b_l} = \frac{f_{b_l}}{f_s / N}, \qquad I_{b_h} = \frac{f_{b_h}}{f_s / N}   (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband
ASE(b) = \sum_{k=I_{b_l}}^{I_{b_h}} P(k), \quad 0 \le b \le B + 1   (20)
Each ASE coefficient is then converted to the decibel scale
ASE_{dB}(b) = 10 \log_{10}\big(ASE(b)\big), \quad 0 \le b \le B + 1   (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
NASE(b) = \frac{ASE_{dB}(b)}{R}, \quad 0 \le b \le B + 1   (22)
where the RMS-norm gain value R is defined as
R = \sqrt{\sum_{b=0}^{B+1} \big(ASE_{dB}(b)\big)^2}   (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
x^{NASE} = [R, NASE(0), NASE(1), \ldots, NASE(B+1)]^T   (24)
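The conversion from the band powers to the NASE vector (Eqs. (20)-(24)) can be sketched as follows. Here ase is assumed to already hold the B+2 band powers (one band below loEdge, B logarithmic bands, one band above hiEdge) computed from the normalized power spectrum of Eq. (15); the small offset inside the logarithm is an added safeguard, not part of the original definition.

```python
import numpy as np

def nase_from_ase(ase):
    """Sketch of Eqs. (21)-(24): dB conversion, RMS-norm gain and normalization."""
    ase_db = 10.0 * np.log10(ase + 1e-12)     # Eq. (21); offset only guards log(0)
    R = np.sqrt(np.sum(ase_db ** 2))          # Eq. (23): RMS-norm gain value
    nase = ase_db / R                         # Eq. (22)
    return np.concatenate(([R], nase))        # Eq. (24): [R, NASE(0), ..., NASE(B+1)]
```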
Fig 23 The flowchart for computing NASE (Input Signal → Framing → Windowing → FFT → Subband decomposition → Normalized Audio Spectral Envelope → NASE)
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (loEdge = 62.5 Hz, hiEdge = 16 kHz; one coefficient below loEdge, 16 logarithmic band coefficients, one coefficient above hiEdge)
Table 23 The range of each normalized audio spectral envelope band-pass filter
Filter number   Frequency interval (Hz)
0    (0, 62]
1    (62, 88]
2    (88, 125]
3    (125, 176]
4    (176, 250]
5    (250, 353]
6    (353, 500]
7    (500, 707]
8    (707, 1000]
9    (1000, 1414]
10   (1414, 2000]
11   (2000, 2828]
12   (2828, 4000]
13   (4000, 5656]
14   (5656, 8000]
15   (8000, 11313]
16   (11313, 16000]
17   (16000, 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals. In order to capture the time-varying behavior of the music signals, we
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 \le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:
M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times (W/2) + n}[l]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le l < L   (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|, \quad 0 \le m < W,\ 0 \le l < L   (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{MFCC}(j, l) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \big(\bar{M}^{MFCC}(m, l)\big)   (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \big(\bar{M}^{MFCC}(m, l)\big)   (28)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)   (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
Fig 25 The flowchart for extracting MMFCC
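The texture-window analysis of Eqs. (25)-(29) applies in the same way to any per-frame feature trajectory (MFCC, OSC or NASE). The following sketch, written under the settings used in the study (W = 512 frames, 50% overlap, J = 8 modulation subbands from Table 24), illustrates the idea; feat is an assumed (number of frames × number of coefficients) matrix of per-frame feature values and subband_edges holds the modulation-frequency index ranges of Table 24.

```python
import numpy as np

def modulation_msc_msv(feat, subband_edges, W=512):
    """Sketch of Eqs. (25)-(29): averaged modulation spectrogram, then MSC/MSV."""
    hop = W // 2                                         # 50% overlap between texture windows
    n_frames, n_coeffs = feat.shape
    spectra = []
    for start in range(0, n_frames - W + 1, hop):        # one FFT per texture window
        seg = feat[start:start + W, :]                   # W consecutive per-frame feature values
        spectra.append(np.abs(np.fft.fft(seg, axis=0)))  # Eq. (25): FFT along each trajectory
    M_avg = np.mean(spectra, axis=0)                     # Eq. (26): time-averaged magnitude
    J = len(subband_edges)
    msc = np.zeros((J, n_coeffs))
    msv = np.zeros((J, n_coeffs))
    for j, (m_lo, m_hi) in enumerate(subband_edges):     # Table 24 index ranges
        band = M_avg[m_lo:m_hi, :]
        msp = band.max(axis=0)                           # Eq. (27): modulation spectral peak
        msv[j] = band.min(axis=0)                        # Eq. (28): modulation spectral valley
        msc[j] = msp - msv[j]                            # Eq. (29): modulation spectral contrast
    return msc, msv                                      # two J x n_coeffs matrices
```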
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 \le d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:
M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times (W/2) + n}[d]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le d < D   (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D   (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{OSC}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \big(\bar{M}^{OSC}(m, d)\big)   (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \big(\bar{M}^{OSC}(m, d)\big)   (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)   (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.
Fig 26 The flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 \le d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:
M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times (W/2) + n}[d]\; e^{-j 2\pi m n / W}, \quad 0 \le m < W,\ 0 \le d < D   (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|, \quad 0 \le m < W,\ 0 \le d < D   (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24).
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
MSP^{NASE}(j, d) = \max_{\Phi_{j_l} \le m < \Phi_{j_h}} \big(\bar{M}^{NASE}(m, d)\big)   (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j_l} \le m < \Phi_{j_h}} \big(\bar{M}^{NASE}(m, d)\big)   (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)   (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT of each feature trajectory within texture windows → averaged modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0   [0, 2)       [0, 0.33)
1   [2, 4)       [0.33, 0.66)
2   [4, 8)       [0.66, 1.32)
3   [8, 16)      [1.32, 2.64)
4   [16, 32)     [2.64, 5.28)
5   [32, 64)     [5.28, 10.56)
6   [64, 128)    [10.56, 21.12)
7   [128, 256)   [21.12, 42.24]
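For reference, the modulation-frequency index m in Eq. (25) corresponds to a physical modulation frequency of m·(frame rate)/W Hz. With the roughly 85 frames per second implied by the 512-frame, 6-second texture window, this reproduces the Hz intervals of Table 24; the exact frame rate used below is an assumption inferred from those numbers.

```python
def modulation_index_to_hz(m, frame_rate=85.0, W=512):
    """Map a modulation-frequency index to Hz (frame rate of ~85 frames/s is assumed)."""
    return m * frame_rate / W   # e.g. m = 2 -> about 0.33 Hz, m = 256 -> about 42 Hz
```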
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 \le l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
\mu_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)   (40)

\sigma_{MSC-row}^{MFCC}(l) = \Big(\frac{1}{J} \sum_{j=0}^{J-1} \big(MSC^{MFCC}(j, l) - \mu_{MSC-row}^{MFCC}(l)\big)^2\Big)^{1/2}   (41)

\mu_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)   (42)

\sigma_{MSV-row}^{MFCC}(l) = \Big(\frac{1}{J} \sum_{j=0}^{J-1} \big(MSV^{MFCC}(j, l) - \mu_{MSV-row}^{MFCC}(l)\big)^2\Big)^{1/2}   (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
f_{row}^{MFCC} = [\mu_{MSC-row}^{MFCC}(0),\ \sigma_{MSC-row}^{MFCC}(0),\ \mu_{MSV-row}^{MFCC}(0),\ \sigma_{MSV-row}^{MFCC}(0),\ \ldots,\ \mu_{MSC-row}^{MFCC}(L-1),\ \sigma_{MSC-row}^{MFCC}(L-1),\ \mu_{MSV-row}^{MFCC}(L-1),\ \sigma_{MSV-row}^{MFCC}(L-1)]^T   (44)
Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J)
column of the MSC and MSV matrices can be computed as follows
\mu_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)   (45)

\sigma_{MSC-col}^{MFCC}(j) = \Big(\frac{1}{L} \sum_{l=0}^{L-1} \big(MSC^{MFCC}(j, l) - \mu_{MSC-col}^{MFCC}(j)\big)^2\Big)^{1/2}   (46)

\mu_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)   (47)

\sigma_{MSV-col}^{MFCC}(j) = \Big(\frac{1}{L} \sum_{l=0}^{L-1} \big(MSV^{MFCC}(j, l) - \mu_{MSV-col}^{MFCC}(j)\big)^2\Big)^{1/2}   (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{MFCC} = [\mu_{MSC-col}^{MFCC}(0),\ \sigma_{MSC-col}^{MFCC}(0),\ \mu_{MSV-col}^{MFCC}(0),\ \sigma_{MSV-col}^{MFCC}(0),\ \ldots,\ \mu_{MSC-col}^{MFCC}(J-1),\ \sigma_{MSC-col}^{MFCC}(J-1),\ \mu_{MSV-col}^{MFCC}(J-1),\ \sigma_{MSV-col}^{MFCC}(J-1)]^T   (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4L+4J) can be obtained:
f^{MFCC} = [(f_{row}^{MFCC})^T\ (f_{col}^{MFCC})^T]^T   (50)
In summary, the row-based MSCs and MSVs contribute 4L = 4×20 = 80 feature values and the column-based MSCs and MSVs contribute 4J = 4×8 = 32 feature values. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)

\sigma_{MSC-row}^{OSC}(d) = \Big(\frac{1}{J} \sum_{j=0}^{J-1} \big(MSC^{OSC}(j, d) - \mu_{MSC-row}^{OSC}(d)\big)^2\Big)^{1/2}   (52)

\mu_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)

\sigma_{MSV-row}^{OSC}(d) = \Big(\frac{1}{J} \sum_{j=0}^{J-1} \big(MSV^{OSC}(j, d) - \mu_{MSV-row}^{OSC}(d)\big)^2\Big)^{1/2}   (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC-row}^{OSC}(0),\ \sigma_{MSC-row}^{OSC}(0),\ \mu_{MSV-row}^{OSC}(0),\ \sigma_{MSV-row}^{OSC}(0),\ \ldots,\ \mu_{MSC-row}^{OSC}(D-1),\ \sigma_{MSC-row}^{OSC}(D-1),\ \mu_{MSV-row}^{OSC}(D-1),\ \sigma_{MSV-row}^{OSC}(D-1)]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)

\sigma_{MSC-col}^{OSC}(j) = \Big(\frac{1}{D} \sum_{d=0}^{D-1} \big(MSC^{OSC}(j, d) - \mu_{MSC-col}^{OSC}(j)\big)^2\Big)^{1/2}   (57)

\mu_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)

\sigma_{MSV-col}^{OSC}(j) = \Big(\frac{1}{D} \sum_{d=0}^{D-1} \big(MSV^{OSC}(j, d) - \mu_{MSV-col}^{OSC}(j)\big)^2\Big)^{1/2}   (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC-col}^{OSC}(0),\ \sigma_{MSC-col}^{OSC}(0),\ \mu_{MSV-col}^{OSC}(0),\ \sigma_{MSV-col}^{OSC}(0),\ \ldots,\ \mu_{MSC-col}^{OSC}(J-1),\ \sigma_{MSC-col}^{OSC}(J-1),\ \mu_{MSV-col}^{OSC}(J-1),\ \sigma_{MSV-col}^{OSC}(J-1)]^T   (60)

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T\ (f_{col}^{OSC})^T]^T   (61)

In summary, the row-based MSCs and MSVs contribute 4D = 4×20 = 80 feature values and the column-based MSCs and MSVs contribute 4J = 4×8 = 32 feature values. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC-row}^{NASE}(d) = \Big(\frac{1}{J} \sum_{j=0}^{J-1} \big(MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d)\big)^2\Big)^{1/2}   (63)

\mu_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV-row}^{NASE}(d) = \Big(\frac{1}{J} \sum_{j=0}^{J-1} \big(MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d)\big)^2\Big)^{1/2}   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC-row}^{NASE}(0),\ \sigma_{MSC-row}^{NASE}(0),\ \mu_{MSV-row}^{NASE}(0),\ \sigma_{MSV-row}^{NASE}(0),\ \ldots,\ \mu_{MSC-row}^{NASE}(D-1),\ \sigma_{MSC-row}^{NASE}(D-1),\ \mu_{MSV-row}^{NASE}(D-1),\ \sigma_{MSV-row}^{NASE}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC-col}^{NASE}(j) = \Big(\frac{1}{D} \sum_{d=0}^{D-1} \big(MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j)\big)^2\Big)^{1/2}   (68)

\mu_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV-col}^{NASE}(j) = \Big(\frac{1}{D} \sum_{d=0}^{D-1} \big(MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j)\big)^2\Big)^{1/2}   (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC-col}^{NASE}(0),\ \sigma_{MSC-col}^{NASE}(0),\ \mu_{MSV-col}^{NASE}(0),\ \sigma_{MSV-col}^{NASE}(0),\ \ldots,\ \mu_{MSC-col}^{NASE}(J-1),\ \sigma_{MSC-col}^{NASE}(J-1),\ \mu_{MSV-col}^{NASE}(J-1),\ \sigma_{MSV-col}^{NASE}(J-1)]^T   (71)

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T\ (f_{col}^{NASE})^T]^T   (72)

In summary, the row-based MSCs and MSVs contribute 4D = 4×19 = 76 feature values and the column-based MSCs and MSVs contribute 4J = 4×8 = 32 feature values. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based aggregation of the modulation spectral matrices: the mean μ and standard deviation σ are computed along each row (across the modulation-frequency subbands) for every feature dimension
Fig 29 The column-based aggregation of the modulation spectral matrices: the mean μ and standard deviation σ are computed along each column (across the feature dimensions) for every modulation subband
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C   (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{\max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
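A short sketch of the linear normalization of Eqs. (73)-(75) is given below; train_feats is an assumed (number of tracks × number of features) matrix of training feature vectors and labels their genre indices. The small constant in the denominator only avoids division by zero and is not part of the original equations.

```python
import numpy as np

def minmax_stats(train_feats):
    """Eq. (75): per-dimension minimum and maximum over all training tracks."""
    return train_feats.min(axis=0), train_feats.max(axis=0)

def normalize(f, f_min, f_max):
    """Eq. (74): linearly map each feature value into [0, 1]."""
    return (f - f_min) / (f_max - f_min + 1e-12)

def genre_centroids(train_feats, labels, num_classes):
    """Eq. (73): representative (mean) feature vector of each genre."""
    return np.stack([train_feats[labels == c].mean(axis=0) for c in range(num_classes)])
```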
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)
where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)
where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = \mathrm{tr}\big((A^T S_W A)^{-1} (A^T S_B A)\big)   (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by \Phi\Lambda^{-1/2}:

x_w = (\Phi\Lambda^{-1/2})^T x   (79)
It can be shown that the whitened within-class scatter matrix S_{W_w} = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2}) derived from all the whitened training vectors will become an identity matrix I. Thus the whitened between-class scatter matrix S_{B_w} = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_{B_w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix \Psi. Finally the optimal whitened LDA transformation matrix A_{WLDA} is defined as
A_{WLDA} = \Phi\Lambda^{-1/2} \Psi   (80)
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
y = A_{WLDA}^T x   (81)
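The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as below, where X is an assumed (number of samples × H) matrix of training vectors and y their class labels. This is a plain eigen-decomposition rendering of the equations for illustration, not an optimized or numerically hardened implementation.

```python
import numpy as np

def whitened_lda(X, y, num_classes):
    """Sketch of Eqs. (76)-(80): within/between scatter, whitening, LDA projection."""
    overall_mean = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                      # Eq. (76)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)                    # Eq. (77)
    evals, Phi = np.linalg.eigh(Sw)                        # Sw = Phi Lambda Phi^T
    white = Phi @ np.diag(1.0 / np.sqrt(evals + 1e-12))    # Phi Lambda^(-1/2)
    Sb_w = white.T @ Sb @ white                            # whitened between-class scatter
    evals_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(evals_b)[::-1][:num_classes - 1]    # keep the C-1 leading eigenvectors
    A_wlda = white @ Psi[:, order]                         # Eq. (80)
    return A_wlda                                          # project with y = A_wlda.T @ x (Eq. 81)
```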
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
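Classification then reduces to Eqs. (81)-(83): project the normalized feature vector with A_WLDA and pick the genre whose centroid (Eq. (82)) is nearest in Euclidean distance. A short sketch using the assumed helpers above:

```python
import numpy as np

def classify(x_norm, A_wlda, centroids):
    """Sketch of Eqs. (81)-(83): whitened-LDA projection followed by nearest centroid."""
    y = A_wlda.T @ x_norm                           # Eq. (81): project to the LDA space
    dists = np.linalg.norm(centroids - y, axis=1)   # Euclidean distance to each genre centroid
    return int(np.argmin(dists))                    # Eq. (83): index of the closest genre
```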
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
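Because the six classes are unbalanced, Eq. (84) weights each per-genre accuracy by that genre's share of the test set. A one-line sketch of this weighting:

```python
def overall_accuracy(per_genre_acc, genre_counts):
    """Eq. (84): per-genre accuracies weighted by each genre's probability of appearance."""
    total = sum(genre_counts)
    return sum(acc * n / total for acc, n in zip(per_genre_acc, genre_counts))
```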
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors
Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector gets the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors
Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors
Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d)          Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic       93.75      1.75      0.00     0.00      0.00     6.56
Electronic     0.63     83.33      0.00     4.44      6.86     7.38
Jazz           0.31      0.88     76.92     0.00      0.00     0.00
MetalPunk      0.00      0.00      0.00    77.78      9.80     0.82
PopRock        0.31      8.77     11.54    15.56     77.45     9.02
World          5.00      5.26     11.54     2.22      5.88    76.23
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) when using MSCs & MSVs versus the modulation subband energy (MSE) as the feature values
Feature Set                      MSCs & MSVs    MSE
SMMFCC1                              77.50     72.02
SMMFCC2                              70.64     69.82
SMMFCC3                              80.38     79.15
SMOSC1                               79.15     77.50
SMOSC2                               68.59     70.51
SMOSC3                               81.34     80.11
SMASE1                               77.78     76.41
SMASE2                               71.74     71.06
SMASE3                               81.21     79.15
SMMFCC1+SMOSC1+SMASE1                84.64     85.08
SMMFCC2+SMOSC2+SMASE2                78.60     79.01
SMMFCC3+SMOSC3+SMASE3                85.32     85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis, P Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302
[2] T Li, M Ogihara, Q Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289
[3] D N Jiang, L Lu, H J Zhang, J H Tao, L H Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, Vol. 1, 2002, pp. 113-116
[4] K West and S Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004
[5] K Umapathy, S Krishnan, S Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315
[6] M F McKinney, J Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158
[7] J J Aucouturier, F Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93
[8] U Bağci, E Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524
[9] A Meng, P Ahrendt, J Larsen, L K Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664
[10] T Lidy, A Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41
[11] M Grimaldi, P Cunningham, A Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108
[12] J J Aucouturier, F Pachet, and M Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, Vol. 7, Issue 6, pp. 1028-1035, Dec. 2005
[13] J Jose Burred and A Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003
[14] J G A Barbedo and A Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, June 2006
[15] T Li and M Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 5, pp. 197-200, March 2005
[16] J J Aucouturier and F Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003
[17] H G Kim, N Moreau, T Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005
[19] W A Sethares, R D Robin, J C Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285
[20] G Tzanetakis, A Ermolinskyi, and P Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002
[21] T Tolonen and M Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000
[22] R Meddis and L O'Mard, "A unitary model of pitch perception," Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, September 1997
[23] N Scaringella, G Zoia, and D Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, Vol. 23, Issue 2, pp. 133-141, Mar. 2006
[24] B Kingsbury, N Morgan, and S Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun., Vol. 25, No. 1, pp. 117-132, 1998
[25] S Sukittanon, L E Atlas, and J W Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, October 2004
[26] Y Y Shi, X Zhu, H G Kim, and K W Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006
[27] H G Kim, N Moreau, T Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005
[28] R Duda, P Hart, and D Stork, Pattern Classification, New York: Wiley, 2000
[29] C Xu, N C Maddage, and X Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005
[30] S Esmaili, S Krishnan, and K Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 5, pp. V-665-8, May 2004
[31] K Umapathy, S Krishnan, and R K Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, Issue 4, pp. 1236-1246, May 2007
[32] M Grimaldi, P Cunningham, A Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003
[33] J Bergstra, N Casagrande, D Erhan, D Eck, B Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire, "A decision-theoretic generalization of online learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139
16
Therefore the MFCC feature vector can be represented as follows
xMFCC = [MFCC(0) MFCC(1) hellip MFCC(L-1)]T (8)
Fig 21 The flowchart for computing MFCC
Pre-emphasis
Input Signal
Framing
Windowing
FFT
Mel-scale band-pass filtering
DCT
MFCC
17
Table 21 The range of each triangular band-pass filter
Filter number Frequency interval (Hz) 0 (0 200] 1 (100 300] 2 (200 400] 3 (300 500] 4 (400 600] 5 (500 700] 6 (600 800] 7 (700 900] 8 (800 1000] 9 (900 1149] 10 (1000 1320] 11 (1149 1516] 12 (1320 1741] 13 (1516 2000] 14 (1741 2297] 15 (2000 2639] 16 (2297 3031] 17 (2639 3482] 18 (3031 4000] 19 (3482 4595] 20 (4000 5278] 20 (4595 6063] 22 (5278 6964] 23 (6063 8000] 24 (6964 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
18
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (9)
where B is the number of subbands Ibl and Ibh denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter
Ai[k] is the squared amplitude of Xi[k] that is |][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
19
)1log()(1
sum=
=bN
iib
b
MN
bPeakα
α (11)
)1log()(1
1sum=
+minus=b
b
N
iiNb
b
MN
bValleyα
α (12)
where α is a neighborhood factor (α is 02 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
)( )()( bValleybPeakbSC minus= (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0) hellip Valley(B-1) SC(0) hellip SC(B-1)]T (14)
Input Signal
Framing
Octave scale filtering
PeakValley Selection
Spectral Contrast
OSC
FFT
Fig 22 The flowchart for computing OSC
20
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 441 kHz)
Filter number Frequency interval (Hz)0 [0 0] 1 (0 100] 2 (100 200] 3 (200 400] 4 (400 800] 5 (800 1600] 6 (1600 3200] 7 (3200 6400] 8 (6400 12800] 9 (12800 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum notated X(k) 1 le k le N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
20|)(|2
2 0|)(|1
)(2
2
⎪⎪⎩
⎪⎪⎨
⎧
ltltsdot
=sdot=
NkkXEN
NkkXENkP
w
w (15)
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a
spectrum of 8 octave interval (see Fig24) The NASE scale filtering
operation can be described as follows(see Table 23)
)()(
sum=
=hb
lb
I
Ikii kPbASE 120 0 minusleleltle NkBb
(17)
where B is the number of logarithmic subbands within the frequency range
[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of
the frequency subbands ranging from 116 of an octave to 8 octaves(B=16
r=12 in the study)
(18) 34 octaves 2 leleminus= jr j
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
22
spectrum coefficients within this subband
(20) 10 )()(
+lele= sum=
BbkPbASEhb
lb
I
Ik
Each ASE coefficient is then converted to the decibel scale
10 ))((log 10)( 10 +lele= BbbASEbASEdB (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
10 )()( +lele= BbR
bASEbNASE dB (22)
where the RMS-norm gain value R is defined as
))((1
0
2sum+
=
=B
bdB bASER (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R NASE(0) NASE(1) hellip NASE(B+1)]T (24)
23
Framing
Input Signal
Windowing
FFT
Normalized Audio Spectral Envelope
NASE
Subband Decomposition
Fig 23 The flowchart for computing NASE
625 125 250 500 1K 2K 4K 8K 16K
884 1768 3536 7071 14142 28284 56569 113137
1 coeff 16 coeffs 1 coeff
loEdge hiEdge
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 12
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$$\mathrm{MSC}^{\mathrm{MFCC}}(j, l) = \mathrm{MSP}^{\mathrm{MFCC}}(j, l) - \mathrm{MSV}^{\mathrm{MFCC}}(j, l) \qquad (29)$$
As a result, all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
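The procedure above can be summarized in a short NumPy sketch. This is a minimal illustration rather than the exact implementation of this thesis: the function name `modulation_contrast`, the input layout (an I×L matrix of per-frame feature values), and the hop of W/2 frames (the stated 50% overlap) are assumptions made for the example. The same routine applies unchanged to the OSC and NASE trajectories used in the following two subsections; only the number of feature trajectories differs.

```python
import numpy as np

def modulation_contrast(features, W=512, J=8):
    """Sketch of Steps 2-3: `features` is an (I, L) array whose i-th row holds
    the L per-frame feature values (MFCC, OSC or NASE) of the i-th frame."""
    I, L = features.shape
    hop = W // 2                                   # 50% overlap between texture windows
    # modulation spectrogram of each texture window, averaged over the track (Eqs. 25-26)
    M = np.zeros((W, L))
    T = 0
    for start in range(0, I - W + 1, hop):
        seg = features[start:start + W, :]         # W frames x L feature trajectories
        M += np.abs(np.fft.fft(seg, axis=0)) / W   # FFT along each time trajectory
        T += 1
    M /= max(T, 1)
    # logarithmically spaced modulation subbands: bins [0,2), [2,4), ..., [128,256)
    edges = [0] + [2 ** (j + 1) for j in range(J)]
    MSC = np.zeros((J, L))
    MSV = np.zeros((J, L))
    for j in range(J):
        band = M[edges[j]:edges[j + 1], :]
        MSP = band.max(axis=0)                     # modulation spectral peak, Eq. (27)
        MSV[j] = band.min(axis=0)                  # modulation spectral valley, Eq. (28)
        MSC[j] = MSP - MSV[j]                      # modulation spectral contrast, Eq. (29)
    return MSC, MSV                                # each is a J x L matrix
```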
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC, the same modulation spectrum
analysis is applied to the OSC feature values. Fig 26 shows the flowchart for
extracting MOSC, and the detailed steps are described below.
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let $\mathrm{OSC}_i[d]$, $0 \le d < D$, be the d-th OSC feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \frac{1}{W}\sum_{n=0}^{W-1} \mathrm{OSC}_{t\times W+n}[d]\ e^{-j\frac{2\pi}{W}mn}, \qquad 0 \le m < W,\ 0 \le d < D \qquad (30)$$
where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index, and d is the OSC coefficient index. In this
study, W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows. The representative modulation spectrogram of a
music track is derived by time-averaging the magnitude modulation
spectrograms of all texture windows:
$$\bar{M}^{\mathrm{OSC}}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, d)\right|, \qquad 0 \le m < W,\ 0 \le d < D \qquad (31)$$
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands. In this study,
the number of modulation subbands is 8 (J = 8). The frequency interval of
each modulation subband is shown in Table 24. For each feature value, the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
$$\mathrm{MSP}^{\mathrm{OSC}}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{OSC}}(m, d) \qquad (32)$$

$$\mathrm{MSV}^{\mathrm{OSC}}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{OSC}}(m, d) \qquad (33)$$
where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, $0 \le j < J$.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands. Therefore, the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$$\mathrm{MSC}^{\mathrm{OSC}}(j, d) = \mathrm{MSP}^{\mathrm{OSC}}(j, d) - \mathrm{MSV}^{\mathrm{OSC}}(j, d) \qquad (34)$$
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE, the same modulation spectrum
analysis is applied to the NASE feature values. Fig 27 shows the flowchart for
extracting MASE, and the detailed steps are described below.
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let $\mathrm{NASE}_i[d]$, $0 \le d < D$, be the d-th NASE feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

$$M_t(m, d) = \frac{1}{W}\sum_{n=0}^{W-1} \mathrm{NASE}_{t\times W+n}[d]\ e^{-j\frac{2\pi}{W}mn}, \qquad 0 \le m < W,\ 0 \le d < D \qquad (35)$$
where $M_t(m, d)$ is the modulation spectrogram for the t-th texture window, m
is the modulation frequency index, and d is the NASE coefficient index. In
this study, W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows. The representative modulation spectrogram
of a music track is derived by time-averaging the magnitude modulation
spectrograms of all texture windows:
$$\bar{M}^{\mathrm{NASE}}(m, d) = \frac{1}{T}\sum_{t=1}^{T}\left|M_t(m, d)\right|, \qquad 0 \le m < W,\ 0 \le d < D \qquad (36)$$
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24).
In this study, the number of modulation subbands is 8 (J = 8). The frequency
interval of each modulation subband is shown in Table 24. For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
$$\mathrm{MSP}^{\mathrm{NASE}}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{NASE}}(m, d) \qquad (37)$$

$$\mathrm{MSV}^{\mathrm{NASE}}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} \bar{M}^{\mathrm{NASE}}(m, d) \qquad (38)$$
where $\Phi_{j,l}$ and $\Phi_{j,h}$ are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, $0 \le j < J$.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands. Therefore, the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
$$\mathrm{MSC}^{\mathrm{NASE}}(j, d) = \mathrm{MSP}^{\mathrm{NASE}}(j, d) - \mathrm{MSV}^{\mathrm{NASE}}(j, d) \qquad (39)$$
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (music signal → framing → NASE extraction of each frame → DFT of each NASE trajectory within a texture window → windowing and averaging of the modulation spectra → contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
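The index ranges in Table 24 are powers of two, so the Hz boundaries follow directly from the modulation-frequency resolution. The sketch below assumes a frame rate of about 84.48 frames per second, which is what the table's 42.24 Hz upper edge implies for a W = 512 texture window; this frame rate is an inference, not a value stated in the text.

```python
# Illustrative mapping from modulation FFT bin index to modulation frequency (Hz),
# assuming FRAME_RATE = 84.48 frames/s (inferred from Table 24) and W = 512.
W = 512
FRAME_RATE = 84.48
J = 8

edges = [0] + [2 ** (j + 1) for j in range(J)]   # [0, 2, 4, ..., 256]
for j in range(J):
    lo, hi = edges[j], edges[j + 1]
    print(f"subband {j}: bins [{lo}, {hi}) -> "
          f"[{lo * FRAME_RATE / W:.2f}, {hi * FRAME_RATE / W:.2f}) Hz")
```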
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband across different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
$$u^{\mathrm{MFCC}}_{\mathrm{MSC,row}}(l) = \frac{1}{J}\sum_{j=0}^{J-1}\mathrm{MSC}^{\mathrm{MFCC}}(j, l) \qquad (40)$$

$$\sigma^{\mathrm{MFCC}}_{\mathrm{MSC,row}}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(\mathrm{MSC}^{\mathrm{MFCC}}(j, l) - u^{\mathrm{MFCC}}_{\mathrm{MSC,row}}(l)\bigr)^2\right)^{1/2} \qquad (41)$$

$$u^{\mathrm{MFCC}}_{\mathrm{MSV,row}}(l) = \frac{1}{J}\sum_{j=0}^{J-1}\mathrm{MSV}^{\mathrm{MFCC}}(j, l) \qquad (42)$$

$$\sigma^{\mathrm{MFCC}}_{\mathrm{MSV,row}}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(\mathrm{MSV}^{\mathrm{MFCC}}(j, l) - u^{\mathrm{MFCC}}_{\mathrm{MSV,row}}(l)\bigr)^2\right)^{1/2} \qquad (43)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
$$\mathbf{f}^{\mathrm{MFCC}}_{\mathrm{row}} = \bigl[u^{\mathrm{MFCC}}_{\mathrm{MSC,row}}(0),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSC,row}}(0),\ u^{\mathrm{MFCC}}_{\mathrm{MSV,row}}(0),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSV,row}}(0),\ \ldots,\ u^{\mathrm{MFCC}}_{\mathrm{MSC,row}}(L-1),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSC,row}}(L-1),\ u^{\mathrm{MFCC}}_{\mathrm{MSV,row}}(L-1),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSV,row}}(L-1)\bigr]^T \qquad (44)$$
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
$$u^{\mathrm{MFCC}}_{\mathrm{MSC,col}}(j) = \frac{1}{L}\sum_{l=0}^{L-1}\mathrm{MSC}^{\mathrm{MFCC}}(j, l) \qquad (45)$$

$$\sigma^{\mathrm{MFCC}}_{\mathrm{MSC,col}}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(\mathrm{MSC}^{\mathrm{MFCC}}(j, l) - u^{\mathrm{MFCC}}_{\mathrm{MSC,col}}(j)\bigr)^2\right)^{1/2} \qquad (46)$$

$$u^{\mathrm{MFCC}}_{\mathrm{MSV,col}}(j) = \frac{1}{L}\sum_{l=0}^{L-1}\mathrm{MSV}^{\mathrm{MFCC}}(j, l) \qquad (47)$$

$$\sigma^{\mathrm{MFCC}}_{\mathrm{MSV,col}}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\bigl(\mathrm{MSV}^{\mathrm{MFCC}}(j, l) - u^{\mathrm{MFCC}}_{\mathrm{MSV,col}}(j)\bigr)^2\right)^{1/2} \qquad (48)$$
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
$$\mathbf{f}^{\mathrm{MFCC}}_{\mathrm{col}} = \bigl[u^{\mathrm{MFCC}}_{\mathrm{MSC,col}}(0),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSC,col}}(0),\ u^{\mathrm{MFCC}}_{\mathrm{MSV,col}}(0),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSV,col}}(0),\ \ldots,\ u^{\mathrm{MFCC}}_{\mathrm{MSC,col}}(J-1),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSC,col}}(J-1),\ u^{\mathrm{MFCC}}_{\mathrm{MSV,col}}(J-1),\ \sigma^{\mathrm{MFCC}}_{\mathrm{MSV,col}}(J-1)\bigr]^T \qquad (49)$$
If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained:

$$\mathbf{f}^{\mathrm{MFCC}} = \bigl[(\mathbf{f}^{\mathrm{MFCC}}_{\mathrm{row}})^T,\ (\mathbf{f}^{\mathrm{MFCC}}_{\mathrm{col}})^T\bigr]^T \qquad (50)$$
In summary, the row-based modulation spectral feature vector is of size 4L = 4×20 = 80 and the
column-based one is of size 4J = 4×8 = 32. Combining the row-based and
column-based modulation spectral feature vectors results in a feature vector of
length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
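A small sketch of this aggregation is given below, assuming the MSC and MSV matrices are stored as J×L NumPy arrays as in the earlier sketch. The exact interleaving of equation (44) is not reproduced; the sketch only gathers the same set of 4L row-based and 4J column-based statistics.

```python
import numpy as np

def aggregate_stats(MSC, MSV):
    """Row/column means and standard deviations of the J x L (or J x D)
    MSC and MSV matrices, concatenated into one feature vector."""
    feats = []
    for M in (MSC, MSV):
        # row-based statistics: aggregate over the J modulation subbands (axis 0)
        feats.append(M.mean(axis=0))
        feats.append(M.std(axis=0))
    for M in (MSC, MSV):
        # column-based statistics: aggregate over the feature dimension (axis 1)
        feats.append(M.mean(axis=1))
        feats.append(M.std(axis=1))
    return np.concatenate(feats)   # length 4L + 4J, e.g. 112 for SMMFCC
```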
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows
$$u^{\mathrm{OSC}}_{\mathrm{MSC,row}}(d) = \frac{1}{J}\sum_{j=0}^{J-1}\mathrm{MSC}^{\mathrm{OSC}}(j, d) \qquad (51)$$

$$\sigma^{\mathrm{OSC}}_{\mathrm{MSC,row}}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(\mathrm{MSC}^{\mathrm{OSC}}(j, d) - u^{\mathrm{OSC}}_{\mathrm{MSC,row}}(d)\bigr)^2\right)^{1/2} \qquad (52)$$

$$u^{\mathrm{OSC}}_{\mathrm{MSV,row}}(d) = \frac{1}{J}\sum_{j=0}^{J-1}\mathrm{MSV}^{\mathrm{OSC}}(j, d) \qquad (53)$$

$$\sigma^{\mathrm{OSC}}_{\mathrm{MSV,row}}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(\mathrm{MSV}^{\mathrm{OSC}}(j, d) - u^{\mathrm{OSC}}_{\mathrm{MSV,row}}(d)\bigr)^2\right)^{1/2} \qquad (54)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

$$\mathbf{f}^{\mathrm{OSC}}_{\mathrm{row}} = \bigl[u^{\mathrm{OSC}}_{\mathrm{MSC,row}}(0),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSC,row}}(0),\ u^{\mathrm{OSC}}_{\mathrm{MSV,row}}(0),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSV,row}}(0),\ \ldots,\ u^{\mathrm{OSC}}_{\mathrm{MSC,row}}(D-1),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSC,row}}(D-1),\ u^{\mathrm{OSC}}_{\mathrm{MSV,row}}(D-1),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSV,row}}(D-1)\bigr]^T \qquad (55)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

$$u^{\mathrm{OSC}}_{\mathrm{MSC,col}}(j) = \frac{1}{D}\sum_{d=0}^{D-1}\mathrm{MSC}^{\mathrm{OSC}}(j, d) \qquad (56)$$

$$\sigma^{\mathrm{OSC}}_{\mathrm{MSC,col}}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(\mathrm{MSC}^{\mathrm{OSC}}(j, d) - u^{\mathrm{OSC}}_{\mathrm{MSC,col}}(j)\bigr)^2\right)^{1/2} \qquad (57)$$

$$u^{\mathrm{OSC}}_{\mathrm{MSV,col}}(j) = \frac{1}{D}\sum_{d=0}^{D-1}\mathrm{MSV}^{\mathrm{OSC}}(j, d) \qquad (58)$$

$$\sigma^{\mathrm{OSC}}_{\mathrm{MSV,col}}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(\mathrm{MSV}^{\mathrm{OSC}}(j, d) - u^{\mathrm{OSC}}_{\mathrm{MSV,col}}(j)\bigr)^2\right)^{1/2} \qquad (59)$$

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

$$\mathbf{f}^{\mathrm{OSC}}_{\mathrm{col}} = \bigl[u^{\mathrm{OSC}}_{\mathrm{MSC,col}}(0),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSC,col}}(0),\ u^{\mathrm{OSC}}_{\mathrm{MSV,col}}(0),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSV,col}}(0),\ \ldots,\ u^{\mathrm{OSC}}_{\mathrm{MSC,col}}(J-1),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSC,col}}(J-1),\ u^{\mathrm{OSC}}_{\mathrm{MSV,col}}(J-1),\ \sigma^{\mathrm{OSC}}_{\mathrm{MSV,col}}(J-1)\bigr]^T \qquad (60)$$

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

$$\mathbf{f}^{\mathrm{OSC}} = \bigl[(\mathbf{f}^{\mathrm{OSC}}_{\mathrm{row}})^T,\ (\mathbf{f}^{\mathrm{OSC}}_{\mathrm{col}})^T\bigr]^T \qquad (61)$$

In summary, the row-based modulation spectral feature vector is of size 4D = 4×20 = 80 and the
column-based one is of size 4J = 4×8 = 32. Combining the row-based and
column-based modulation spectral feature vectors results in a feature vector of
length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows:
$$u^{\mathrm{NASE}}_{\mathrm{MSC,row}}(d) = \frac{1}{J}\sum_{j=0}^{J-1}\mathrm{MSC}^{\mathrm{NASE}}(j, d) \qquad (62)$$

$$\sigma^{\mathrm{NASE}}_{\mathrm{MSC,row}}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(\mathrm{MSC}^{\mathrm{NASE}}(j, d) - u^{\mathrm{NASE}}_{\mathrm{MSC,row}}(d)\bigr)^2\right)^{1/2} \qquad (63)$$

$$u^{\mathrm{NASE}}_{\mathrm{MSV,row}}(d) = \frac{1}{J}\sum_{j=0}^{J-1}\mathrm{MSV}^{\mathrm{NASE}}(j, d) \qquad (64)$$

$$\sigma^{\mathrm{NASE}}_{\mathrm{MSV,row}}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(\mathrm{MSV}^{\mathrm{NASE}}(j, d) - u^{\mathrm{NASE}}_{\mathrm{MSV,row}}(d)\bigr)^2\right)^{1/2} \qquad (65)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
$$\mathbf{f}^{\mathrm{NASE}}_{\mathrm{row}} = \bigl[u^{\mathrm{NASE}}_{\mathrm{MSC,row}}(0),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSC,row}}(0),\ u^{\mathrm{NASE}}_{\mathrm{MSV,row}}(0),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSV,row}}(0),\ \ldots,\ u^{\mathrm{NASE}}_{\mathrm{MSC,row}}(D-1),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSC,row}}(D-1),\ u^{\mathrm{NASE}}_{\mathrm{MSV,row}}(D-1),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSV,row}}(D-1)\bigr]^T \qquad (66)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

$$u^{\mathrm{NASE}}_{\mathrm{MSC,col}}(j) = \frac{1}{D}\sum_{d=0}^{D-1}\mathrm{MSC}^{\mathrm{NASE}}(j, d) \qquad (67)$$

$$\sigma^{\mathrm{NASE}}_{\mathrm{MSC,col}}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(\mathrm{MSC}^{\mathrm{NASE}}(j, d) - u^{\mathrm{NASE}}_{\mathrm{MSC,col}}(j)\bigr)^2\right)^{1/2} \qquad (68)$$

$$u^{\mathrm{NASE}}_{\mathrm{MSV,col}}(j) = \frac{1}{D}\sum_{d=0}^{D-1}\mathrm{MSV}^{\mathrm{NASE}}(j, d) \qquad (69)$$

$$\sigma^{\mathrm{NASE}}_{\mathrm{MSV,col}}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(\mathrm{MSV}^{\mathrm{NASE}}(j, d) - u^{\mathrm{NASE}}_{\mathrm{MSV,col}}(j)\bigr)^2\right)^{1/2} \qquad (70)$$

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

$$\mathbf{f}^{\mathrm{NASE}}_{\mathrm{col}} = \bigl[u^{\mathrm{NASE}}_{\mathrm{MSC,col}}(0),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSC,col}}(0),\ u^{\mathrm{NASE}}_{\mathrm{MSV,col}}(0),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSV,col}}(0),\ \ldots,\ u^{\mathrm{NASE}}_{\mathrm{MSC,col}}(J-1),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSC,col}}(J-1),\ u^{\mathrm{NASE}}_{\mathrm{MSV,col}}(J-1),\ \sigma^{\mathrm{NASE}}_{\mathrm{MSV,col}}(J-1)\bigr]^T \qquad (71)$$

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

$$\mathbf{f}^{\mathrm{NASE}} = \bigl[(\mathbf{f}^{\mathrm{NASE}}_{\mathrm{row}})^T,\ (\mathbf{f}^{\mathrm{NASE}}_{\mathrm{col}})^T\bigr]^T \qquad (72)$$

In summary, the row-based modulation spectral feature vector is of size 4D = 4×19 = 76 and the
column-based one is of size 4J = 4×8 = 32. Combining the row-based and
column-based modulation spectral feature vectors results in a feature vector of
length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based modulation spectral feature values: the mean μ_row and standard deviation σ_row of each row of the MSC/MSV matrix are computed across the modulation frequency axis

Fig 29 The column-based modulation spectral feature values: the mean μ_col and standard deviation σ_col of each column of the MSC/MSV matrix are computed across the feature dimension axis
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
$$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n} \qquad (73)$$
where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th
music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$
is the number of training music signals belonging to the c-th music genre. Since the
dynamic ranges of the various feature values may be different, a linear normalization is
applied to get the normalized feature vector $\hat{\mathbf{f}}_c$:
$$\hat{f}_c(m) = \frac{f_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \qquad 1 \le c \le C \qquad (74)$$
where C is the number of classes, $f_c(m)$ denotes the m-th feature value of the c-th
representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the
maximum and minimum of the m-th feature values of all training music signals:

$$f_{\max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m) \qquad (75)$$

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre.
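The training-phase averaging and normalization of equations (73)-(75) can be sketched as follows. The input layout `train_feats` (a list of per-genre feature matrices) is an assumption made for the example; in practice the same f_min and f_max would also be applied to normalize test vectors, and a zero range should be guarded against.

```python
import numpy as np

def genre_representatives(train_feats):
    """`train_feats[c]` is an (N_c, M) array of feature vectors for genre c."""
    reps = np.stack([X.mean(axis=0) for X in train_feats])   # Eq. (73): one row per genre
    all_train = np.vstack(train_feats)
    f_min = all_train.min(axis=0)                            # Eq. (75)
    f_max = all_train.max(axis=0)
    reps_hat = (reps - f_min) / (f_max - f_min)              # Eq. (74): linear normalization
    return reps_hat, f_min, f_max
```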
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy in a lower-dimensional feature vector space. LDA deals with the
discrimination between various classes rather than the representation of all classes.
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance. In LDA, an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in
order to provide higher discriminability among various music classes.
Let $\mathbf{S}_W$ and $\mathbf{S}_B$ denote the within-class scatter matrix and between-class scatter
matrix, respectively. The within-class scatter matrix is defined as
$$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c}(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^T \qquad (76)$$
where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class
c, C is the total number of music classes, and $N_c$ is the number of training vectors
labeled as class c. The between-class scatter matrix is given by
$$\mathbf{S}_B = \sum_{c=1}^{C} N_c\,(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T \qquad (77)$$
where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:
$$J_F(\mathbf{A}) = \mathrm{tr}\bigl((\mathbf{A}^T\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^T\mathbf{S}_B\mathbf{A})\bigr) \qquad (78)$$
From the above equation, we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23]. First, the eigenvalues and corresponding
eigenvectors of $\mathbf{S}_W$ are calculated. Let $\mathbf{\Phi}$ denote the matrix whose columns are the
orthonormal eigenvectors of $\mathbf{S}_W$ and $\mathbf{\Lambda}$ the diagonal matrix formed by the
corresponding eigenvalues. Thus $\mathbf{S}_W\mathbf{\Phi} = \mathbf{\Phi}\mathbf{\Lambda}$. Each training vector $\mathbf{x}$ is then
whitening-transformed by $\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}$:
$$\mathbf{x}_w = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\,\mathbf{x} \qquad (79)$$
It can be shown that the whitened within-class scatter matrix
$\mathbf{S}_{W_w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\,\mathbf{S}_W\,(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors will
become an identity matrix $\mathbf{I}$. Thus, the whitened between-class scatter matrix
$\mathbf{S}_{B_w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^T\,\mathbf{S}_B\,(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information. A
transformation matrix $\mathbf{\Psi}$ can be determined by finding the eigenvectors of $\mathbf{S}_{B_w}$.
Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors
corresponding to the (C−1) largest eigenvalues will form the column vectors of the
transformation matrix $\mathbf{\Psi}$. Finally, the optimal whitened LDA transformation matrix
$\mathbf{A}_{\mathrm{WLDA}}$ is defined as

$$\mathbf{A}_{\mathrm{WLDA}} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\,\mathbf{\Psi} \qquad (80)$$
$\mathbf{A}_{\mathrm{WLDA}}$ will be employed to transform each H-dimensional feature vector into a lower
h-dimensional vector. Let $\mathbf{x}$ denote the H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{\mathrm{WLDA}}^T\,\mathbf{x} \qquad (81)$$
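A compact sketch of the whitened LDA computation of equations (76)-(81) is given below. It assumes the training vectors are stacked in an (N, H) array X with integer labels y, and that S_W is nonsingular (in practice it may need regularization); it is an illustration, not the thesis implementation.

```python
import numpy as np

def whitened_lda(X, y, C):
    """Returns the H x (C-1) whitened LDA transformation matrix A_WLDA."""
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                              # Eq. (76)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)     # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                                  # Sw = Phi Lambda Phi^T
    Wh = Phi @ np.diag(1.0 / np.sqrt(lam))                         # whitening matrix Phi Lambda^{-1/2}
    Sb_w = Wh.T @ Sb @ Wh                                          # whitened between-class scatter
    evals, evecs = np.linalg.eigh(Sb_w)
    Psi = evecs[:, np.argsort(evals)[::-1][:C - 1]]                # (C-1) largest eigenvalues
    return Wh @ Psi                                                # Eq. (80)

# y_reduced = A_WLDA.T @ x reduces an H-dimensional vector to C-1 dimensions (Eq. (81))
```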
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track. The same
linear normalization process is applied to each feature value. The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix $\mathbf{A}_{\mathrm{WLDA}}$. Let $\mathbf{y}$ denote the whitened LDA
transformed feature vector. In this study, the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
$$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n} \qquad (82)$$
where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, $\bar{\mathbf{y}}_c$ is the representative feature vector of the
c-th music genre, and $N_c$ is the number of training music tracks labeled as the c-th
music genre. The distance between two feature vectors is measured by the Euclidean
distance. Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to $\mathbf{y}$:
$$s = \arg\min_{1 \le c \le C} d(\mathbf{y},\ \bar{\mathbf{y}}_c) \qquad (83)$$
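The classification rule of equations (82)-(83) amounts to a nearest-centroid decision in the whitened LDA space. A minimal sketch follows, with `centroids` assumed to hold one row per genre (the representative vectors of equation (82)).

```python
import numpy as np

def classify(y_vec, centroids):
    """`y_vec` is the whitened-LDA vector of a test track; `centroids` is (C, h)."""
    dists = np.linalg.norm(centroids - y_vec, axis=1)   # Euclidean distance to each genre centroid
    return int(np.argmin(dists))                        # index s of the identified genre
```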
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison. The database consists of 1458 music tracks, in
which 729 music tracks are used for training and the other 729 tracks for testing. The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study, each MP3 audio file is first converted into raw digital audio before
classification. These music tracks are classified into six classes (that is, C = 6):
Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop, and 122/122 tracks of the World music genre.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
$$CA = \sum_{1 \le c \le C} P_c \cdot CA_c \qquad (84)$$

where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the
classification accuracy for the c-th music genre.
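As a quick check of equation (84), weighting the per-genre accuracies of the best feature combination (Table 36(d), later in this chapter) by the class proportions of the test split reproduces the reported overall accuracy; the snippet below is illustrative only.

```python
import numpy as np

n_test = np.array([320, 114, 26, 45, 102, 122])   # test tracks per genre
ca_c = np.array([0.9375, 0.8333, 0.7692, 0.7778, 0.7745, 0.7623])  # per-genre accuracy, Table 36(d)
P_c = n_test / n_test.sum()
CA = np.sum(P_c * ca_c)
print(f"overall accuracy = {CA:.4f}")   # about 0.8532
```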
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices.
Table 31 Averaged classification accuracy (CA%) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) SMMFCC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     275  0  2  0  1  19
Electronic  0  91  0  1  7  6
Jazz        6  0  18  0  0  4
MetalPunk   2  3  0  36  20  4
PopRock     4  12  5  8  70  14
World       33  8  1  0  4  75
Total       320  114  26  45  102  122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     85.94  0.00  7.69  0.00  0.98  15.57
Electronic  0.00  79.82  0.00  2.22  6.86  4.92
Jazz        1.88  0.00  69.23  0.00  0.00  3.28
MetalPunk   0.63  2.63  0.00  80.00  19.61  3.28
PopRock     1.25  10.53  19.23  17.78  68.63  11.48
World       10.31  7.02  3.85  0.00  3.92  61.48

(b) SMOSC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     292  1  1  0  2  10
Electronic  1  89  1  2  11  11
Jazz        4  0  19  1  1  6
MetalPunk   0  5  0  32  21  3
PopRock     0  13  3  10  61  8
World       23  6  2  0  6  84
Total       320  114  26  45  102  122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     91.25  0.88  3.85  0.00  1.96  8.20
Electronic  0.31  78.07  3.85  4.44  10.78  9.02
Jazz        1.25  0.00  73.08  2.22  0.98  4.92
MetalPunk   0.00  4.39  0.00  71.11  20.59  2.46
PopRock     0.00  11.40  11.54  22.22  59.80  6.56
World       7.19  5.26  7.69  0.00  5.88  68.85

(c) SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     286  3  1  0  3  18
Electronic  0  87  1  1  9  5
Jazz        5  4  17  0  0  9
MetalPunk   0  4  1  36  18  4
PopRock     1  10  3  7  68  13
World       28  6  3  1  4  73
Total       320  114  26  45  102  122

(c) SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     89.38  2.63  3.85  0.00  2.94  14.75
Electronic  0.00  76.32  3.85  2.22  8.82  4.10
Jazz        1.56  3.51  65.38  0.00  0.00  7.38
MetalPunk   0.00  3.51  3.85  80.00  17.65  3.28
PopRock     0.31  8.77  11.54  15.56  66.67  10.66
World       8.75  5.26  11.54  2.22  3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     300  0  1  0  0  9
Electronic  0  96  1  1  9  9
Jazz        2  1  21  0  0  1
MetalPunk   0  1  0  34  8  1
PopRock     1  9  2  9  80  16
World       17  7  1  1  5  86
Total       320  114  26  45  102  122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     93.75  0.00  3.85  0.00  0.00  7.38
Electronic  0.00  84.21  3.85  2.22  8.82  7.38
Jazz        0.63  0.88  80.77  0.00  0.00  0.82
MetalPunk   0.00  0.88  0.00  75.56  7.84  0.82
PopRock     0.31  7.89  7.69  20.00  78.43  13.11
World       5.31  6.14  3.85  2.22  4.90  70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see
that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2,
which differs from the row-based case. As before, the combined feature
vector gets the best performance. Table 34 shows the corresponding confusion
matrices.
Table 33 Averaged classification accuracy (CA%) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) SMMFCC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     272  1  1  0  6  22
Electronic  0  84  0  2  8  4
Jazz        13  1  19  1  2  19
MetalPunk   2  7  0  39  30  4
PopRock     0  11  3  3  47  19
World       33  10  3  0  9  54
Total       320  114  26  45  102  122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     85.00  0.88  3.85  0.00  5.88  18.03
Electronic  0.00  73.68  0.00  4.44  7.84  3.28
Jazz        4.06  0.88  73.08  2.22  1.96  15.57
MetalPunk   0.63  6.14  0.00  86.67  29.41  3.28
PopRock     0.00  9.65  11.54  6.67  46.08  15.57
World       10.31  8.77  11.54  0.00  8.82  44.26

(b) SMOSC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     262  2  0  0  3  33
Electronic  0  83  0  1  9  6
Jazz        17  1  20  0  6  20
MetalPunk   1  5  0  33  21  2
PopRock     0  17  4  10  51  10
World       40  6  2  1  12  51
Total       320  114  26  45  102  122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     81.88  1.75  0.00  0.00  2.94  27.05
Electronic  0.00  72.81  0.00  2.22  8.82  4.92
Jazz        5.31  0.88  76.92  0.00  5.88  16.39
MetalPunk   0.31  4.39  0.00  73.33  20.59  1.64
PopRock     0.00  14.91  15.38  22.22  50.00  8.20
World       12.50  5.26  7.69  2.22  11.76  41.80

(c) SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     277  0  0  0  2  29
Electronic  0  83  0  1  5  2
Jazz        9  3  17  1  2  15
MetalPunk   1  5  1  35  24  7
PopRock     2  13  1  8  57  15
World       31  10  7  0  12  54
Total       320  114  26  45  102  122

(c) SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     86.56  0.00  0.00  0.00  1.96  23.77
Electronic  0.00  72.81  0.00  2.22  4.90  1.64
Jazz        2.81  2.63  65.38  2.22  1.96  12.30
MetalPunk   0.31  4.39  3.85  77.78  23.53  5.74
PopRock     0.63  11.40  3.85  17.78  55.88  12.30
World       9.69  8.77  26.92  0.00  11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     289  5  0  0  3  18
Electronic  0  89  0  2  4  4
Jazz        2  3  19  0  1  10
MetalPunk   2  2  0  38  21  2
PopRock     0  12  5  4  61  11
World       27  3  2  1  12  77
Total       320  114  26  45  102  122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     90.31  4.39  0.00  0.00  2.94  14.75
Electronic  0.00  78.07  0.00  4.44  3.92  3.28
Jazz        0.63  2.63  73.08  0.00  0.98  8.20
MetalPunk   0.63  1.75  0.00  84.44  20.59  1.64
PopRock     0.00  10.53  19.23  8.89  59.80  9.02
World       8.44  2.63  7.69  2.22  11.76  63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of the
row-based and column-based modulation spectral feature vectors. SMMFCC3,
SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC,
OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that
the combined feature vector gets better classification performance than each
individual row-based or column-based feature vector. In particular, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA%) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     300  2  1  0  3  19
Electronic  0  86  0  1  7  5
Jazz        2  0  18  0  0  3
MetalPunk   1  4  0  35  18  2
PopRock     1  16  4  8  67  13
World       16  6  3  1  7  80
Total       320  114  26  45  102  122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     93.75  1.75  3.85  0.00  2.94  15.57
Electronic  0.00  75.44  0.00  2.22  6.86  4.10
Jazz        0.63  0.00  69.23  0.00  0.00  2.46
MetalPunk   0.31  3.51  0.00  77.78  17.65  1.64
PopRock     0.31  14.04  15.38  17.78  65.69  10.66
World       5.00  5.26  11.54  2.22  6.86  65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     300  0  0  0  1  13
Electronic  0  90  1  2  9  6
Jazz        0  0  21  0  0  4
MetalPunk   0  2  0  31  21  2
PopRock     0  11  3  10  64  10
World       20  11  1  2  7  87
Total       320  114  26  45  102  122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     93.75  0.00  0.00  0.00  0.98  10.66
Electronic  0.00  78.95  3.85  4.44  8.82  4.92
Jazz        0.00  0.00  80.77  0.00  0.00  3.28
MetalPunk   0.00  1.75  0.00  68.89  20.59  1.64
PopRock     0.00  9.65  11.54  22.22  62.75  8.20
World       6.25  9.65  3.85  4.44  6.86  71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     296  2  1  0  0  17
Electronic  1  91  0  1  4  3
Jazz        0  2  19  0  0  5
MetalPunk   0  2  1  34  20  8
PopRock     2  13  4  8  71  8
World       21  4  1  2  7  81
Total       320  114  26  45  102  122

(c) SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     92.50  1.75  3.85  0.00  0.00  13.93
Electronic  0.31  79.82  0.00  2.22  3.92  2.46
Jazz        0.00  1.75  73.08  0.00  0.00  4.10
MetalPunk   0.00  1.75  3.85  75.56  19.61  6.56
PopRock     0.63  11.40  15.38  17.78  69.61  6.56
World       6.56  3.51  3.85  4.44  6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     300  2  0  0  0  8
Electronic  2  95  0  2  7  9
Jazz        1  1  20  0  0  0
MetalPunk   0  0  0  35  10  1
PopRock     1  10  3  7  79  11
World       16  6  3  1  6  93
Total       320  114  26  45  102  122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic     93.75  1.75  0.00  0.00  0.00  6.56
Electronic  0.63  83.33  0.00  4.44  6.86  7.38
Jazz        0.31  0.88  76.92  0.00  0.00  0.00
MetalPunk   0.00  0.00  0.00  77.78  9.80  0.82
PopRock     0.31  8.77  11.54  15.56  77.45  9.02
World       5.00  5.26  11.54  2.22  5.88  76.23
Conventional methods use the energy of each modulation subband as the
feature value. However, we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature values. Table 37 shows the classification results of these two
approaches. From Table 37, we can see that using MSCs and MSVs gives
better performance than the conventional method when the row-based and
column-based modulation spectral feature vectors are combined. In this table,
SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based,
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the subband energy features

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification. The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value. For each spectral/cepstral feature set, a
modulation spectrogram is generated by collecting the modulation spectra of
all corresponding feature values. Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically spaced
modulation subband. Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features. The music database employed
in the ISMIR2004 Audio Description Contest, where all music tracks are classified
into six classes, was used for performance comparison. If the modulation spectral
features of MFCC, OSC, and NASE are combined together, the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox “Features and classifiers for the automatic classification of
musical audio signals” Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch “A hierarchical approach to automatic musical
genre classification” in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara “Music genre classification with taxonomy” in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet “Representing musical genre a state of the art”
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley “Beat tracking with a two state model” in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook “Pitch Histogram in Audio and
Symbolic Music Information Retrieval” in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen “A computationally efficient multipitch analysis
model” IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard “A unitary model of pitch perception” Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg “Robust speech recognition using
the modulation spectrogram” Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton “Modulation-scale analysis for
content identification” IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao “Automatic music classification and
summarization” IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao “Audio signal feature extraction and
classification using local discriminant bases” IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp 1236–1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 “A decision-theoretic generalization of
online learning and an application to boosting” Journal of Computer and System
Sciences 55(1) 119–139
17
Table 21 The range of each triangular band-pass filter
Filter number Frequency interval (Hz) 0 (0 200] 1 (100 300] 2 (200 400] 3 (300 500] 4 (400 600] 5 (500 700] 6 (600 800] 7 (700 900] 8 (800 1000] 9 (900 1149] 10 (1000 1320] 11 (1149 1516] 12 (1320 1741] 13 (1516 2000] 14 (1741 2297] 15 (2000 2639] 16 (2297 3031] 17 (2639 3482] 18 (3031 4000] 19 (3482 4595] 20 (4000 5278] 20 (4595 6063] 22 (5278 6964] 23 (6063 8000] 24 (6964 9190]
212 Octave-based Spectral Contrast (OSC)
OSC was developed to represent the spectral characteristics of a music signal It
considers the spectral peak and valley in each subband independently In general
spectral peaks correspond to harmonic components and spectral valleys the
non-harmonic components or noise in music signals Therefore the difference
between spectral peaks and spectral valleys will reflect the spectral contrast
distribution Fig 22 shows the block diagram for extracting the OSC feature The
detailed steps will be described below
18
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (9)
where B is the number of subbands Ibl and Ibh denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter
Ai[k] is the squared amplitude of Xi[k] that is |][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
19
)1log()(1
sum=
=bN
iib
b
MN
bPeakα
α (11)
)1log()(1
1sum=
+minus=b
b
N
iiNb
b
MN
bValleyα
α (12)
where α is a neighborhood factor (α is 02 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
)( )()( bValleybPeakbSC minus= (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0) hellip Valley(B-1) SC(0) hellip SC(B-1)]T (14)
Input Signal
Framing
Octave scale filtering
PeakValley Selection
Spectral Contrast
OSC
FFT
Fig 22 The flowchart for computing OSC
20
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 441 kHz)
Filter number Frequency interval (Hz)0 [0 0] 1 (0 100] 2 (100 200] 3 (200 400] 4 (400 800] 5 (800 1600] 6 (1600 3200] 7 (3200 6400] 8 (6400 12800] 9 (12800 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum notated X(k) 1 le k le N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
20|)(|2
2 0|)(|1
)(2
2
⎪⎪⎩
⎪⎪⎨
⎧
ltltsdot
=sdot=
NkkXEN
NkkXENkP
w
w (15)
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a
spectrum of 8 octave interval (see Fig24) The NASE scale filtering
operation can be described as follows(see Table 23)
)()(
sum=
=hb
lb
I
Ikii kPbASE 120 0 minusleleltle NkBb
(17)
where B is the number of logarithmic subbands within the frequency range
[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of
the frequency subbands ranging from 116 of an octave to 8 octaves(B=16
r=12 in the study)
(18) 34 octaves 2 leleminus= jr j
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
22
spectrum coefficients within this subband
(20) 10 )()(
+lele= sum=
BbkPbASEhb
lb
I
Ik
Each ASE coefficient is then converted to the decibel scale
10 ))((log 10)( 10 +lele= BbbASEbASEdB (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
10 )()( +lele= BbR
bASEbNASE dB (22)
where the RMS-norm gain value R is defined as
))((1
0
2sum+
=
=B
bdB bASER (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R NASE(0) NASE(1) hellip NASE(B+1)]T (24)
23
Framing
Input Signal
Windowing
FFT
Normalized Audio Spectral Envelope
NASE
Subband Decomposition
Fig 23 The flowchart for computing NASE
625 125 250 500 1K 2K 4K 8K 16K
884 1768 3536 7071 14142 28284 56569 113137
1 coeff 16 coeffs 1 coeff
loEdge hiEdge
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 12
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
u_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - u_{MSC-row}^{NASE}(d) \right)^{2} \right)^{1/2}    (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - u_{MSV-row}^{NASE}(d) \right)^{2} \right)^{1/2}    (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
f_{row}^{NASE} = [ u_{MSC-row}^{NASE}(0),\ \sigma_{MSC-row}^{NASE}(0),\ u_{MSV-row}^{NASE}(0),\ \sigma_{MSV-row}^{NASE}(0),\ \ldots,\ u_{MSC-row}^{NASE}(D-1),\ \sigma_{MSC-row}^{NASE}(D-1),\ u_{MSV-row}^{NASE}(D-1),\ \sigma_{MSV-row}^{NASE}(D-1) ]^{T}    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows
u_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - u_{MSC-col}^{NASE}(j) \right)^{2} \right)^{1/2}    (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - u_{MSV-col}^{NASE}(j) \right)^{2} \right)^{1/2}    (70)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{NASE} = [ u_{MSC-col}^{NASE}(0),\ \sigma_{MSC-col}^{NASE}(0),\ u_{MSV-col}^{NASE}(0),\ \sigma_{MSV-col}^{NASE}(0),\ \ldots,\ u_{MSC-col}^{NASE}(J-1),\ \sigma_{MSC-col}^{NASE}(J-1),\ u_{MSV-col}^{NASE}(J-1),\ \sigma_{MSV-col}^{NASE}(J-1) ]^{T}    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [ (f_{row}^{NASE})^{T}\ (f_{col}^{NASE})^{T} ]^{T}    (72)

In summary, the row-based feature vector is of size 4D = 4×19 = 76 and the column-based feature vector is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 the row-based modulation spectral feature values: for each feature dimension (row of the MSC/MSV matrices) the mean and standard deviation are taken across the modulation frequency subbands of a texture window
Fig 29 the column-based modulation spectral feature values: for each modulation subband (column of the MSC/MSV matrices) the mean and standard deviation are taken across the feature dimensions of a texture window
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_{c} = \frac{1}{N_{c}} \sum_{n=1}^{N_{c}} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_{c} is the representative feature vector for the c-th music genre, and N_{c} is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_{c}:
\hat{f}_{c}(m) = \frac{f_{c}(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},\quad 1 \le c \le C    (74)

where C is the number of classes, \hat{f}_{c}(m) is the m-th normalized feature value, f_{c}(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:
f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_{c}} f_{c,j}(m),\quad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_{c}} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
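A minimal sketch of the linear normalization of Eqs. (74)-(75) is given below (NumPy; the function and variable names are illustrative, and the small epsilon guarding against a zero dynamic range is an added assumption not discussed in the thesis).

```python
import numpy as np

def fit_minmax(train_features):
    # train_features: (N, M) matrix of all training feature vectors (Eq. 75)
    return train_features.min(axis=0), train_features.max(axis=0)

def minmax_normalize(f, f_min, f_max, eps=1e-12):
    # Eq. (74): map every feature value into [0, 1] using the training-set range
    return (f - f_min) / (f_max - f_min + eps)

# Usage: normalize a feature vector with the range learned from the training set
# f_min, f_max = fit_minmax(all_training_vectors)
# f_hat = minmax_normalize(f, f_min, f_max)
```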
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.
Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_{c}} (x_{c,n} - \bar{x}_{c})(x_{c,n} - \bar{x}_{c})^{T}    (76)
where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_{c} is the mean vector of class c, C is the total number of music classes, and N_{c} is the number of training vectors labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_{c} (\bar{x}_{c} - \bar{x})(\bar{x}_{c} - \bar{x})^{T}    (77)
where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:
J_F(A) = tr\left( (A^{T} S_W A)^{-1} (A^{T} S_B A) \right)    (78)
From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of S_W and Λ the diagonal matrix formed by the corresponding eigenvalues; thus S_W Φ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

x_{w} = (\Phi \Lambda^{-1/2})^{T} x    (79)
It can be shown that the whitened within-class scatter matrix S_{W}^{w} = (\Phi \Lambda^{-1/2})^{T} S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_{B}^{w} = (\Phi \Lambda^{-1/2})^{T} S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B}^{w}. Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as
A_{WLDA} = \Phi \Lambda^{-1/2} \Psi    (80)
A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by
y = A_{WLDA}^{T} x    (81)
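The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows (NumPy; names are illustrative, and a real implementation would also have to guard against near-zero eigenvalues of S_W, which is not addressed here).

```python
import numpy as np

def whitened_lda(X, labels, h):
    # X: (N, H) training matrix, labels: (N,) class indices, h <= C-1 output dimensions
    classes = np.unique(labels)
    x_bar = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[labels == c]
        xc_bar = Xc.mean(axis=0)
        Sw += (Xc - xc_bar).T @ (Xc - xc_bar)                     # Eq. (76)
        Sb += len(Xc) * np.outer(xc_bar - x_bar, xc_bar - x_bar)  # Eq. (77)

    lam, Phi = np.linalg.eigh(Sw)            # Sw = Phi diag(lam) Phi^T
    W = Phi @ np.diag(1.0 / np.sqrt(lam))    # whitening matrix Phi Lambda^(-1/2)
    Sb_w = W.T @ Sb @ W                      # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    order = np.argsort(lam_b)[::-1][:h]      # keep the h largest eigenvalues (h <= C-1)
    return W @ Psi[:, order]                 # A_WLDA, Eq. (80); y = A_WLDA^T x, Eq. (81)
```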
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA
transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:
\bar{y}_{c} = \frac{1}{N_{c}} \sum_{n=1}^{N_{c}} y_{c,n}    (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_{c} is the representative feature vector of the c-th music genre, and N_{c} is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_{c})    (83)
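A sketch of the nearest centroid classifier of Eqs. (82)-(83) is shown below (NumPy; the function names are illustrative).

```python
import numpy as np

def genre_centroids(Y, labels):
    # Y: (N, h) whitened LDA transformed training vectors; one centroid per genre, Eq. (82)
    return {c: Y[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(y, centroids):
    # Eq. (83): return the genre whose centroid is closest to y in Euclidean distance
    return min(centroids, key=lambda c: np.linalg.norm(y - centroids[c]))
```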
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of JazzBlue, 45/45 tracks of MetalPunk, 101/102 tracks of RockPop, and 122/122 tracks of the World music genre.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_{c} \cdot CA_{c}    (84)
where P_{c} is the probability of appearance of the c-th music genre and CA_{c} is the classification accuracy for the c-th music genre.
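For example, taking the per-genre test-set sizes given above as P_c and the per-genre accuracies from the diagonal of Table 36(d), Eq. (84) reproduces the overall accuracy of the proposed method (a small illustrative computation, not part of the thesis):

```python
# Number of test tracks per genre and per-genre accuracy (diagonal of Table 36(d))
counts = {'Classical': 320, 'Electronic': 114, 'JazzBlue': 26,
          'MetalPunk': 45, 'RockPop': 102, 'World': 122}
acc = {'Classical': 0.9375, 'Electronic': 0.8333, 'JazzBlue': 0.7692,
       'MetalPunk': 0.7778, 'RockPop': 0.7745, 'World': 0.7623}
total = sum(counts.values())                            # 729 test tracks
CA = sum(counts[c] / total * acc[c] for c in counts)    # Eq. (84), P_c = counts/total
print(round(100 * CA, 2))                               # approximately 85.32
```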
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA %) for row-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC1                        77.50
SMOSC1                         79.15
SMASE1                         77.78
SMMFCC1+SMOSC1+SMASE1          84.64
Table 32 Confusion matrices of row-based modulation spectral feature vectors (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1. In every matrix the column order is Classic, Electronic, Jazz, MetalPunk, PopRock, World (actual genre) and each row gives the classified genre; the first matrix of each pair lists track counts, the second the percentage of each actual genre.

(a) SMMFCC1 (number of tracks)
Classic 275 0 2 0 1 19
Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4
MetalPunk 2 3 0 36 20 4
PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75
Total 320 114 26 45 102 122

(a) SMMFCC1 (%)
Classic 85.94 0.00 7.69 0.00 0.98 15.57
Electronic 0.00 79.82 0.00 2.22 6.86 4.92
Jazz 1.88 0.00 69.23 0.00 0.00 3.28
MetalPunk 0.63 2.63 0.00 80.00 19.61 3.28
PopRock 1.25 10.53 19.23 17.78 68.63 11.48
World 10.31 7.02 3.85 0.00 3.92 61.48

(b) SMOSC1 (number of tracks)
Classic 292 1 1 0 2 10
Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6
MetalPunk 0 5 0 32 21 3
PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84
Total 320 114 26 45 102 122

(b) SMOSC1 (%)
Classic 91.25 0.88 3.85 0.00 1.96 8.20
Electronic 0.31 78.07 3.85 4.44 10.78 9.02
Jazz 1.25 0.00 73.08 2.22 0.98 4.92
MetalPunk 0.00 4.39 0.00 71.11 20.59 2.46
PopRock 0.00 11.40 11.54 22.22 59.80 6.56
World 7.19 5.26 7.69 0.00 5.88 68.85

(c) SMASE1 (number of tracks)
Classic 286 3 1 0 3 18
Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9
MetalPunk 0 4 1 36 18 4
PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73
Total 320 114 26 45 102 122

(c) SMASE1 (%)
Classic 89.38 2.63 3.85 0.00 2.94 14.75
Electronic 0.00 76.32 3.85 2.22 8.82 4.10
Jazz 1.56 3.51 65.38 0.00 0.00 7.38
MetalPunk 0.00 3.51 3.85 80.00 17.65 3.28
PopRock 0.31 8.77 11.54 15.56 66.67 10.66
World 8.75 5.26 11.54 2.22 3.92 59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
Classic 300 0 1 0 0 9
Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1
MetalPunk 0 1 0 34 8 1
PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86
Total 320 114 26 45 102 122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
Classic 93.75 0.00 3.85 0.00 0.00 7.38
Electronic 0.00 84.21 3.85 2.22 8.82 7.38
Jazz 0.63 0.88 80.77 0.00 0.00 0.82
MetalPunk 0.00 0.88 0.00 75.56 7.84 0.82
PopRock 0.31 7.89 7.69 20.00 78.43 13.11
World 5.31 6.14 3.85 2.22 4.90 70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again achieves the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA %) for column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                        70.64
SMOSC2                         68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2          78.60
Table 34 Confusion matrices of column-based modulation spectral feature vectors (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+SMOSC2+SMASE2. Column order and layout are the same as in Table 32.

(a) SMMFCC2 (number of tracks)
Classic 272 1 1 0 6 22
Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19
MetalPunk 2 7 0 39 30 4
PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54
Total 320 114 26 45 102 122

(a) SMMFCC2 (%)
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26

(b) SMOSC2 (number of tracks)
Classic 262 2 0 0 3 33
Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20
MetalPunk 1 5 0 33 21 2
PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51
Total 320 114 26 45 102 122

(b) SMOSC2 (%)
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80

(c) SMASE2 (number of tracks)
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122

(c) SMASE2 (%)
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3. Column order and layout are the same as in Table 32.

(a) SMMFCC3 (number of tracks)
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) SMMFCC3 (%)
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57

(b) SMOSC3 (number of tracks)
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) SMOSC3 (%)
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) SMASE3 (number of tracks)
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) SMASE3 (%)
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy for each feature set

Feature Set                    MSCs & MSVs    MSE
SMMFCC1                        77.50          72.02
SMMFCC2                        70.64          69.82
SMMFCC3                        80.38          79.15
SMOSC1                         79.15          77.50
SMOSC2                         68.59          70.51
SMOSC3                         81.34          80.11
SMASE1                         77.78          76.41
SMASE2                         71.74          71.06
SMASE3                         81.21          79.15
SMMFCC1+SMOSC1+SMASE1          84.64          85.08
SMMFCC2+SMOSC2+SMASE2          78.60          79.01
SMMFCC3+SMOSC3+SMASE3          85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value. For each spectral/cepstral feature set, a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A comparative study on content-based music genre classification Proceedings of ACM Conf on Research and Development in Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification by spectral contrast feature Proceedings of the IEEE International Conference on Multimedia & Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of musical audio signals" Proceedings of International Conference on Music Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005) 308-315
[6] M F McKinney J Breebaart Features for audio and music classification Proceedings of the 4th International Conference on Music Information Retrieval 2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres: a state of the art Journal of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for music genre classification IEEE Trans on Audio Speech and Language Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic transformations for music genre classification Proceedings of the 6th International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval 2003 pp 102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds: timbre models for analysis and retrieval of music signals IEEE Transactions on Multimedia Vol 7 Issue 6 pp 1028-1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp 8-11 September 2003
[14] J G A Barbedo and A Lopes Research article: automatic genre classification of musical signals EURASIP Journal on Advances in Signal Processing Vol 2007 pp 1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200 March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre: a state of the art" Journal of New Music Research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral basis representation IEEE Trans on Circuits and Systems for Video Technology 14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in Proc Int Conf on Acoustics Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical performance using low-level audio feature IEEE Trans on Speech and Audio Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch histogram in audio and symbolic music information retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp 708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music content: a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp 133-141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using the modulation spectrogram" Speech Communication Vol 25 No 1 pp 117-132 1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for content identification" IEEE Transactions on Signal Processing Vol 52 No 10 pp 3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation spectrum analysis and its application to music emotion classification in 2006 IEEE International Conference on Multimedia and Expo (ICME) pp 1085-1088 July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond: audio content indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New York: Wiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and summarization" IEEE Transactions on Speech and Audio Processing Vol 13 No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification and retrieval using joint time-frequency analysis in 2004 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 pp V-665-668 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and classification using local discriminant bases" IEEE Transactions on Audio Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature selection strategies and ensemble techniques for classifying music Proceedings of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire "A decision-theoretic generalization of on-line learning and an application to boosting" Journal of Computer and System Sciences 55 (1) (1997) 119-139
18
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and FFT is then to obtain the corresponding spectrum of each frame
Step 2 Octave Scale Filtering
This spectrum is then divided into a number of subbands by the set of octave
scale filters shown in Table 22 The octave scale filtering operation can be
described as follows
][)(
sum=
=hb
lb
I
Ikii kAbE 120 0 minusleleltle NkBb (9)
where B is the number of subbands Ibl and Ibh denote respectively the
low-frequency index and high-frequency index of the b-th band-pass filter
Ai[k] is the squared amplitude of Xi[k] that is |][|][ 2kXkA ii =
Ibl and Ibh are given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (10)
where fs is the sampling frequency fbl and fbh are the low frequency and high
frequency of the b-th band-pass filter
Step 3 Peak Valley Selection
Let (Mb1 Mb2 hellip MbNb) denote the magnitude spectrum within the b-th
subband Nb is the number of FFT frequency bins in the b-th subband
Without loss of generality let the magnitude spectrum be sorted in a
decreasing order that is Mb1 ge Mb2 ge hellip ge MbNb The spectral peak and
spectral valley in the b-th subband are then estimated as follows
19
)1log()(1
sum=
=bN
iib
b
MN
bPeakα
α (11)
)1log()(1
1sum=
+minus=b
b
N
iiNb
b
MN
bValleyα
α (12)
where α is a neighborhood factor (α is 02 in this study) The spectral
contrast is given by the difference between the spectral peak and the spectral
valley
)( )()( bValleybPeakbSC minus= (13)
The feature vector of an audio frame consists of the spectral contrasts and the
spectral valleys of all subbands Thus the OSC feature vector of an audio frame can
be represented as follows
xOSC = [Valley(0) hellip Valley(B-1) SC(0) hellip SC(B-1)]T (14)
Input Signal
Framing
Octave scale filtering
PeakValley Selection
Spectral Contrast
OSC
FFT
Fig 22 The flowchart for computing OSC
20
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 441 kHz)
Filter number Frequency interval (Hz)0 [0 0] 1 (0 100] 2 (100 200] 3 (200 400] 4 (400 800] 5 (800 1600] 6 (1600 3200] 7 (3200 6400] 8 (6400 12800] 9 (12800 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follow
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum notated X(k) 1 le k le N
where N is the size of FFT The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k)
20|)(|2
2 0|)(|1
)(2
2
⎪⎪⎩
⎪⎪⎨
⎧
ltltsdot
=sdot=
NkkXEN
NkkXENkP
w
w (15)
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a
spectrum of 8 octave interval (see Fig24) The NASE scale filtering
operation can be described as follows(see Table 23)
)()(
sum=
=hb
lb
I
Ikii kPbASE 120 0 minusleleltle NkBb
(17)
where B is the number of logarithmic subbands within the frequency range
[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of
the frequency subbands ranging from 116 of an octave to 8 octaves(B=16
r=12 in the study)
(18) 34 octaves 2 leleminus= jr j
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
22
spectrum coefficients within this subband
(20) 10 )()(
+lele= sum=
BbkPbASEhb
lb
I
Ik
Each ASE coefficient is then converted to the decibel scale
10 ))((log 10)( 10 +lele= BbbASEbASEdB (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
10 )()( +lele= BbR
bASEbNASE dB (22)
where the RMS-norm gain value R is defined as
))((1
0
2sum+
=
=B
bdB bASER (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R NASE(0) NASE(1) hellip NASE(B+1)]T (24)
23
Framing
Input Signal
Windowing
FFT
Normalized Audio Spectral Envelope
NASE
Subband Decomposition
Fig 23 The flowchart for computing NASE
625 125 250 500 1K 2K 4K 8K 16K
884 1768 3536 7071 14142 28284 56569 113137
1 coeff 16 coeffs 1 coeff
loEdge hiEdge
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 12
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE, respectively. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again achieves the best performance. Table 3.4 shows the corresponding confusion matrices.
Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Rows give the classified genre and columns the actual genre, in the order Classic, Electronic, Jazz, MetalPunk, PopRock, World; each entry is the number of tracks, with the column-wise percentage in parentheses.

(a) SMMFCC2
Classic      272 (85.00)    1 (0.88)     1 (3.85)     0 (0.00)     6 (5.88)    22 (18.03)
Electronic     0 (0.00)    84 (73.68)    0 (0.00)     2 (4.44)     8 (7.84)     4 (3.28)
Jazz          13 (4.06)     1 (0.88)    19 (73.08)    1 (2.22)     2 (1.96)    19 (15.57)
MetalPunk      2 (0.63)     7 (6.14)     0 (0.00)    39 (86.67)   30 (29.41)    4 (3.28)
PopRock        0 (0.00)    11 (9.65)     3 (11.54)    3 (6.67)    47 (46.08)   19 (15.57)
World         33 (10.31)   10 (8.77)     3 (11.54)    0 (0.00)     9 (8.82)    54 (44.26)
Total        320           114           26           45          102          122

(b) SMOSC2
Classic      262 (81.88)    2 (1.75)     0 (0.00)     0 (0.00)     3 (2.94)    33 (27.05)
Electronic     0 (0.00)    83 (72.81)    0 (0.00)     1 (2.22)     9 (8.82)     6 (4.92)
Jazz          17 (5.31)     1 (0.88)    20 (76.92)    0 (0.00)     6 (5.88)    20 (16.39)
MetalPunk      1 (0.31)     5 (4.39)     0 (0.00)    33 (73.33)   21 (20.59)    2 (1.64)
PopRock        0 (0.00)    17 (14.91)    4 (15.38)   10 (22.22)   51 (50.00)   10 (8.20)
World         40 (12.50)    6 (5.26)     2 (7.69)     1 (2.22)    12 (11.76)   51 (41.80)
Total        320           114           26           45          102          122

(c) SMASE2
Classic      277 (86.56)    0 (0.00)     0 (0.00)     0 (0.00)     2 (1.96)    29 (23.77)
Electronic     0 (0.00)    83 (72.81)    0 (0.00)     1 (2.22)     5 (4.90)     2 (1.64)
Jazz           9 (2.81)     3 (2.63)    17 (65.38)    1 (2.22)     2 (1.96)    15 (12.30)
MetalPunk      1 (0.31)     5 (4.39)     1 (3.85)    35 (77.78)   24 (23.53)    7 (5.74)
PopRock        2 (0.63)    13 (11.40)    1 (3.85)     8 (17.78)   57 (55.88)   15 (12.30)
World         31 (9.69)    10 (8.77)     7 (26.92)    0 (0.00)    12 (11.76)   54 (44.26)
Total        320           114           26           45          102          122

(d) SMMFCC2+SMOSC2+SMASE2
Classic      289 (90.31)    5 (4.39)     0 (0.00)     0 (0.00)     3 (2.94)    18 (14.75)
Electronic     0 (0.00)    89 (78.07)    0 (0.00)     2 (4.44)     4 (3.92)     4 (3.28)
Jazz           2 (0.63)     3 (2.63)    19 (73.08)    0 (0.00)     1 (0.98)    10 (8.20)
MetalPunk      2 (0.63)     2 (1.75)     0 (0.00)    38 (84.44)   21 (20.59)    2 (1.64)
PopRock        0 (0.00)    12 (10.53)    5 (19.23)    4 (8.89)    61 (59.80)   11 (9.02)
World         27 (8.44)     3 (2.63)     2 (7.69)     1 (2.22)    12 (11.76)   77 (63.11)
Total        320           114           26           45          102          122
3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote the combined feature vectors of MFCC, OSC, and NASE, respectively. Comparing this table with Table 3.1 and Table 3.3, we can see that each combined feature vector gives better classification performance than the corresponding row-based or column-based feature vector alone. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.
Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 3.6 Confusion matrices of the combined row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Rows give the classified genre and columns the actual genre, in the order Classic, Electronic, Jazz, MetalPunk, PopRock, World; each entry is the number of tracks, with the column-wise percentage in parentheses.

(a) SMMFCC3
Classic      300 (93.75)    2 (1.75)     1 (3.85)     0 (0.00)     3 (2.94)    19 (15.57)
Electronic     0 (0.00)    86 (75.44)    0 (0.00)     1 (2.22)     7 (6.86)     5 (4.10)
Jazz           2 (0.63)     0 (0.00)    18 (69.23)    0 (0.00)     0 (0.00)     3 (2.46)
MetalPunk      1 (0.31)     4 (3.51)     0 (0.00)    35 (77.78)   18 (17.65)    2 (1.64)
PopRock        1 (0.31)    16 (14.04)    4 (15.38)    8 (17.78)   67 (65.69)   13 (10.66)
World         16 (5.00)     6 (5.26)     3 (11.54)    1 (2.22)     7 (6.86)    80 (65.57)
Total        320           114           26           45          102          122

(b) SMOSC3
Classic      300 (93.75)    0 (0.00)     0 (0.00)     0 (0.00)     1 (0.98)    13 (10.66)
Electronic     0 (0.00)    90 (78.95)    1 (3.85)     2 (4.44)     9 (8.82)     6 (4.92)
Jazz           0 (0.00)     0 (0.00)    21 (80.77)    0 (0.00)     0 (0.00)     4 (3.28)
MetalPunk      0 (0.00)     2 (1.75)     0 (0.00)    31 (68.89)   21 (20.59)    2 (1.64)
PopRock        0 (0.00)    11 (9.65)     3 (11.54)   10 (22.22)   64 (62.75)   10 (8.20)
World         20 (6.25)    11 (9.65)     1 (3.85)     2 (4.44)     7 (6.86)    87 (71.31)
Total        320           114           26           45          102          122

(c) SMASE3
Classic      296 (92.50)    2 (1.75)     1 (3.85)     0 (0.00)     0 (0.00)    17 (13.93)
Electronic     1 (0.31)    91 (79.82)    0 (0.00)     1 (2.22)     4 (3.92)     3 (2.46)
Jazz           0 (0.00)     2 (1.75)    19 (73.08)    0 (0.00)     0 (0.00)     5 (4.10)
MetalPunk      0 (0.00)     2 (1.75)     1 (3.85)    34 (75.56)   20 (19.61)    8 (6.56)
PopRock        2 (0.63)    13 (11.40)    4 (15.38)    8 (17.78)   71 (69.61)    8 (6.56)
World         21 (6.56)     4 (3.51)     1 (3.85)     2 (4.44)     7 (6.86)    81 (66.39)
Total        320           114           26           45          102          122

(d) SMMFCC3+SMOSC3+SMASE3
Classic      300 (93.75)    2 (1.75)     0 (0.00)     0 (0.00)     0 (0.00)     8 (6.56)
Electronic     2 (0.63)    95 (83.33)    0 (0.00)     2 (4.44)     7 (6.86)     9 (7.38)
Jazz           1 (0.31)     1 (0.88)    20 (76.92)    0 (0.00)     0 (0.00)     0 (0.00)
MetalPunk      0 (0.00)     0 (0.00)     0 (0.00)    35 (77.78)   10 (9.80)     1 (0.82)
PopRock        1 (0.31)    10 (8.77)     3 (11.54)    7 (15.56)   79 (77.45)   11 (9.02)
World         16 (5.00)     6 (5.26)     3 (11.54)    1 (2.22)     6 (5.88)    93 (76.23)
Total        320           114           26           45          102          122
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 compares the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 3.7 Comparison of the averaged classification accuracy (%) when MSCs & MSVs or the modulation subband energy (MSE) is used as the feature value

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                             77.50      72.02
SMMFCC2                             70.64      69.82
SMMFCC3                             80.38      79.15
SMOSC1                              79.15      77.50
SMOSC2                              68.59      70.51
SMOSC3                              81.34      80.11
SMASE1                              77.78      76.41
SMASE2                              71.74      71.06
SMASE3                              81.21      79.15
SMMFCC1+SMOSC1+SMASE1               84.64      85.08
SMMFCC2+SMOSC2+SMASE2               78.60      79.01
SMMFCC3+SMOSC3+SMASE3               85.32      85.19
Chapter 4
Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, The way it sounds: timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia, vol. 7, issue 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, Research article: automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine, vol. 23, issue 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun., vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, issue 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.
Peak(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b,i} \right)                    (11)

Valley(b) = \log\left( \frac{1}{\alpha N_b} \sum_{i=1}^{\alpha N_b} M_{b, N_b - i + 1} \right)       (12)

where α is a neighborhood factor (α = 0.2 in this study). The spectral contrast is given by the difference between the spectral peak and the spectral valley:

SC(b) = Peak(b) - Valley(b)                                                                          (13)
The feature vector of an audio frame consists of the spectral contrasts and the spectral valleys of all subbands. Thus, the OSC feature vector of an audio frame can be represented as follows:

x_OSC = [Valley(0), ..., Valley(B-1), SC(0), ..., SC(B-1)]^T                                         (14)
Fig. 2.2 The flowchart for computing OSC (input signal → framing → FFT → octave-scale filtering → peak/valley selection → spectral contrast → OSC)
Table 2.2 The range of each octave-scale band-pass filter (sampling rate = 44.1 kHz)

Filter number    Frequency interval (Hz)
0                [0, 0]
1                (0, 100]
2                (100, 200]
3                (200, 400]
4                (400, 800]
5                (800, 1600]
6                (1600, 3200]
7                (3200, 6400]
8                (6400, 12800]
9                (12800, 22050)
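The OSC computation of Eqs. (11)-(14) with the octave bands of Table 2.2 can be sketched in Python as follows. This is an illustrative sketch only: the Hamming window, the small constant added before taking the logarithm, and the handling of the DC-only band are assumptions, not details taken from the thesis.

    import numpy as np

    # Octave-scale band edges in Hz following Table 2.2 (fs = 44.1 kHz); the first
    # interval is a placeholder so that the DC-only band [0, 0] keeps just bin 0.
    OSC_BANDS = [(-1.0, 0.0), (0.0, 100.0), (100.0, 200.0), (200.0, 400.0),
                 (400.0, 800.0), (800.0, 1600.0), (1600.0, 3200.0),
                 (3200.0, 6400.0), (6400.0, 12800.0), (12800.0, 22050.0)]

    def osc_frame(frame, fs=44100, alpha=0.2):
        # Peak, valley and contrast of each octave subband for one frame (Eqs. 11-14).
        n_fft = len(frame)
        mag = np.abs(np.fft.rfft(frame * np.hamming(n_fft)))
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
        valleys, contrasts = [], []
        for lo, hi in OSC_BANDS:
            band = np.sort(mag[(freqs > lo) & (freqs <= hi)])[::-1]   # M_{b,i}, descending
            if band.size == 0:                                        # guard for very short frames
                valleys.append(0.0); contrasts.append(0.0); continue
            n = max(1, int(round(alpha * band.size)))                 # alpha * N_b neighbours
            peak = np.log(np.mean(band[:n]) + 1e-12)                  # Eq. 11 (constant avoids log(0))
            valley = np.log(np.mean(band[-n:]) + 1e-12)               # Eq. 12
            valleys.append(valley)
            contrasts.append(peak - valley)                           # Eq. 13
        return np.array(valleys + contrasts)                          # Eq. 14: [Valley(0..B-1), SC(0..B-1)]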
2.1.3 Normalized Audio Spectral Envelope (NASE)

NASE was defined in MPEG-7 for sound classification. The NASE descriptor provides a representation of the power spectrum of each audio frame; each component of the NASE feature vector represents the normalized magnitude of a particular frequency subband. Fig. 2.3 shows the block diagram for extracting the NASE feature. For a given music piece, the main steps for computing NASE are described as follows.

Step 1: Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped frames, and each audio frame is multiplied by a Hamming window function and analyzed using the FFT to derive its spectrum, denoted X(k), 1 ≤ k ≤ N, where N is the size of the FFT. The power spectrum is defined as the normalized squared magnitude of the DFT spectrum X(k):
P(k) = \begin{cases} \frac{1}{E_w \cdot N} |X(k)|^2, & k = 0, N/2 \\ \frac{2}{E_w \cdot N} |X(k)|^2, & 0 < k < N/2 \end{cases}                (15)
where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = \sum_{n=0}^{N_w - 1} |w(n)|^2                                                                  (16)
Step 2: Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands spanning from 62.5 Hz ("loEdge") to 16 kHz ("hiEdge"), i.e., a range of 8 octaves (see Fig. 2.4). The NASE-scale filtering operation can be described as follows (see Table 2.3):

ASE_i(b) = \sum_{k=I_b^l}^{I_b^h} P_i(k),    0 \le b < B,  0 \le k \le N/2 - 1                       (17)

where B is the number of logarithmic subbands within the frequency range [loEdge, hiEdge], given by B = 8/r, and r is the spectral resolution of the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16 and r = 1/2 in this study):

r = 2^j octaves,    -4 \le j \le 3                                                                   (18)
I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th band-pass filter, given as

I_b^l = \frac{f_b^l}{f_s} N,    I_b^h = \frac{f_b^h}{f_s} N                                          (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and high frequency of the b-th band-pass filter.
Step 3: Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of the power spectrum coefficients within this subband:

ASE(b) = \sum_{k=I_b^l}^{I_b^h} P(k),    0 \le b \le B+1                                             (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_{dB}(b) = 10 \log_{10}(ASE(b)),    0 \le b \le B+1                                               (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = \frac{ASE_{dB}(b)}{R},    0 \le b \le B+1                                                  (22)
where the RMS-norm gain value R is defined as

R = \sqrt{ \sum_{b=0}^{B+1} (ASE_{dB}(b))^2 }                                                        (23)

In MPEG-7, the ASE coefficients consist of one coefficient representing the power between 0 Hz and loEdge, a series of coefficients representing the power in logarithmically spaced bands between loEdge and hiEdge, a coefficient representing the power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension of NASE is B+3, and the NASE feature vector of an audio frame can be represented as follows:

x_NASE = [R, NASE(0), NASE(1), ..., NASE(B+1)]^T                                                     (24)
Fig. 2.3 The flowchart for computing NASE (input signal → framing → windowing → FFT → subband decomposition → normalized audio spectral envelope → NASE)
Fig. 2.4 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 logarithmically spaced coefficients between 62.5 Hz and hiEdge = 16 kHz, and one coefficient above hiEdge)
Table 2.3 The range of each normalized audio spectral envelope band-pass filter

Filter number    Frequency interval (Hz)
0                (0, 62]
1                (62, 88]
2                (88, 125]
3                (125, 176]
4                (176, 250]
5                (250, 353]
6                (353, 500]
7                (500, 707]
8                (707, 1000]
9                (1000, 1414]
10               (1414, 2000]
11               (2000, 2828]
12               (2828, 4000]
13               (4000, 5656]
14               (5656, 8000]
15               (8000, 11313]
16               (11313, 16000]
17               (16000, 22050]
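A rough Python sketch of the per-frame NASE extraction in Eqs. (15)-(24) is given below. The function name, the inclusion of the DC bin in the lowest band, and the small constant guarding the logarithm are assumptions made for illustration; the band edges follow Fig. 2.4 and Table 2.3.

    import numpy as np

    def nase_frame(frame, fs=44100, lo_edge=62.5, r=0.5):
        # NASE coefficients of one audio frame (Eqs. 15-24); hiEdge = loEdge * 2^8 = 16 kHz.
        nw = len(frame)
        w = np.hamming(nw)
        ew = np.sum(w ** 2)                                    # window energy E_w (Eq. 16)
        spec = np.fft.fft(frame * w)
        half = nw // 2
        p = (np.abs(spec[:half + 1]) ** 2) / (ew * nw)         # power spectrum (Eq. 15)
        p[1:half] *= 2.0
        freqs = np.arange(half + 1) * fs / nw
        b = int(round(8 / r))                                  # B = 8/r logarithmic subbands
        inner = lo_edge * (2.0 ** (r * np.arange(b + 1)))      # 62.5 Hz ... 16 kHz edges
        edges = np.concatenate(([-1.0], inner, [fs / 2.0]))    # add below-loEdge and above-hiEdge bands
        ase = np.array([p[(freqs > lo) & (freqs <= hi)].sum() + 1e-12
                        for lo, hi in zip(edges[:-1], edges[1:])])    # Eq. 20
        ase_db = 10.0 * np.log10(ase)                          # Eq. 21
        r_gain = np.sqrt(np.sum(ase_db ** 2))                  # RMS-norm gain R (Eq. 23)
        return np.concatenate(([r_gain], ase_db / r_gain))     # Eq. 24: [R, NASE(0..B+1)]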
2.1.4 Modulation Spectral Analysis

MFCC, OSC, and NASE capture only short-term, frame-based characteristics of audio signals. In order to capture the time-varying behavior of music signals, we apply modulation spectral analysis to the MFCC, OSC, and NASE trajectories to observe the variations of the sound.
2.1.4.1 Modulation Spectral Contrast of MFCC (MMFCC)

To observe the time-varying behavior of MFCC, modulation spectral analysis is applied to the MFCC trajectories. Fig. 2.5 shows the flowchart for extracting MMFCC, and the detailed steps are described below.

Step 1: Framing and MFCC Extraction
Given an input music signal, divide the whole signal into successive overlapped frames and extract the MFCC coefficients of each frame.

Step 2: Modulation Spectrum Analysis
Let MFCC_i(l), 0 \le l < L, be the l-th MFCC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times (W/2) + n}(l) \, e^{-j \frac{2\pi}{W} m n},    0 \le m < W,  0 \le l < L          (25)

where M_t(m, l) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and l is the MFCC coefficient index. In this study W = 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, l)|,    0 \le m < W,  0 \le l < L                                   (26)

where T is the total number of texture windows in the music track.
Step 3: Contrast/Valley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{MFCC}(j, l) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l)                             (27)

MSV^{MFCC}(j, l) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{MFCC}(m, l)                             (28)

where \Phi_j^l and \Phi_j^h are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)                                               (29)

As a result, all MSCs (or MSVs) form an L×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MMFCC is 2×20×8 = 320.
Fig. 2.5 The flowchart for extracting MMFCC
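The modulation spectral analysis of Steps 2-3 (Eqs. 25-29) can be sketched in Python as follows for a generic per-frame feature matrix; the same function applies unchanged to MFCC, OSC, and NASE trajectories. The function name is illustrative, and the sketch assumes the track contains at least one full texture window.

    import numpy as np

    MOD_BAND_EDGES = [0, 2, 4, 8, 16, 32, 64, 128, 256]   # modulation FFT-bin ranges (Table 2.4, W = 512)

    def modulation_msc_msv(feat, W=512):
        # feat: (num_frames, L) matrix of per-frame feature values (MFCC, OSC or NASE),
        # with num_frames >= W.  Returns the L-by-J MSC and MSV matrices (Eqs. 25-29).
        hop = W // 2                                        # 50% overlap between texture windows
        n_frames, L = feat.shape
        mags = []
        for start in range(0, n_frames - W + 1, hop):
            seg = feat[start:start + W, :]                  # one texture window
            mags.append(np.abs(np.fft.fft(seg, axis=0)))    # FFT along each feature trajectory (Eq. 25)
        mod_spec = np.mean(mags, axis=0)                    # averaged magnitude spectrogram (Eq. 26)
        J = len(MOD_BAND_EDGES) - 1
        msc = np.zeros((L, J))
        msv = np.zeros((L, J))
        for j in range(J):
            band = mod_spec[MOD_BAND_EDGES[j]:MOD_BAND_EDGES[j + 1], :]
            msp = band.max(axis=0)                          # modulation spectral peak (Eq. 27)
            msv[:, j] = band.min(axis=0)                    # modulation spectral valley (Eq. 28)
            msc[:, j] = msp - msv[:, j]                     # modulation spectral contrast (Eq. 29)
        return msc, msv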
2.1.4.2 Modulation Spectral Contrast of OSC (MOSC)

To observe the time-varying behavior of OSC, the same modulation spectrum analysis is applied to the OSC feature values. Fig. 2.6 shows the flowchart for extracting MOSC, and the detailed steps are described below.
Step 1: Framing and OSC Extraction
Given an input music signal, divide the whole signal into successive overlapped frames and extract the OSC coefficients of each frame.

Step 2: Modulation Spectrum Analysis
Let OSC_i(d), 0 \le d < D, be the d-th OSC feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times (W/2) + n}(d) \, e^{-j \frac{2\pi}{W} m n},    0 \le m < W,  0 \le d < D           (30)

where M_t(m, d) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and d is the OSC coefficient index. In this study W = 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,    0 \le m < W,  0 \le d < D                                    (31)

where T is the total number of texture windows in the music track.
Step 3: Contrast/Valley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study). The frequency interval of each modulation subband is shown in Table 2.4. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{OSC}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d)                               (32)

MSV^{OSC}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{OSC}(m, d)                               (33)

where \Phi_j^l and \Phi_j^h are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)                                                  (34)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.
Fig. 2.6 The flowchart for extracting MOSC
2.1.4.3 Modulation Spectral Contrast of NASE (MASE)

To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig. 2.7 shows the flowchart for extracting MASE, and the detailed steps are described below.

Step 1: Framing and NASE Extraction
Given an input music signal, divide the whole signal into successive overlapped frames and extract the NASE coefficients of each frame.

Step 2: Modulation Spectrum Analysis
Let NASE_i(d), 0 \le d < D, be the d-th NASE feature value of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times (W/2) + n}(d) \, e^{-j \frac{2\pi}{W} m n},    0 \le m < W,  0 \le d < D          (35)

where M_t(m, d) is the modulation spectrogram of the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study W = 512, which corresponds to about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

\bar{M}^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} |M_t(m, d)|,    0 \le m < W,  0 \le d < D                                   (36)

where T is the total number of texture windows in the music track.
Step 3: Contrast/Valley Determination
The averaged modulation spectrum of each feature value is decomposed into J logarithmically spaced modulation subbands (J = 8 in this study; see Table 2.4). For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:

MSP^{NASE}(j, d) = \max_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d)                             (37)

MSV^{NASE}(j, d) = \min_{\Phi_j^l \le m < \Phi_j^h} \bar{M}^{NASE}(m, d)                             (38)

where \Phi_j^l and \Phi_j^h are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 \le j < J. The MSPs correspond to the dominant rhythmic components and the MSVs to the non-rhythmic components in the modulation subbands. Therefore the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)                                               (39)

As a result, all MSCs (or MSVs) form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.
Fig. 2.7 The flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along each feature trajectory → averaging of the modulation spectra over all texture windows → contrast/valley determination → MASE)
Table 2.4 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0, 2)                              [0, 0.33)
1                [2, 4)                              [0.33, 0.66)
2                [4, 8)                              [0.66, 1.32)
3                [8, 16)                             [1.32, 2.64)
4                [16, 32)                            [2.64, 5.28)
5                [32, 64)                            [5.28, 10.56)
6                [64, 128)                           [10.56, 21.12)
7                [128, 256)                          [21.12, 42.24]
2.1.5 Statistical Aggregation of Modulation Spectral Feature Values

Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 2.8). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 2.9). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices are computed as the feature values.
2.1.5.1 Statistical Aggregation of MMFCC (SMMFCC)

The modulation spectral feature values derived from the l-th (0 \le l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)                                                       (40)

\sigma_{MSC-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC-row}^{MFCC}(l) \right)^2 \right)^{1/2}    (41)

\mu_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)                                                       (42)

\sigma_{MSV-row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV-row}^{MFCC}(l) \right)^2 \right)^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), \mu_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), ..., \mu_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), \mu_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T    (44)
Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)                                                       (45)

\sigma_{MSC-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

\mu_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)                                                       (47)

\sigma_{MSV-col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV-col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), \mu_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), ..., \mu_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), \mu_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T    (49)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4L+4J) is obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T                                                                         (50)

In summary, the row-based part is of size 4L = 4×20 = 80 and the column-based part is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors therefore results in a feature vector of length 4L+4J; that is, the overall feature dimension of SMMFCC is 80+32 = 112.
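A minimal Python sketch of this row-based and column-based statistical aggregation (Eqs. 40-50) is given below, assuming the L-by-J MSC and MSV matrices produced by the earlier sketch in Section 2.1.4.1; note that the ordering of the concatenated values is simplified relative to Eqs. (44) and (49).

    import numpy as np

    def aggregate_msc_msv(msc, msv):
        # msc, msv: L-by-J (or D-by-J) MSC and MSV matrices.
        # Returns the concatenation of the row-based (4L) and column-based (4J)
        # statistics; all means come first, followed by all standard deviations.
        def mean_std(mat, axis):
            return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])
        f_row = np.concatenate([mean_std(msc, axis=1), mean_std(msv, axis=1)])   # along each row
        f_col = np.concatenate([mean_std(msc, axis=0), mean_std(msv, axis=0)])   # along each column
        return np.concatenate([f_row, f_col])                                    # length 4L + 4J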
2.1.5.2 Statistical Aggregation of MOSC (SMOSC)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)                                                         (51)

\sigma_{MSC-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu_{MSC-row}^{OSC}(d) \right)^2 \right)^{1/2}        (52)

\mu_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)                                                         (53)

\sigma_{MSV-row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu_{MSV-row}^{OSC}(d) \right)^2 \right)^{1/2}        (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), \mu_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), ..., \mu_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), \mu_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)                                                         (56)

\sigma_{MSC-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu_{MSC-col}^{OSC}(j) \right)^2 \right)^{1/2}        (57)

\mu_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)                                                         (58)

\sigma_{MSV-col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu_{MSV-col}^{OSC}(j) \right)^2 \right)^{1/2}        (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), \mu_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), ..., \mu_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), \mu_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T    (60)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) is obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T                                                                            (61)

In summary, the row-based part is of size 4D = 4×20 = 80 and the column-based part is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors therefore results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 \le d < D) row of the MSC and MSV matrices of MASE can be computed as follows:
\mu_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)                                                       (62)

\sigma_{MSC-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu_{MSC-row}^{NASE}(d) \right)^2 \right)^{1/2}      (63)

\mu_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)                                                       (64)

\sigma_{MSV-row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu_{MSV-row}^{NASE}(d) \right)^2 \right)^{1/2}      (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), \mu_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), ..., \mu_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), \mu_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 \le j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)                                                       (67)

\sigma_{MSC-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu_{MSC-col}^{NASE}(j) \right)^2 \right)^{1/2}      (68)

\mu_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)                                                       (69)

\sigma_{MSV-col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu_{MSV-col}^{NASE}(j) \right)^2 \right)^{1/2}      (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), \mu_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), ..., \mu_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), \mu_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T    (71)

If the row-based and column-based modulation spectral feature vectors are combined together, a larger feature vector of size (4D+4J) is obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T                                                                         (72)

In summary, the row-based part is of size 4D = 4×19 = 76 and the column-based part is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors therefore results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
Fig. 2.8 The row-based modulation spectral feature values: the mean and standard deviation are computed along each row (i.e., across the modulation-frequency dimension) of the MSC and MSV matrices.

Fig. 2.9 The column-based modulation spectral feature values: the mean and standard deviation are computed along each column (i.e., across the feature dimension) of the MSC and MSV matrices.
2.1.6 Feature Vector Normalization

In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

f_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}                                                         (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},    1 \le c \le C                 (74)

where C is the number of classes, f_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C, 1 \le j \le N_c} f_{c,j}(m)
f_{min}(m) = \min_{1 \le c \le C, 1 \le j \le N_c} f_{c,j}(m)                                        (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
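The linear normalization of Eqs. (73)-(75) amounts to a per-dimension min-max scaling. A small Python sketch follows; the guard against constant feature dimensions is an assumption added for numerical safety, not part of the original formulation.

    import numpy as np

    def minmax_normalize(train_feats, feats):
        # Per-dimension linear normalization (Eqs. 74-75); the extrema are taken
        # over all training feature vectors (train_feats: (N, M) matrix).
        f_min = train_feats.min(axis=0)
        f_max = train_feats.max(axis=0)
        span = np.where(f_max > f_min, f_max - f_min, 1.0)   # avoid division by zero
        return (feats - f_min) / span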
2.2 Linear Discriminant Analysis (LDA)

Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h \le H) has to be found in order to provide higher discriminability among the music classes.

Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T                   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T                                (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right)                                               (78)

From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the corresponding eigenvalues; thus S_W \Phi = \Phi \Lambda. Each training vector x is then whitening transformed by \Phi \Lambda^{-1/2}:

w = (\Phi \Lambda^{-1/2})^T x                                                                        (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues form the column vectors of the transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi                                                                  (80)

A_{WLDA} is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x                                                                                     (81)
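A sketch of the whitened LDA transform of Eqs. (76)-(81) is given below, assuming NumPy and that S_W is reasonably well conditioned; the small floor placed on its eigenvalues is a safeguard added here and is not part of the original derivation.

    import numpy as np

    def whitened_lda(X, labels, n_components=None):
        # Whitened LDA transformation matrix A_WLDA (Eqs. 76-81).
        # X: (N, H) matrix of training feature vectors; labels: length-N class ids.
        labels = np.asarray(labels)
        classes = np.unique(labels)
        overall_mean = X.mean(axis=0)
        H = X.shape[1]
        Sw = np.zeros((H, H))
        Sb = np.zeros((H, H))
        for c in classes:
            Xc = X[labels == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)                               # Eq. 76
            diff = (mc - overall_mean)[:, None]
            Sb += Xc.shape[0] * (diff @ diff.T)                         # Eq. 77
        evals, Phi = np.linalg.eigh(Sw)                                 # S_W = Phi Lambda Phi^T
        white = Phi @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12)))  # Phi Lambda^{-1/2} (floored)
        Sb_w = white.T @ Sb @ white                                     # whitened between-class scatter
        evals_b, Psi = np.linalg.eigh(Sb_w)
        order = np.argsort(evals_b)[::-1]                               # eigenvalues in decreasing order
        k = n_components if n_components is not None else len(classes) - 1
        return white @ Psi[:, order[:k]]                                # Eq. 80; project with y = A.T @ x (Eq. 81)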
2.3 Music Genre Classification Phase

In the classification phase, the row-based and column-based modulation spectral feature vectors are first extracted from the input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened-LDA-transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 \le c \le C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}                                                   (82)

where y_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector with the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)                                                         (83)
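The nearest-centroid decision of Eqs. (82)-(83) can be sketched as follows; the function name and the use of NumPy broadcasting are illustrative choices.

    import numpy as np

    def nearest_centroid_predict(Y_train, labels, y):
        # Nearest-centroid genre decision in the whitened-LDA space (Eqs. 82-83).
        labels = np.asarray(labels)
        classes = np.unique(labels)
        centroids = np.array([Y_train[labels == c].mean(axis=0) for c in classes])  # Eq. 82
        dists = np.linalg.norm(centroids - y, axis=1)                               # Euclidean distance
        return classes[int(np.argmin(dists))]                                       # Eq. 83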
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
Table 22 The range of each octave-scale band-pass filter (Sampling rate = 44.1 kHz)
Filter number   Frequency interval (Hz)
0   [0, 0]
1   (0, 100]
2   (100, 200]
3   (200, 400]
4   (400, 800]
5   (800, 1600]
6   (1600, 3200]
7   (3200, 6400]
8   (6400, 12800]
9   (12800, 22050)
213 Normalized Audio Spectral Envelope (NASE)
NASE was defined in MPEG-7 for sound classification The NASE descriptor
provides a representation of the power spectrum of each audio frame Each
component of the NASE feature vector represents the normalized magnitude of a
particular frequency subband Fig 23 shows the block diagram for extracting the
NASE feature For a given music piece the main steps for computing NASE are
described as follows
Step 1 Framing and Spectral Analysis
An input music signal is divided into a number of successive overlapped
frames and each audio frame is multiplied by a Hamming window function
and analyzed using FFT to derive its spectrum, notated X(k), 1 ≤ k ≤ N,
where N is the size of FFT. The power spectrum is defined as the normalized
squared magnitude of the DFT spectrum X(k):

P(k) = (1 / (N·E_w)) |X(k)|^2,   k = 0
P(k) = (2 / (N·E_w)) |X(k)|^2,   0 < k < N/2                                (15)
where E_w is the energy of the Hamming window function w(n) of size N_w:

E_w = Σ_{n=0}^{N_w−1} |w(n)|^2                                              (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 62.5 Hz ("loEdge") and 16 kHz ("hiEdge") over a
spectrum of 8 octaves (see Fig 24). The NASE scale filtering
operation can be described as follows (see Table 23):

ASE_i(b) = Σ_{k=I_b^l}^{I_b^h} P_i(k),   0 ≤ b < B,  0 ≤ k ≤ N/2 − 1        (17)

where B is the number of logarithmic subbands within the frequency range
[loEdge, hiEdge] and is given by B = 8/r, where r is the spectral resolution of
the frequency subbands, ranging from 1/16 of an octave to 8 octaves (B = 16,
r = 1/2 in the study):

r = 2^j octaves,   −4 ≤ j ≤ 3                                               (18)

I_b^l and I_b^h are the low-frequency index and high-frequency index of the b-th
band-pass filter, given as

I_b^l = (f_b^l / f_s) N,   I_b^h = (f_b^h / f_s) N                          (19)

where f_s is the sampling frequency, and f_b^l and f_b^h are the low frequency and
high frequency of the b-th band-pass filter.
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
spectrum coefficients within this subband:

ASE(b) = Σ_{k=I_b^l}^{I_b^h} P(k),   0 ≤ b ≤ B + 1                          (20)

Each ASE coefficient is then converted to the decibel scale:

ASE_dB(b) = 10 log_{10}(ASE(b)),   0 ≤ b ≤ B + 1                            (21)

The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R:

NASE(b) = ASE_dB(b) / R,   0 ≤ b ≤ B + 1                                    (22)

where the RMS-norm gain value R is defined as

R = ( Σ_{b=0}^{B+1} (ASE_dB(b))^2 )^{1/2}                                   (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge, and the RMS-norm gain value R. Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
x_NASE = [R, NASE(0), NASE(1), …, NASE(B+1)]^T                              (24)
Fig 23 The flowchart for computing NASE (Input signal → Framing → Windowing → FFT → Subband Decomposition → Normalized Audio Spectral Envelope → NASE)
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2 (one coefficient below loEdge = 62.5 Hz, 16 logarithmically spaced coefficients between 62.5 Hz and 16 kHz, and one coefficient above hiEdge = 16 kHz)
Table 23 The range of each normalized audio spectral envelope band-pass filter
Filter number   Frequency interval (Hz)
0   (0, 62]
1   (62, 88]
2   (88, 125]
3   (125, 176]
4   (176, 250]
5   (250, 353]
6   (353, 500]
7   (500, 707]
8   (707, 1000]
9   (1000, 1414]
10   (1414, 2000]
11   (2000, 2828]
12   (2828, 4000]
13   (4000, 5656]
14   (5656, 8000]
15   (8000, 11313]
16   (11313, 16000]
17   (16000, 22050]
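To make the above steps concrete, the following Python sketch (an illustration only, not the exact implementation used in this thesis) computes the NASE vector of a single frame under the stated settings (fs = 44.1 kHz, loEdge = 62.5 Hz, hiEdge = 16 kHz, B = 16); the small offset in the logarithm and the exact handling of the band edges are assumptions of the sketch:

import numpy as np

def nase_frame(frame, fs=44100, lo_edge=62.5, hi_edge=16000.0, B=16):
    Nw = len(frame)
    w = np.hamming(Nw)
    Ew = np.sum(w ** 2)                       # Eq. (16): window energy
    N = Nw                                    # assumption: FFT size equals frame length
    X = np.fft.fft(frame * w, N)
    P = np.zeros(N // 2)
    P[0] = np.abs(X[0]) ** 2 / (N * Ew)       # Eq. (15), k = 0
    P[1:] = 2.0 * np.abs(X[1:N // 2]) ** 2 / (N * Ew)
    # Band edges: one band below loEdge, B log-spaced in-band edges, one band above hiEdge
    edges = lo_edge * (hi_edge / lo_edge) ** (np.arange(B + 1) / B)
    edges = np.concatenate(([0.0], edges, [fs / 2.0]))
    k_freq = np.arange(N // 2) * fs / N
    ase = np.zeros(B + 2)
    for b in range(B + 2):                    # Eqs. (17)/(20): sum of power per subband
        idx = (k_freq > edges[b]) & (k_freq <= edges[b + 1])
        ase[b] = np.sum(P[idx])
    ase_db = 10.0 * np.log10(ase + 1e-12)     # Eq. (21); offset avoids log(0)
    R = np.sqrt(np.sum(ase_db ** 2))          # Eq. (23): RMS-norm gain value
    nase = ase_db / R                         # Eq. (22)
    return np.concatenate(([R], nase))        # Eq. (24): B + 3 = 19 values per frame

The 19 values returned per frame correspond to the NASE feature dimension (B + 3) used later in the modulation spectral analysis of NASE.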
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals. In order to capture the time-varying behavior of music signals, we
employ modulation spectral analysis on MFCC, OSC and NASE to observe the
variations of the sound.
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W:

M_t(m, l) = Σ_{n=0}^{W−1} MFCC_{t×(W/2)+n}[l] · e^{−j2πnm/W},   0 ≤ m < W,  0 ≤ l < L      (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^MFCC(m, l) = (1/T) Σ_{t=1}^{T} |M_t(m, l)|,   0 ≤ m < W,  0 ≤ l < L      (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^MFCC(j, l) = max_{Φ_j^l ≤ m < Φ_j^h} M^MFCC(m, l)                      (27)

MSV^MFCC(j, l) = min_{Φ_j^l ≤ m < Φ_j^h} M^MFCC(m, l)                      (28)
where Φ_j^l and Φ_j^h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands. Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution:

MSC^MFCC(j, l) = MSP^MFCC(j, l) − MSV^MFCC(j, l)                           (29)

As a result, all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
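As an illustration of Steps 2 and 3, the sketch below (hypothetical helper, not code from the thesis) computes the time-averaged modulation spectrogram and the per-subband MSC/MSV values for a frame-level feature trajectory matrix; it assumes rectangular texture segments of W = 512 frames with 50% overlap, as described above:

import numpy as np

def modulation_msc_msv(F, W=512, J=8):
    # F: num_frames x L matrix of frame-level feature values (e.g., L = 20 MFCCs)
    num_frames, L = F.shape
    hop = W // 2
    M = np.zeros((W, L))
    T = 0
    # Eqs. (25)-(26): magnitude FFT along each feature trajectory, averaged over texture windows
    for s in range(0, num_frames - W + 1, hop):
        M += np.abs(np.fft.fft(F[s:s + W, :], axis=0))
        T += 1
    M /= max(T, 1)
    # Logarithmically spaced modulation subbands (Table 24): [0,2), [2,4), ..., [128,256)
    edges = [0] + [2 ** (j + 1) for j in range(J)]
    MSC = np.zeros((J, L))
    MSV = np.zeros((J, L))
    for j in range(J):
        band = M[edges[j]:edges[j + 1], :]
        MSP = band.max(axis=0)                # Eq. (27): modulation spectral peak
        MSV[j] = band.min(axis=0)             # Eq. (28): modulation spectral valley
        MSC[j] = MSP - MSV[j]                 # Eq. (29): modulation spectral contrast
    return MSC, MSV                           # each of shape J x L (here 8 x 20)

The same routine applies unchanged to the OSC and NASE trajectories of the following subsections, with L replaced by the corresponding feature dimension D.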
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 ≤ d < D, be the d-th OSC of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} OSC_{t×(W/2)+n}[d] · e^{−j2πnm/W},   0 ≤ m < W,  0 ≤ d < D      (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^OSC(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,   0 ≤ m < W,  0 ≤ d < D      (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^OSC(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M^OSC(m, d)                        (32)

MSV^OSC(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M^OSC(m, d)                        (33)
where Φ_j^l and Φ_j^h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands. Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution:

MSC^OSC(j, d) = MSP^OSC(j, d) − MSV^OSC(j, d)                              (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = Σ_{n=0}^{W−1} NASE_{t×(W/2)+n}[d] · e^{−j2πnm/W},   0 ≤ m < W,  0 ≤ d < D      (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^NASE(m, d) = (1/T) Σ_{t=1}^{T} |M_t(m, d)|,   0 ≤ m < W,  0 ≤ d < D      (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24).
In the study the number of modulation subbands is 8 (J = 8). The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
MSP^NASE(j, d) = max_{Φ_j^l ≤ m < Φ_j^h} M^NASE(m, d)                      (37)

MSV^NASE(j, d) = min_{Φ_j^l ≤ m < Φ_j^h} M^NASE(m, d)                      (38)
where Φ_j^l and Φ_j^h are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands. Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution:

MSC^NASE(j, d) = MSP^NASE(j, d) − MSV^NASE(j, d)                           (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE (Framing → NASE extraction → DFT of each feature trajectory → windowing/averaging of the modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0   [0, 2)      [0, 0.33)
1   [2, 4)      [0.33, 0.66)
2   [4, 8)      [0.66, 1.32)
3   [8, 16)     [1.32, 2.64)
4   [16, 32)    [2.64, 5.28)
5   [32, 64)    [5.28, 10.56)
6   [64, 128)   [10.56, 21.12)
7   [128, 256)  [21.12, 42.24]
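For reference, the Hz intervals in Table 24 follow from the modulation-frequency resolution of the texture-window FFT. The short sketch below reproduces them assuming a frame-level feature rate of about 84.5 frames per second, a value inferred from the table itself since the frame hop size is not restated in this section:

frame_rate = 84.48            # assumed feature frames per second (inferred from Table 24)
W = 512                       # texture window length in frames
edges = [0, 2, 4, 8, 16, 32, 64, 128, 256]
for j in range(8):
    lo_hz = edges[j] * frame_rate / W
    hi_hz = edges[j + 1] * frame_rate / W
    print(j, round(lo_hz, 2), round(hi_hz, 2))
# Prints 0.0-0.33, 0.33-0.66, ..., 21.12-42.24 Hz, matching Table 24.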
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband across different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space, the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
μ_MSC,row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSC^MFCC(j, l)                                      (40)

σ_MSC,row^MFCC(l) = ( (1/J) Σ_{j=0}^{J−1} (MSC^MFCC(j, l) − μ_MSC,row^MFCC(l))^2 )^{1/2}    (41)

μ_MSV,row^MFCC(l) = (1/J) Σ_{j=0}^{J−1} MSV^MFCC(j, l)                                      (42)

σ_MSV,row^MFCC(l) = ( (1/J) Σ_{j=0}^{J−1} (MSV^MFCC(j, l) − μ_MSV,row^MFCC(l))^2 )^{1/2}    (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
f_row^MFCC = [μ_MSC,row^MFCC(0), σ_MSC,row^MFCC(0), μ_MSV,row^MFCC(0), σ_MSV,row^MFCC(0), …,
              μ_MSC,row^MFCC(L−1), σ_MSC,row^MFCC(L−1), μ_MSV,row^MFCC(L−1), σ_MSV,row^MFCC(L−1)]^T      (44)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
μ_MSC,col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSC^MFCC(j, l)                                      (45)

σ_MSC,col^MFCC(j) = ( (1/L) Σ_{l=0}^{L−1} (MSC^MFCC(j, l) − μ_MSC,col^MFCC(j))^2 )^{1/2}    (46)

μ_MSV,col^MFCC(j) = (1/L) Σ_{l=0}^{L−1} MSV^MFCC(j, l)                                      (47)

σ_MSV,col^MFCC(j) = ( (1/L) Σ_{l=0}^{L−1} (MSV^MFCC(j, l) − μ_MSV,col^MFCC(j))^2 )^{1/2}    (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_col^MFCC = [μ_MSC,col^MFCC(0), σ_MSC,col^MFCC(0), μ_MSV,col^MFCC(0), σ_MSV,col^MFCC(0), …,
              μ_MSC,col^MFCC(J−1), σ_MSC,col^MFCC(J−1), μ_MSV,col^MFCC(J−1), σ_MSV,col^MFCC(J−1)]^T      (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained:

f^MFCC = [(f_row^MFCC)^T (f_col^MFCC)^T]^T                                   (50)
In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a
feature vector of length 4L+4J. That is, the overall
feature dimension of SMMFCC is 80+32 = 112.
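A compact sketch of this aggregation is given below (illustrative only; the ordering of the entries differs from Eqs. (44) and (49), but the same 4L + 4J = 112 values are produced):

import numpy as np

def smmfcc(MSC, MSV):
    # MSC, MSV: J x L matrices (J = 8 modulation subbands, L = 20 MFCC coefficients)
    # Row-based statistics (per feature index l, across modulation subbands j)
    f_row = np.concatenate([MSC.mean(axis=0), MSC.std(axis=0),
                            MSV.mean(axis=0), MSV.std(axis=0)])    # 4L values
    # Column-based statistics (per modulation subband j, across feature indices l)
    f_col = np.concatenate([MSC.mean(axis=1), MSC.std(axis=1),
                            MSV.mean(axis=1), MSV.std(axis=1)])    # 4J values
    return np.concatenate([f_row, f_col])                          # Eq. (50): 112 values

The SMOSC and SMASE vectors of the next two subsections are obtained from the same routine applied to the OSC-based and NASE-based MSC/MSV matrices.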
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows
μ_MSC,row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSC^OSC(j, d)                                      (51)

σ_MSC,row^OSC(d) = ( (1/J) Σ_{j=0}^{J−1} (MSC^OSC(j, d) − μ_MSC,row^OSC(d))^2 )^{1/2}     (52)

μ_MSV,row^OSC(d) = (1/J) Σ_{j=0}^{J−1} MSV^OSC(j, d)                                      (53)

σ_MSV,row^OSC(d) = ( (1/J) Σ_{j=0}^{J−1} (MSV^OSC(j, d) − μ_MSV,row^OSC(d))^2 )^{1/2}     (54)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_row^OSC = [μ_MSC,row^OSC(0), σ_MSC,row^OSC(0), μ_MSV,row^OSC(0), σ_MSV,row^OSC(0), …,
             μ_MSC,row^OSC(D−1), σ_MSC,row^OSC(D−1), μ_MSV,row^OSC(D−1), σ_MSV,row^OSC(D−1)]^T      (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

μ_MSC,col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSC^OSC(j, d)                                      (56)

σ_MSC,col^OSC(j) = ( (1/D) Σ_{d=0}^{D−1} (MSC^OSC(j, d) − μ_MSC,col^OSC(j))^2 )^{1/2}     (57)

μ_MSV,col^OSC(j) = (1/D) Σ_{d=0}^{D−1} MSV^OSC(j, d)                                      (58)

σ_MSV,col^OSC(j) = ( (1/D) Σ_{d=0}^{D−1} (MSV^OSC(j, d) − μ_MSV,col^OSC(j))^2 )^{1/2}     (59)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_col^OSC = [μ_MSC,col^OSC(0), σ_MSC,col^OSC(0), μ_MSV,col^OSC(0), σ_MSV,col^OSC(0), …,
             μ_MSC,col^OSC(J−1), σ_MSC,col^OSC(J−1), μ_MSV,col^OSC(J−1), σ_MSV,col^OSC(J−1)]^T      (60)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

f^OSC = [(f_row^OSC)^T (f_col^OSC)^T]^T                                   (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a
feature vector of length 4D+4J. That is, the overall
feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows:

μ_MSC,row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSC^NASE(j, d)                                      (62)

σ_MSC,row^NASE(d) = ( (1/J) Σ_{j=0}^{J−1} (MSC^NASE(j, d) − μ_MSC,row^NASE(d))^2 )^{1/2}    (63)

μ_MSV,row^NASE(d) = (1/J) Σ_{j=0}^{J−1} MSV^NASE(j, d)                                      (64)

σ_MSV,row^NASE(d) = ( (1/J) Σ_{j=0}^{J−1} (MSV^NASE(j, d) − μ_MSV,row^NASE(d))^2 )^{1/2}    (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f_row^NASE = [μ_MSC,row^NASE(0), σ_MSC,row^NASE(0), μ_MSV,row^NASE(0), σ_MSV,row^NASE(0), …,
              μ_MSC,row^NASE(D−1), σ_MSC,row^NASE(D−1), μ_MSV,row^NASE(D−1), σ_MSV,row^NASE(D−1)]^T      (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

μ_MSC,col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSC^NASE(j, d)                                      (67)

σ_MSC,col^NASE(j) = ( (1/D) Σ_{d=0}^{D−1} (MSC^NASE(j, d) − μ_MSC,col^NASE(j))^2 )^{1/2}    (68)

μ_MSV,col^NASE(j) = (1/D) Σ_{d=0}^{D−1} MSV^NASE(j, d)                                      (69)

σ_MSV,col^NASE(j) = ( (1/D) Σ_{d=0}^{D−1} (MSV^NASE(j, d) − μ_MSV,col^NASE(j))^2 )^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f_col^NASE = [μ_MSC,col^NASE(0), σ_MSC,col^NASE(0), μ_MSV,col^NASE(0), σ_MSV,col^NASE(0), …,
              μ_MSC,col^NASE(J−1), σ_MSC,col^NASE(J−1), μ_MSV,col^NASE(J−1), σ_MSV,col^NASE(J−1)]^T      (71)

If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

f^NASE = [(f_row^NASE)^T (f_col^NASE)^T]^T                                   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and column-based modulation spectral feature vectors results in a
feature vector of length 4D+4J. That is, the overall
feature dimension of SMASE is 76+32 = 108.
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase, the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre:

\bar{f}_c = (1/N_c) Σ_{n=1}^{N_c} f_{c,n}                                    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c
is the number of training music signals belonging to the c-th music genre. Since the
dynamic ranges of different feature values may be different, a linear normalization is
applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = (\bar{f}_c(m) − f_min(m)) / (f_max(m) − f_min(m)),   1 ≤ c ≤ C          (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th
representative feature vector, and f_max(m) and f_min(m) denote respectively the
maximum and minimum of the m-th feature values of all training music signals:

f_max(m) = max_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)
f_min(m) = min_{1≤c≤C, 1≤j≤N_c} f_{c,j}(m)                                   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre.
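A small sketch of Eqs. (73)-(75) is given below (hypothetical function and variable names); the same f_min and f_max obtained from the training data would also be used to normalize test vectors, and the small offset in the denominator is an added assumption to guard against constant features:

import numpy as np

def genre_templates(train_feats, C):
    # train_feats: list of (feature_vector, genre_index) pairs for the training set
    dim = len(train_feats[0][0])
    sums = np.zeros((C, dim))
    counts = np.zeros(C)
    for f, c in train_feats:
        sums[c] += f
        counts[c] += 1
    f_bar = sums / counts[:, None]                          # Eq. (73): per-genre mean vectors
    all_f = np.array([f for f, _ in train_feats])
    f_min = all_f.min(axis=0)                               # Eq. (75)
    f_max = all_f.max(axis=0)
    f_hat = (f_bar - f_min) / (f_max - f_min + 1e-12)       # Eq. (74): linear normalization
    return f_hat, f_min, f_max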
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy in a lower-dimensional feature vector space. LDA deals with the
discrimination between various classes rather than the representation of all classes.
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance. In LDA, an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in
order to provide higher discriminability among various music classes.
Let S_W and S_B denote the within-class scatter matrix and between-class scatter
matrix, respectively. The within-class scatter matrix is defined as

S_W = Σ_{c=1}^{C} Σ_{n=1}^{N_c} (x_{c,n} − \bar{x}_c)(x_{c,n} − \bar{x}_c)^T          (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class
c, C is the total number of music classes, and N_c is the number of training vectors
labeled as class c. The between-class scatter matrix is given by

S_B = Σ_{c=1}^{C} N_c (\bar{x}_c − \bar{x})(\bar{x}_c − \bar{x})^T                      (77)
where \bar{x} is the mean vector of all training vectors. The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr( (A^T S_W A)^{−1} (A^T S_B A) )                                  (78)
From the above equation, we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23]. First, the eigenvalues and corresponding
eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of S_W, and Λ the diagonal matrix formed by the
corresponding eigenvalues. Thus S_W Φ = ΦΛ. Each training vector x is then
whitening transformed by ΦΛ^{−1/2}:

x_w = (ΦΛ^{−1/2})^T x                                                        (79)

It can be shown that the whitened within-class scatter matrix
S_W^w = (ΦΛ^{−1/2})^T S_W (ΦΛ^{−1/2}), derived from all the whitened training vectors, will
become an identity matrix I. Thus the whitened between-class scatter matrix
S_B^w = (ΦΛ^{−1/2})^T S_B (ΦΛ^{−1/2}) contains all the discriminative information. A
transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w.
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C−1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix
A_WLDA is defined as

A_WLDA = ΦΛ^{−1/2} Ψ                                                         (80)

A_WLDA will be employed to transform each H-dimensional feature vector to a lower
h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced
h-dimensional feature vector can be computed by

y = A_WLDA^T x                                                               (81)
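The following Python sketch summarizes Eqs. (76)-(81); it is illustrative only, and the small ridge term added to keep Λ invertible is an assumption not stated in the text:

import numpy as np

def whitened_lda(X, y, C):
    # X: (n_samples x H) matrix of training feature vectors; y: class labels 0..C-1
    H = X.shape[1]
    x_bar = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(C):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                       # Eq. (76)
        Sb += len(Xc) * np.outer(mc - x_bar, mc - x_bar)    # Eq. (77)
    lam, Phi = np.linalg.eigh(Sw + 1e-8 * np.eye(H))        # S_W Φ = Φ Λ (ridge added)
    Wh = Phi @ np.diag(lam ** -0.5)                         # Φ Λ^{-1/2}
    Sb_w = Wh.T @ Sb @ Wh                                   # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:C - 1]]           # eigenvectors of the C-1 largest eigenvalues
    return Wh @ Psi                                         # Eq. (80): A_WLDA, used as y = A.T @ x (Eq. (81))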
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA
transformed feature vector. In this study, the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector:
\bar{y}_c = (1/N_c) Σ_{n=1}^{N_c} y_{c,n}                                    (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = argmin_{1≤c≤C} d(y, \bar{y}_c)                                           (83)
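A minimal sketch of the nearest-centroid decision of Eqs. (82)-(83) is given below (hypothetical names; Z holds the whitened-LDA-transformed training vectors and z is one transformed test vector):

import numpy as np

def classify(z, Z, y, C):
    centroids = np.array([Z[y == c].mean(axis=0) for c in range(C)])   # Eq. (82)
    d = np.linalg.norm(centroids - z, axis=1)                          # Euclidean distances
    return int(np.argmin(d))                                           # Eq. (83): identified genre index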
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study, each MP3 audio file is first converted into raw digital audio before
classification. These music tracks are classified into six classes (that is, C = 6):
Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114
tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102
tracks of Rock/Pop, and 122/122 tracks of World music genre.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = Σ_{1≤c≤C} P_c · CA_c                                                    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the
classification accuracy for the c-th music genre.
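For instance, the short sketch below evaluates Eq. (84) given per-class accuracies and the per-class track counts of the test set (320, 114, 26, 45, 102, 122), which define the class probabilities P_c:

import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    # per_class_acc: CA_c for each genre; class_counts: number of test tracks per genre
    P = np.asarray(class_counts) / np.sum(class_counts)    # class priors P_c
    return float(np.sum(P * np.asarray(per_class_acc)))    # Eq. (84)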
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector. In this table, SMMFCC1, SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1,
and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices.
Table 31 Averaged classification accuracy (CA) for each row-based modulation spectral feature vector
Feature Set   CA (%)
SMMFCC1   77.50
SMOSC1   79.15
SMASE1   77.78
SMMFCC1+SMOSC1+SMASE1   84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 85.94 0.00 7.69 0.00 0.98 15.57
Electronic 0.00 79.82 0.00 2.22 6.86 4.92
Jazz 1.88 0.00 69.23 0.00 0.00 3.28
MetalPunk 0.63 2.63 0.00 80.00 19.61 3.28
PopRock 1.25 10.53 19.23 17.78 68.63 11.48
World 10.31 7.02 3.85 0.00 3.92 61.48
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 91.25 0.88 3.85 0.00 1.96 8.20
Electronic 0.31 78.07 3.85 4.44 10.78 9.02
Jazz 1.25 0.00 73.08 2.22 0.98 4.92
MetalPunk 0.00 4.39 0.00 71.11 20.59 2.46
PopRock 0.00 11.40 11.54 22.22 59.80 6.56
World 7.19 5.26 7.69 0.00 5.88 68.85
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 89.38 2.63 3.85 0.00 2.94 14.75
Electronic 0.00 76.32 3.85 2.22 8.82 4.10
Jazz 1.56 3.51 65.38 0.00 0.00 7.38
MetalPunk 0.00 3.51 3.85 80.00 17.65 3.28
PopRock 0.31 8.77 11.54 15.56 66.67 10.66
World 8.75 5.26 11.54 2.22 3.92 59.84
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 3.85 0.00 0.00 7.38
Electronic 0.00 84.21 3.85 2.22 8.82 7.38
Jazz 0.63 0.88 80.77 0.00 0.00 0.82
MetalPunk 0.00 0.88 0.00 75.56 7.84 0.82
PopRock 0.31 7.89 7.69 20.00 78.43 13.11
World 5.31 6.14 3.85 2.22 4.90 70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector. In this table, SMMFCC2, SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from
modulation spectral analysis of MFCC, OSC and NASE. From Table 33 we can see
that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2,
which is different from the row-based case. As before, the combined feature
vector also gets the best performance. Table 34 shows the corresponding confusion
matrices.
Table 33 Averaged classification accuracy (CA) for each column-based modulation spectral feature vector
Feature Set   CA (%)
SMMFCC2   70.64
SMOSC2   68.59
SMASE2   71.74
SMMFCC2+SMOSC2+SMASE2   78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
(a) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 81.88 1.75 0.00 0.00 2.94 27.05
Electronic 0.00 72.81 0.00 2.22 8.82 4.92
Jazz 5.31 0.88 76.92 0.00 5.88 16.39
MetalPunk 0.31 4.39 0.00 73.33 20.59 1.64
PopRock 0.00 14.91 15.38 22.22 50.00 8.20
World 12.50 5.26 7.69 2.22 11.76 41.80
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
(c) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors. SMMFCC3,
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC,
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that
the combined feature vector gets a better classification performance than each
individual row-based or column-based feature vector. Especially, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors
Feature Set   CA (%)
SMMFCC3   80.38
SMOSC3   81.34
SMASE3   81.21
SMMFCC3+SMOSC3+SMASE3   85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) (in %) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature values. Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy of the MSC & MSV features and the energy for each feature value
Feature Set   MSCs & MSVs (%)   MSE (%)
SMMFCC1   77.50   72.02
SMMFCC2   70.64   69.82
SMMFCC3   80.38   79.15
SMOSC1   79.15   77.50
SMOSC2   68.59   70.51
SMOSC3   81.34   80.11
SMASE1   77.78   76.41
SMASE2   71.74   71.06
SMASE3   81.21   79.15
SMMFCC1+SMOSC1+SMASE1   84.64   85.08
SMMFCC2+SMOSC2+SMASE2   78.60   79.01
SMMFCC3+SMOSC3+SMASE3   85.32   85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre: a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119-139
21
where Ew is the energy of the Hamming window function w(n) of size Nw
|)(|1
0
2summinus
=
=wN
nw nwE (16)
Step 2 Subband Decomposition
The power spectrum is divided into logarithmically spaced subbands
spanning between 625 Hz (ldquoloEdgerdquo) and 16 kHz (ldquohiEdgerdquo) over a
spectrum of 8 octave interval (see Fig24) The NASE scale filtering
operation can be described as follows(see Table 23)
)()(
sum=
=hb
lb
I
Ikii kPbASE 120 0 minusleleltle NkBb
(17)
where B is the number of logarithmic subbands within the frequency range
[loEdge hiEdge] and is given by B = 8r and r is the spectral resolution of
the frequency subbands ranging from 116 of an octave to 8 octaves(B=16
r=12 in the study)
(18) 34 octaves 2 leleminus= jr j
Ibl and Ibh are the low-frequency index and high-frequency index of the b-th
band-pass filter given as
)(
)(
Nff
I
Nff
I
s
hbhb
s
lblb
=
= (19)
where fs is the sampling frequency fbl and fbh are the low frequency and
high frequency of the b-th band-pass filter
Step 3 Normalized Audio Spectral Envelope
The ASE coefficient for the b-th subband is defined as the sum of power
22
spectrum coefficients within this subband
(20) 10 )()(
+lele= sum=
BbkPbASEhb
lb
I
Ik
Each ASE coefficient is then converted to the decibel scale
10 ))((log 10)( 10 +lele= BbbASEbASEdB (21)
The NASE coefficient is derived by normalizing each decibel-scale ASE
coefficient with the root-mean-square (RMS) norm gain value R
10 )()( +lele= BbR
bASEbNASE dB (22)
where the RMS-norm gain value R is defined as
))((1
0
2sum+
=
=B
bdB bASER (23)
In MPEG-7 the ASE coefficients consist of one coefficient representing power
between 0 Hz and loEdge a series of coefficients representing power in
logarithmically spaced bands between loEdge and hiEdge a coefficient representing
power above hiEdge the RMS-norm gain value R Therefore the feature dimension
of NASE is B+3 Thus the NASE feature vector of an audio frame will be
represented as follows
xNASE = [R NASE(0) NASE(1) hellip NASE(B+1)]T (24)
23
Framing
Input Signal
Windowing
FFT
Normalized Audio Spectral Envelope
NASE
Subband Decomposition
Fig 23 The flowchart for computing NASE
625 125 250 500 1K 2K 4K 8K 16K
884 1768 3536 7071 14142 28284 56569 113137
1 coeff 16 coeffs 1 coeff
loEdge hiEdge
Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution
r = 12
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
$$u_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j,d) \qquad (51)$$

$$\sigma_{MSC\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{OSC}(j,d)-u_{MSC\text{-}row}^{OSC}(d)\bigr)^{2}\right)^{1/2} \qquad (52)$$

$$u_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j,d) \qquad (53)$$

$$\sigma_{MSV\text{-}row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{OSC}(j,d)-u_{MSV\text{-}row}^{OSC}(d)\bigr)^{2}\right)^{1/2} \qquad (54)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

$$\mathbf{f}_{row}^{OSC} = \bigl[u_{MSC\text{-}row}^{OSC}(0),\ \sigma_{MSC\text{-}row}^{OSC}(0),\ u_{MSV\text{-}row}^{OSC}(0),\ \sigma_{MSV\text{-}row}^{OSC}(0),\ \ldots,\ u_{MSC\text{-}row}^{OSC}(D-1),\ \sigma_{MSC\text{-}row}^{OSC}(D-1),\ u_{MSV\text{-}row}^{OSC}(D-1),\ \sigma_{MSV\text{-}row}^{OSC}(D-1)\bigr]^{T} \qquad (55)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j,d) \qquad (56)$$

$$\sigma_{MSC\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{OSC}(j,d)-u_{MSC\text{-}col}^{OSC}(j)\bigr)^{2}\right)^{1/2} \qquad (57)$$

$$u_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j,d) \qquad (58)$$

$$\sigma_{MSV\text{-}col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{OSC}(j,d)-u_{MSV\text{-}col}^{OSC}(j)\bigr)^{2}\right)^{1/2} \qquad (59)$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{OSC} = \bigl[u_{MSC\text{-}col}^{OSC}(0),\ \sigma_{MSC\text{-}col}^{OSC}(0),\ u_{MSV\text{-}col}^{OSC}(0),\ \sigma_{MSV\text{-}col}^{OSC}(0),\ \ldots,\ u_{MSC\text{-}col}^{OSC}(J-1),\ \sigma_{MSC\text{-}col}^{OSC}(J-1),\ u_{MSV\text{-}col}^{OSC}(J-1),\ \sigma_{MSV\text{-}col}^{OSC}(J-1)\bigr]^{T} \qquad (60)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D + 4J) can be obtained:

$$\mathbf{f}^{OSC} = \bigl[(\mathbf{f}_{row}^{OSC})^{T},\ (\mathbf{f}_{col}^{OSC})^{T}\bigr]^{T} \qquad (61)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2.1.5.3 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:
$$u_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j,d) \qquad (62)$$

$$\sigma_{MSC\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSC^{NASE}(j,d)-u_{MSC\text{-}row}^{NASE}(d)\bigr)^{2}\right)^{1/2} \qquad (63)$$

$$u_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j,d) \qquad (64)$$

$$\sigma_{MSV\text{-}row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\bigl(MSV^{NASE}(j,d)-u_{MSV\text{-}row}^{NASE}(d)\bigr)^{2}\right)^{1/2} \qquad (65)$$
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
$$\mathbf{f}_{row}^{NASE} = \bigl[u_{MSC\text{-}row}^{NASE}(0),\ \sigma_{MSC\text{-}row}^{NASE}(0),\ u_{MSV\text{-}row}^{NASE}(0),\ \sigma_{MSV\text{-}row}^{NASE}(0),\ \ldots,\ u_{MSC\text{-}row}^{NASE}(D-1),\ \sigma_{MSC\text{-}row}^{NASE}(D-1),\ u_{MSV\text{-}row}^{NASE}(D-1),\ \sigma_{MSV\text{-}row}^{NASE}(D-1)\bigr]^{T} \qquad (66)$$

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

$$u_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j,d) \qquad (67)$$

$$\sigma_{MSC\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSC^{NASE}(j,d)-u_{MSC\text{-}col}^{NASE}(j)\bigr)^{2}\right)^{1/2} \qquad (68)$$

$$u_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j,d) \qquad (69)$$

$$\sigma_{MSV\text{-}col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\bigl(MSV^{NASE}(j,d)-u_{MSV\text{-}col}^{NASE}(j)\bigr)^{2}\right)^{1/2} \qquad (70)$$

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

$$\mathbf{f}_{col}^{NASE} = \bigl[u_{MSC\text{-}col}^{NASE}(0),\ \sigma_{MSC\text{-}col}^{NASE}(0),\ u_{MSV\text{-}col}^{NASE}(0),\ \sigma_{MSV\text{-}col}^{NASE}(0),\ \ldots,\ u_{MSC\text{-}col}^{NASE}(J-1),\ \sigma_{MSC\text{-}col}^{NASE}(J-1),\ u_{MSV\text{-}col}^{NASE}(J-1),\ \sigma_{MSV\text{-}col}^{NASE}(J-1)\bigr]^{T} \qquad (71)$$

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D + 4J) can be obtained:

$$\mathbf{f}^{NASE} = \bigl[(\mathbf{f}_{row}^{NASE})^{T},\ (\mathbf{f}_{col}^{NASE})^{T}\bigr]^{T} \qquad (72)$$

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D + 4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
Fig. 2.8 The row-based modulation spectral feature values: for each feature dimension (row of the MSC/MSV matrix within a texture window), the mean μ_row and standard deviation σ_row are computed across the modulation-frequency subbands.

Fig. 2.9 The column-based modulation spectral feature values: for each modulation subband (column of the MSC/MSV matrix within a texture window), the mean μ_col and standard deviation σ_col are computed across the feature dimensions.
2.1.6 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
$$\bar{\mathbf{f}}_{c} = \frac{1}{N_{c}}\sum_{n=1}^{N_{c}} \mathbf{f}_{c,n} \qquad (73)$$

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_{c}$ is the representative feature vector for the c-th music genre, and $N_{c}$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of the various feature values may be different, a linear normalization is applied to get the normalized feature vector $\hat{\mathbf{f}}_{c}$:
$$\hat{f}_{c}(m) = \frac{f_{c}(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \qquad 1 \le c \le C \qquad (74)$$

where C is the number of classes, $\hat{f}_{c}(m)$ denotes the m-th normalized feature value of the c-th representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:
$$f_{\max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_{c}} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_{c}} f_{c,j}(m) \qquad (75)$$

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
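The linear normalization of Eqs. (74)-(75) amounts to a per-dimension min-max scaling fitted on the training set. A minimal sketch is given below (Python/NumPy; the function names are mine, and the small epsilon guarding against constant feature dimensions is my addition, not part of the thesis).

```python
import numpy as np

def fit_minmax(train_matrix):
    """train_matrix: (N, M) array stacking the feature vectors of all training
    music signals.  Returns the per-dimension minima and maxima of Eq. (75)."""
    return train_matrix.min(axis=0), train_matrix.max(axis=0)

def minmax_normalize(f, f_min, f_max):
    """Linear normalization of Eq. (74) applied to one feature vector f."""
    return (f - f_min) / (f_max - f_min + 1e-12)
```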
2.2 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among the various music classes.
Let $\mathbf{S}_{W}$ and $\mathbf{S}_{B}$ denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

$$\mathbf{S}_{W} = \sum_{c=1}^{C}\sum_{n=1}^{N_{c}} (\mathbf{x}_{c,n}-\bar{\mathbf{x}}_{c})(\mathbf{x}_{c,n}-\bar{\mathbf{x}}_{c})^{T} \qquad (76)$$
where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_{c}$ is the mean vector of class c, C is the total number of music classes, and $N_{c}$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$$\mathbf{S}_{B} = \sum_{c=1}^{C} N_{c}(\bar{\mathbf{x}}_{c}-\bar{\mathbf{x}})(\bar{\mathbf{x}}_{c}-\bar{\mathbf{x}})^{T} \qquad (77)$$
where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_{F}$, defined as the ratio of between-class scatter to within-class scatter:

$$J_{F}(\mathbf{A}) = \mathrm{tr}\bigl((\mathbf{A}^{T}\mathbf{S}_{W}\mathbf{A})^{-1}(\mathbf{A}^{T}\mathbf{S}_{B}\mathbf{A})\bigr) \qquad (78)$$
From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of $\mathbf{S}_{W}$ are calculated. Let $\mathbf{\Phi}$ denote the matrix whose columns are the orthonormal eigenvectors of $\mathbf{S}_{W}$ and $\mathbf{\Lambda}$ the diagonal matrix formed by the corresponding eigenvalues; thus $\mathbf{S}_{W}\mathbf{\Phi} = \mathbf{\Phi}\mathbf{\Lambda}$. Each training vector $\mathbf{x}$ is then whitening transformed by $\mathbf{\Phi}\mathbf{\Lambda}^{-1/2}$:

$$\mathbf{x}_{w} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{x} \qquad (79)$$

It can be shown that the whitened within-class scatter matrix $\mathbf{S}_{W_{w}} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_{W}(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix $\mathbf{I}$. Thus the whitened between-class scatter matrix $\mathbf{S}_{B_{w}} = (\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})^{T}\mathbf{S}_{B}(\mathbf{\Phi}\mathbf{\Lambda}^{-1/2})$ contains all the discriminative information. A transformation matrix $\mathbf{\Psi}$ can be determined by finding the eigenvectors of $\mathbf{S}_{B_{w}}$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues form the column vectors of the transformation matrix $\mathbf{\Psi}$. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$$\mathbf{A}_{WLDA} = \mathbf{\Phi}\mathbf{\Lambda}^{-1/2}\mathbf{\Psi} \qquad (80)$$

$\mathbf{A}_{WLDA}$ is employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let $\mathbf{x}$ denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$$\mathbf{y} = \mathbf{A}_{WLDA}^{T}\mathbf{x} \qquad (81)$$
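The whitened LDA procedure of Eqs. (76)-(81) can be sketched as follows (a minimal NumPy illustration, not the author's code: the function name and variable names are mine, and the small epsilon added to the eigenvalues is a numerical safeguard that the thesis does not mention).

```python
import numpy as np

def whitened_lda(X, labels, C):
    """X: (N, H) matrix of training feature vectors, labels: (N,) class indices
    in [0, C).  Returns the H x (C-1) whitened LDA matrix A_WLDA of Eqs. (76)-(80)."""
    means = np.array([X[labels == c].mean(axis=0) for c in range(C)])
    grand_mean = X.mean(axis=0)
    Sw = sum((X[labels == c] - means[c]).T @ (X[labels == c] - means[c]) for c in range(C))
    Sb = sum((labels == c).sum() * np.outer(means[c] - grand_mean, means[c] - grand_mean)
             for c in range(C))
    lam, Phi = np.linalg.eigh(Sw)                       # S_W = Phi Lambda Phi^T
    whiten = Phi @ np.diag(1.0 / np.sqrt(lam + 1e-12))  # whitening transform Phi Lambda^{-1/2}
    Sb_w = whiten.T @ Sb @ whiten                       # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:C - 1]]       # eigenvectors of the C-1 largest eigenvalues
    return whiten @ Psi                                 # A_WLDA, Eq. (80)

# Dimensionality reduction of Eq. (81): y = A_WLDA^T x
# A = whitened_lda(X_train, train_labels, 6); y = A.T @ x
```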
2.3 Music Genre Classification Phase
In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let $\mathbf{y}$ denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:
$$\bar{\mathbf{y}}_{c} = \frac{1}{N_{c}}\sum_{n=1}^{N_{c}} \mathbf{y}_{c,n} \qquad (82)$$
where $\mathbf{y}_{c,n}$ denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, $\bar{\mathbf{y}}_{c}$ is the representative feature vector of the c-th music genre, and $N_{c}$ is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to $\mathbf{y}$:
$$s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_{c}) \qquad (83)$$
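A minimal sketch of this nearest centroid decision rule is given below (Python/NumPy; function and variable names are mine, assuming the class centroids of Eq. (82) have already been computed from the training set).

```python
import numpy as np

def nearest_centroid(y_vec, centroids):
    """y_vec: (h,) whitened-LDA-transformed vector of a test track.
    centroids: (C, h) matrix whose c-th row is the class centroid of Eq. (82).
    Returns the index of the genre with minimum Euclidean distance, Eq. (83)."""
    return int(np.argmin(np.linalg.norm(centroids - y_vec, axis=1)))
```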
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.
Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

$$CA = \sum_{1 \le c \le C} P_{c} \cdot CA_{c} \qquad (84)$$

where $P_{c}$ is the probability of appearance of the c-th music genre and $CA_{c}$ is the classification accuracy for the c-th music genre.
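As a small worked illustration of Eq. (84), the snippet below weights the per-genre accuracies by the test-set class proportions listed above; taking $P_c$ as the test-set proportion of each genre is my assumption of how the weights are obtained.

```python
# Test-set class sizes: Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, World.
N_TEST = [320, 114, 26, 45, 102, 122]        # 729 test tracks in total

def overall_accuracy(per_class_acc):
    """per_class_acc: per-genre accuracies CA_c (as fractions).  Implements Eq. (84)
    with P_c taken as the proportion of each genre in the test set."""
    total = sum(N_TEST)
    return sum(n * ca for n, ca in zip(N_TEST, per_class_acc)) / total
```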
3.1 Comparison of row-based modulation spectral feature vectors
Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.
Table 3.1 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64
Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         275          0     2          0        1     19
Electronic        0         91     0          1        7      6
Jazz              6          0    18          0        0      4
MetalPunk         2          3     0         36       20      4
PopRock           4         12     5          8       70     14
World            33          8     1          0        4     75
Total           320        114    26         45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       85.94      0.00   7.69       0.00     0.98  15.57
Electronic     0.00     79.82   0.00       2.22     6.86   4.92
Jazz           1.88      0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63      2.63   0.00      80.00    19.61   3.28
PopRock        1.25     10.53  19.23      17.78    68.63  11.48
World         10.31      7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         292          1     1          0        2     10
Electronic        1         89     1          2       11     11
Jazz              4          0    19          1        1      6
MetalPunk         0          5     0         32       21      3
PopRock           0         13     3         10       61      8
World            23          6     2          0        6     84
Total           320        114    26         45      102    122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       91.25      0.88   3.85       0.00     1.96   8.20
Electronic     0.31     78.07   3.85       4.44    10.78   9.02
Jazz           1.25      0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00      4.39   0.00      71.11    20.59   2.46
PopRock        0.00     11.40  11.54      22.22    59.80   6.56
World          7.19      5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         286          3     1          0        3     18
Electronic        0         87     1          1        9      5
Jazz              5          4    17          0        0      9
MetalPunk         0          4     1         36       18      4
PopRock           1         10     3          7       68     13
World            28          6     3          1        4     73
Total           320        114    26         45      102    122

(c) SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       89.38      2.63   3.85       0.00     2.94  14.75
Electronic     0.00     76.32   3.85       2.22     8.82   4.10
Jazz           1.56      3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00      3.51   3.85      80.00    17.65   3.28
PopRock        0.31      8.77  11.54      15.56    66.67  10.66
World          8.75      5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300          0     1          0        0      9
Electronic        0         96     1          1        9      9
Jazz              2          1    21          0        0      1
MetalPunk         0          1     0         34        8      1
PopRock           1          9     2          9       80     16
World            17          7     1          1        5     86
Total           320        114    26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      0.00   3.85       0.00     0.00   7.38
Electronic     0.00     84.21   3.85       2.22     8.82   7.38
Jazz           0.63      0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00      0.88   0.00      75.56     7.84   0.82
PopRock        0.31      7.89   7.69      20.00    78.43  13.11
World          5.31      6.14   3.85       2.22     4.90  70.49
3.2 Comparison of column-based modulation spectral feature vectors
Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again gives the best performance. Table 3.4 shows the corresponding confusion matrices.
Table 3.3 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         272          1     1          0        6     22
Electronic        0         84     0          2        8      4
Jazz             13          1    19          1        2     19
MetalPunk         2          7     0         39       30      4
PopRock           0         11     3          3       47     19
World            33         10     3          0        9     54
Total           320        114    26         45      102    122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       85.00      0.88   3.85       0.00     5.88  18.03
Electronic     0.00     73.68   0.00       4.44     7.84   3.28
Jazz           4.06      0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63      6.14   0.00      86.67    29.41   3.28
PopRock        0.00      9.65  11.54       6.67    46.08  15.57
World         10.31      8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         262          2     0          0        3     33
Electronic        0         83     0          1        9      6
Jazz             17          1    20          0        6     20
MetalPunk         1          5     0         33       21      2
PopRock           0         17     4         10       51     10
World            40          6     2          1       12     51
Total           320        114    26         45      102    122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       81.88      1.75   0.00       0.00     2.94  27.05
Electronic     0.00     72.81   0.00       2.22     8.82   4.92
Jazz           5.31      0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31      4.39   0.00      73.33    20.59   1.64
PopRock        0.00     14.91  15.38      22.22    50.00   8.20
World         12.50      5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         277          0     0          0        2     29
Electronic        0         83     0          1        5      2
Jazz              9          3    17          1        2     15
MetalPunk         1          5     1         35       24      7
PopRock           2         13     1          8       57     15
World            31         10     7          0       12     54
Total           320        114    26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       86.56      0.00   0.00       0.00     1.96  23.77
Electronic     0.00     72.81   0.00       2.22     4.90   1.64
Jazz           2.81      2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31      4.39   3.85      77.78    23.53   5.74
PopRock        0.63     11.40   3.85      17.78    55.88  12.30
World          9.69      8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         289          5     0          0        3     18
Electronic        0         89     0          2        4      4
Jazz              2          3    19          0        1     10
MetalPunk         2          2     0         38       21      2
PopRock           0         12     5          4       61     11
World            27          3     2          1       12     77
Total           320        114    26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       90.31      4.39   0.00       0.00     2.94  14.75
Electronic     0.00     78.07   0.00       4.44     3.92   3.28
Jazz           0.63      2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63      1.75   0.00      84.44    20.59   1.64
PopRock        0.00     10.53  19.23       8.89    59.80   9.02
World          8.44      2.63   7.69       2.22    11.76  63.11
3.3 Combination of row-based and column-based modulation spectral feature vectors
Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector gives better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.
Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300          2     1          0        3     19
Electronic        0         86     0          1        7      5
Jazz              2          0    18          0        0      3
MetalPunk         1          4     0         35       18      2
PopRock           1         16     4          8       67     13
World            16          6     3          1        7     80
Total           320        114    26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      1.75   3.85       0.00     2.94  15.57
Electronic     0.00     75.44   0.00       2.22     6.86   4.10
Jazz           0.63      0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31      3.51   0.00      77.78    17.65   1.64
PopRock        0.31     14.04  15.38      17.78    65.69  10.66
World          5.00      5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300          0     0          0        1     13
Electronic        0         90     1          2        9      6
Jazz              0          0    21          0        0      4
MetalPunk         0          2     0         31       21      2
PopRock           0         11     3         10       64     10
World            20         11     1          2        7     87
Total           320        114    26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      0.00   0.00       0.00     0.98  10.66
Electronic     0.00     78.95   3.85       4.44     8.82   4.92
Jazz           0.00      0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00      1.75   0.00      68.89    20.59   1.64
PopRock        0.00      9.65  11.54      22.22    62.75   8.20
World          6.25      9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         296          2     1          0        0     17
Electronic        1         91     0          1        4      3
Jazz              0          2    19          0        0      5
MetalPunk         0          2     1         34       20      8
PopRock           2         13     4          8       71      8
World            21          4     1          2        7     81
Total           320        114    26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       92.50      1.75   3.85       0.00     0.00  13.93
Electronic     0.31     79.82   0.00       2.22     3.92   2.46
Jazz           0.00      1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00      1.75   3.85      75.56    19.61   6.56
PopRock        0.63     11.40  15.38      17.78    69.61   6.56
World          6.56      3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic         300          2     0          0        0      8
Electronic        2         95     0          2        7      9
Jazz              1          1    20          0        0      0
MetalPunk         0          0     0         35       10      1
PopRock           1         10     3          7       79     11
World            16          6     3          1        6     93
Total           320        114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       93.75      1.75   0.00       0.00     0.00   6.56
Electronic     0.63     83.33   0.00       4.44     6.86   7.38
Jazz           0.31      0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00      0.00   0.00      77.78     9.80   0.82
PopRock        0.31      8.77  11.54      15.56    77.45   9.02
World          5.00      5.26  11.54       2.22     5.88  76.23
Conventional methods use the energy of each modulation subband as the feature value, whereas we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) features

Feature Set | MSCs & MSVs | MSE
SMMFCC1 | 77.50 | 72.02
SMMFCC2 | 70.64 | 69.82
SMMFCC3 | 80.38 | 79.15
SMOSC1 | 79.15 | 77.50
SMOSC2 | 68.59 | 70.51
SMOSC3 | 81.34 | 80.11
SMASE1 | 77.78 | 76.41
SMASE2 | 71.74 | 71.06
SMASE3 | 81.21 | 79.15
SMMFCC1+SMOSC1+SMASE1 | 84.64 | 85.08
SMMFCC2+SMOSC2+SMASE2 | 78.60 | 79.01
SMMFCC3+SMOSC3+SMASE3 | 85.32 | 85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. Int. Conf. on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Trans. on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th Int. Conf. on Digital Audio Effects, pp. 8-11, Sep. 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, Jun. 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, Mar. 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio features," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, Nov. 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, Sep. 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Trans. on Signal Processing, vol. 52, no. 10, pp. 3023-3035, Oct. 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), pp. 1085-1088, Jul. 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-668, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29
Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15
MetalPunk 1 5 1 35 24 7
PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54
Total 320 114 26 45 102 122

(c) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 86.56 0.00 0.00 0.00 1.96 23.77
Electronic 0.00 72.81 0.00 2.22 4.90 1.64
Jazz 2.81 2.63 65.38 2.22 1.96 12.30
MetalPunk 0.31 4.39 3.85 77.78 23.53 5.74
PopRock 0.63 11.40 3.85 17.78 55.88 12.30
World 9.69 8.77 26.92 0.00 11.76 44.26
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18
Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10
MetalPunk 2 2 0 38 21 2
PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77
Total 320 114 26 45 102 122

(d) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 90.31 4.39 0.00 0.00 2.94 14.75
Electronic 0.00 78.07 0.00 4.44 3.92 3.28
Jazz 0.63 2.63 73.08 0.00 0.98 8.20
MetalPunk 0.63 1.75 0.00 84.44 20.59 1.64
PopRock 0.00 10.53 19.23 8.89 59.80 9.02
World 8.44 2.63 7.69 2.22 11.76 63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors. SMMFCC3,
SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC,
OSC, and NASE. Comparing this table with Tables 31 and 33, we can see that
the combined feature vectors achieve better classification performance than each
individual row-based or column-based feature vector. In particular, the proposed
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set CA (%)
SMMFCC3 80.38
SMOSC3 81.34
SMASE3 81.21
SMMFCC3+SMOSC3+SMASE3 85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
Conventional methods use the energy of each modulation subband as the
feature value. In contrast, we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature values. Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives
better performance than the conventional energy-based features when the row-based and
column-based modulation spectral feature vectors are combined. In this table,
SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based,
column-based, and combined feature vectors derived from modulation spectral
analysis of MFCC; the SMOSC and SMASE sets are defined analogously.
Table 37 Comparison of the averaged classification accuracy (%) of the MSC&MSV features and the modulation subband energy (MSE) for each feature set

Feature Set MSCs & MSVs MSE
SMMFCC1 77.50 72.02
SMMFCC2 70.64 69.82
SMMFCC3 80.38 79.15
SMOSC1 79.15 77.50
SMOSC2 68.59 70.51
SMOSC3 81.34 80.11
SMASE1 77.78 76.41
SMASE2 71.74 71.06
SMASE3 81.21 79.15
SMMFCC1+SMOSC1+SMASE1 84.64 85.08
SMMFCC2+SMOSC2+SMASE2 78.60 79.01
SMMFCC3+SMOSC3+SMASE3 85.32 85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification. Long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value. For each spectral/cepstral feature set, a
modulation spectrogram is generated by collecting the modulation spectra of
all corresponding feature values. Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically spaced
modulation subband. Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features. The music database employed
in the ISMIR2004 Audio Description Contest, in which all music tracks are classified
into six classes, was used for performance comparison. When the modulation spectral
features of MFCC, OSC, and NASE are combined, the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest.
References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, issue 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, issue 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, issue 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.
Fig 23 The flowchart for computing NASE: the input signal is framed and windowed, an FFT is applied to each frame, and the normalized audio spectral envelope (NASE) is obtained by subband decomposition of the resulting spectrum

Fig 24 MPEG-7 octave-based subband decomposition with spectral resolution r = 1/2: sixteen half-octave subbands between loEdge = 62.5 Hz and hiEdge = 16 kHz (edges at 62.5, 88.4, 125, 176.8, ..., 16000 Hz), plus one coefficient below loEdge and one above hiEdge
Table 23 The range of each normalized audio spectral envelope band-pass filter

Filter number  Frequency interval (Hz)
0   (0, 62]
1   (62, 88]
2   (88, 125]
3   (125, 176]
4   (176, 250]
5   (250, 353]
6   (353, 500]
7   (500, 707]
8   (707, 1000]
9   (1000, 1414]
10  (1414, 2000]
11  (2000, 2828]
12  (2828, 4000]
13  (4000, 5656]
14  (5656, 8000]
15  (8000, 11313]
16  (11313, 16000]
17  (16000, 22050]
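The band edges in Table 23 follow the MPEG-7 half-octave spacing of Fig 24 (loEdge = 62.5 Hz, hiEdge = 16 kHz, r = 1/2), with the table rounding the edges to integer Hz. The short numpy sketch below is only an illustration of that spacing, not part of the thesis implementation; the variable names and the rounding assumption are mine.

```python
# Sketch: reproduce the half-octave band edges behind Table 23.
# Assumes loEdge = 62.5 Hz, hiEdge = 16 kHz and resolution r = 1/2 octave,
# with one extra filter below loEdge and one above hiEdge (up to 22.05 kHz).
import numpy as np

lo_edge, hi_edge, r = 62.5, 16000.0, 0.5
n_bands = int(np.log2(hi_edge / lo_edge) / r)             # 16 in-band filters
edges = lo_edge * 2.0 ** (r * np.arange(n_bands + 1))     # 62.5 ... 16000 Hz

intervals = [(0.0, edges[0])]                              # filter 0: (0, 62.5]
intervals += [(edges[k], edges[k + 1]) for k in range(n_bands)]
intervals.append((edges[-1], 22050.0))                     # filter 17: above hiEdge

for k, (f_lo, f_hi) in enumerate(intervals):
    print(f"filter {k:2d}: ({f_lo:.1f}, {f_hi:.1f}] Hz")
```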
214 Modulation Spectral Analysis
MFCC, OSC, and NASE capture only short-term, frame-based characteristics of
audio signals. In order to capture the time-varying behavior of music signals, we
employ modulation spectral analysis on the MFCC, OSC, and NASE trajectories to observe the
variations of the sound over time.
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let MFCC_i[l], 0 ≤ l < L, be the l-th MFCC feature value of the i-th frame.
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W:

M_t(m, l) = \sum_{n=0}^{W-1} MFCC_{t \times (W/2) + n}[l] \, e^{-j 2 \pi m n / W},   0 ≤ m < W, 0 ≤ l < L        (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study, W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^{MFCC}(m, l) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, l) \right|,   0 ≤ m < W, 0 ≤ l < L        (26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{MFCC}(j, l) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{MFCC}(m, l)        (27)

MSV^{MFCC}(j, l) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{MFCC}(m, l)        (28)
where Φ_{j,l} and Φ_{j,h} are respectively the low and the high modulation frequency
index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{MFCC}(j, l) = MSP^{MFCC}(j, l) - MSV^{MFCC}(j, l)        (29)
As a result, all MSCs (or MSVs) will form an L×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MMFCC is 2×20×8 = 320.
Fig 25 the flowchart for extracting MMFCC
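As an illustration of Steps 2 and 3, the numpy sketch below computes the averaged modulation spectrogram of a matrix of frame-level feature trajectories and then the per-subband contrast and valley values. It is a simplified reading of Eqs. (25)-(29) rather than the author's code: the array names (`features`, `mod_spec`), the use of only the first W/2 modulation bins (all subbands in Table 24 lie below index 256), and the (L, J) matrix orientation are assumptions made here. The same routine applies unchanged to the OSC and NASE trajectories of Sections 2.1.4.2 and 2.1.4.3.

```python
# Sketch of modulation spectral analysis for one feature set.
# `features` is an (I, L) array holding one L-dimensional feature vector per frame.
# W = 512 frames per texture window with 50% overlap, as stated above.
import numpy as np

# modulation-subband boundaries in modulation-frequency-index units, from Table 24
SUBBAND_EDGES = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_spectrogram(features, W=512):
    """Time-averaged magnitude modulation spectrum, shape (W//2, L)."""
    I, L = features.shape
    hop = W // 2                                      # 50% overlap between texture windows
    mags = []
    for start in range(0, I - W + 1, hop):
        win = features[start:start + W, :]            # one texture window, (W, L)
        M_t = np.abs(np.fft.fft(win, axis=0))         # FFT along each time trajectory
        mags.append(M_t[:W // 2, :])                  # keep modulation bins m = 0 .. W/2-1
    return np.mean(mags, axis=0)                      # average over all texture windows

def msc_msv(mod_spec, subbands=SUBBAND_EDGES):
    """(L, J) matrices of modulation spectral contrasts and valleys (Eqs. 27-29)."""
    L, J = mod_spec.shape[1], len(subbands)
    msc, msv = np.zeros((L, J)), np.zeros((L, J))
    for j, (lo, hi) in enumerate(subbands):
        band = mod_spec[lo:hi, :]                     # bins of the j-th modulation subband
        msv[:, j] = band.min(axis=0)                  # modulation spectral valley
        msc[:, j] = band.max(axis=0) - msv[:, j]      # contrast = peak - valley
    return msc, msv

# usage: `mfcc` holds one 20-dimensional vector per frame
# mod_spec = modulation_spectrogram(mfcc)             # (256, 20)
# msc, msv = msc_msv(mod_spec)                        # each (20, 8)
```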
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let OSC_i[d], 0 ≤ d < D, be the d-th OSC feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} OSC_{t \times (W/2) + n}[d] \, e^{-j 2 \pi m n / W},   0 ≤ m < W, 0 ≤ d < D        (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study, W is 512, which is about 6 seconds, with 50% overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^{OSC}(m, d) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, d) \right|,   0 ≤ m < W, 0 ≤ d < D        (31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
MSP^{OSC}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)        (32)

MSV^{OSC}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{OSC}(m, d)        (33)
where Φ_{j,l} and Φ_{j,h} are respectively the low and the high modulation frequency
index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{OSC}(j, d) = MSP^{OSC}(j, d) - MSV^{OSC}(j, d)        (34)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE feature value of the i-th frame. The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t \times (W/2) + n}[d] \, e^{-j 2 \pi m n / W},   0 ≤ m < W, 0 ≤ d < D        (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study, W is 512, which is about 6 seconds, with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} \left| M_t(m, d) \right|,   0 ≤ m < W, 0 ≤ d < D        (36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands (see Table 24).
In the study, the number of modulation subbands is 8 (J = 8). The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)        (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)        (38)
where Φ_{j,l} and Φ_{j,h} are respectively the low and the high modulation frequency
index of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)        (39)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the
modulation spectral contrast information. Therefore the feature dimension of
MASE is 2×19×8 = 304.
Fig 27 The flowchart for extracting MASE: the music signal is framed, NASE features are extracted for each frame, a DFT is applied to each feature trajectory within every texture window, the magnitude modulation spectra are averaged over the texture windows, and the contrast/valley determination yields the MSC and MSV matrices
Table 24 Frequency interval of each modulation subband
Filter number  Modulation frequency index range  Modulation frequency interval (Hz)
0   [0, 2)      [0, 0.33)
1   [2, 4)      [0.33, 0.66)
2   [4, 8)      [0.66, 1.32)
3   [8, 16)     [1.32, 2.64)
4   [16, 32)    [2.64, 5.28)
5   [32, 64)    [5.28, 10.56)
6   [64, 128)   [10.56, 21.12)
7   [128, 256)  [21.12, 42.24]
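For orientation, the Hz intervals in Table 24 are simply the index ranges scaled by the modulation-frequency resolution, i.e. the feature frame rate divided by W. The frame rate is not stated in this section, so the value used in the sketch below (about 84.5 frames per second, a frame hop of roughly 11.8 ms) is inferred from the table itself and should be treated as an assumption.

```python
# Sketch: how the Hz intervals in Table 24 follow from the index ranges.
frame_rate = 84.48   # frames per second (assumed, inferred from Table 24)
W = 512              # texture-window length in frames

subband_indices = [(0, 2), (2, 4), (4, 8), (8, 16),
                   (16, 32), (32, 64), (64, 128), (128, 256)]
for j, (m_lo, m_hi) in enumerate(subband_indices):
    f_lo = m_lo * frame_rate / W          # modulation frequency of the lower index
    f_hi = m_hi * frame_rate / W          # modulation frequency of the upper index
    print(f"subband {j}: [{f_lo:.2f}, {f_hi:.2f}) Hz")
```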
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral
feature value at different modulation frequencies, which reflects the beat interval of a
music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband across different spectral/cepstral feature values (see Fig 29).
To reduce the dimension of the feature space, the mean and standard deviation along
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows:

\mu^{MFCC}_{MSC-row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)        (40)

\sigma^{MFCC}_{MSC-row}(l) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{MFCC}(j, l) - \mu^{MFCC}_{MSC-row}(l) \big)^2 \Big)^{1/2}        (41)

\mu^{MFCC}_{MSV-row}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)        (42)

\sigma^{MFCC}_{MSV-row}(l) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{MFCC}(j, l) - \mu^{MFCC}_{MSV-row}(l) \big)^2 \Big)^{1/2}        (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as

f^{MFCC}_{row} = [\mu^{MFCC}_{MSC-row}(0), \sigma^{MFCC}_{MSC-row}(0), \mu^{MFCC}_{MSV-row}(0), \sigma^{MFCC}_{MSV-row}(0), \ldots, \mu^{MFCC}_{MSC-row}(L-1), \sigma^{MFCC}_{MSC-row}(L-1), \mu^{MFCC}_{MSV-row}(L-1), \sigma^{MFCC}_{MSV-row}(L-1)]^T        (44)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

\mu^{MFCC}_{MSC-col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)        (45)

\sigma^{MFCC}_{MSC-col}(j) = \Big( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSC^{MFCC}(j, l) - \mu^{MFCC}_{MSC-col}(j) \big)^2 \Big)^{1/2}        (46)

\mu^{MFCC}_{MSV-col}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)        (47)

\sigma^{MFCC}_{MSV-col}(j) = \Big( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSV^{MFCC}(j, l) - \mu^{MFCC}_{MSV-col}(j) \big)^2 \Big)^{1/2}        (48)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f^{MFCC}_{col} = [\mu^{MFCC}_{MSC-col}(0), \sigma^{MFCC}_{MSC-col}(0), \mu^{MFCC}_{MSV-col}(0), \sigma^{MFCC}_{MSV-col}(0), \ldots, \mu^{MFCC}_{MSC-col}(J-1), \sigma^{MFCC}_{MSC-col}(J-1), \mu^{MFCC}_{MSV-col}(J-1), \sigma^{MFCC}_{MSV-col}(J-1)]^T        (49)
If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4L+4J) can be obtained:

f^{MFCC} = [(f^{MFCC}_{row})^T, (f^{MFCC}_{col})^T]^T        (50)
In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and the column-based modulation spectral feature vectors results in a
feature vector of length 4L+4J. That is, the overall
feature dimension of SMMFCC is 80+32 = 112.
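A compact numpy sketch of the aggregation in Eqs. (40)-(50) is given below, assuming an L×J MSC matrix and an L×J MSV matrix for one feature set (as produced by the earlier sketch). The entries are grouped by statistic rather than interleaved per coefficient as in Eqs. (44) and (49); that ordering choice is mine and does not affect the later distance computations.

```python
# Sketch of Eqs. (40)-(50): mean/std aggregation of the MSC and MSV matrices.
# `msc` and `msv` are (L, J) arrays for one spectral/cepstral feature set.
import numpy as np

def aggregate(msc, msv):
    # row-based (Eqs. 40-44): statistics over the J subbands for each feature value
    row = np.concatenate([msc.mean(axis=1), msc.std(axis=1),
                          msv.mean(axis=1), msv.std(axis=1)])      # length 4L
    # column-based (Eqs. 45-49): statistics over the L feature values for each subband
    col = np.concatenate([msc.mean(axis=0), msc.std(axis=0),
                          msv.mean(axis=0), msv.std(axis=0)])      # length 4J
    return np.concatenate([row, col])                              # Eq. (50): 4L + 4J values

# usage with the MMFCC matrices (L = 20, J = 8):
# f_mfcc = aggregate(msc, msv)    # 80 + 32 = 112 dimensions
```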
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows:

\mu^{OSC}_{MSC-row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)        (51)

\sigma^{OSC}_{MSC-row}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{OSC}(j, d) - \mu^{OSC}_{MSC-row}(d) \big)^2 \Big)^{1/2}        (52)

\mu^{OSC}_{MSV-row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)        (53)

\sigma^{OSC}_{MSV-row}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{OSC}(j, d) - \mu^{OSC}_{MSV-row}(d) \big)^2 \Big)^{1/2}        (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f^{OSC}_{row} = [\mu^{OSC}_{MSC-row}(0), \sigma^{OSC}_{MSC-row}(0), \mu^{OSC}_{MSV-row}(0), \sigma^{OSC}_{MSV-row}(0), \ldots, \mu^{OSC}_{MSC-row}(D-1), \sigma^{OSC}_{MSC-row}(D-1), \mu^{OSC}_{MSV-row}(D-1), \sigma^{OSC}_{MSV-row}(D-1)]^T        (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

\mu^{OSC}_{MSC-col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)        (56)

\sigma^{OSC}_{MSC-col}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{OSC}(j, d) - \mu^{OSC}_{MSC-col}(j) \big)^2 \Big)^{1/2}        (57)

\mu^{OSC}_{MSV-col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)        (58)

\sigma^{OSC}_{MSV-col}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{OSC}(j, d) - \mu^{OSC}_{MSV-col}(j) \big)^2 \Big)^{1/2}        (59)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f^{OSC}_{col} = [\mu^{OSC}_{MSC-col}(0), \sigma^{OSC}_{MSC-col}(0), \mu^{OSC}_{MSV-col}(0), \sigma^{OSC}_{MSV-col}(0), \ldots, \mu^{OSC}_{MSC-col}(J-1), \sigma^{OSC}_{MSC-col}(J-1), \mu^{OSC}_{MSV-col}(J-1), \sigma^{OSC}_{MSV-col}(J-1)]^T        (60)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

f^{OSC} = [(f^{OSC}_{row})^T, (f^{OSC}_{col})^T]^T        (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and the column-based modulation spectral feature vectors results in a
feature vector of length 4D+4J. That is, the overall
feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MASE can be computed as follows:

\mu^{NASE}_{MSC-row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)        (62)

\sigma^{NASE}_{MSC-row}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{NASE}(j, d) - \mu^{NASE}_{MSC-row}(d) \big)^2 \Big)^{1/2}        (63)

\mu^{NASE}_{MSV-row}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)        (64)

\sigma^{NASE}_{MSV-row}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{NASE}(j, d) - \mu^{NASE}_{MSV-row}(d) \big)^2 \Big)^{1/2}        (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as

f^{NASE}_{row} = [\mu^{NASE}_{MSC-row}(0), \sigma^{NASE}_{MSC-row}(0), \mu^{NASE}_{MSV-row}(0), \sigma^{NASE}_{MSV-row}(0), \ldots, \mu^{NASE}_{MSC-row}(D-1), \sigma^{NASE}_{MSC-row}(D-1), \mu^{NASE}_{MSV-row}(D-1), \sigma^{NASE}_{MSV-row}(D-1)]^T        (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows:

\mu^{NASE}_{MSC-col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)        (67)

\sigma^{NASE}_{MSC-col}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{NASE}(j, d) - \mu^{NASE}_{MSC-col}(j) \big)^2 \Big)^{1/2}        (68)

\mu^{NASE}_{MSV-col}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)        (69)

\sigma^{NASE}_{MSV-col}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{NASE}(j, d) - \mu^{NASE}_{MSV-col}(j) \big)^2 \Big)^{1/2}        (70)

Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as

f^{NASE}_{col} = [\mu^{NASE}_{MSC-col}(0), \sigma^{NASE}_{MSC-col}(0), \mu^{NASE}_{MSV-col}(0), \sigma^{NASE}_{MSV-col}(0), \ldots, \mu^{NASE}_{MSC-col}(J-1), \sigma^{NASE}_{MSC-col}(J-1), \mu^{NASE}_{MSV-col}(J-1), \sigma^{NASE}_{MSV-col}(J-1)]^T        (71)

If the row-based modulation spectral feature vector and the column-based
modulation spectral feature vector are combined together, a larger feature vector of
size (4D+4J) can be obtained:

f^{NASE} = [(f^{NASE}_{row})^T, (f^{NASE}_{col})^T]^T        (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the
column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the
row-based and the column-based modulation spectral feature vectors results in a
feature vector of length 4D+4J. That is, the overall
feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based modulation spectral feature values: for each feature dimension, the mean μ_row and standard deviation σ_row are computed across the modulation subbands of the MSC and MSV matrices

Fig 29 The column-based modulation spectral feature values: for each modulation subband, the mean μ_col and standard deviation σ_col are computed across the feature dimensions of the MSC and MSV matrices
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}        (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th
music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c
is the number of training music signals belonging to the c-th music genre. Since the
dynamic ranges of the various feature values may be different, a linear normalization is
applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)},   1 ≤ c ≤ C        (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th
representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the
maximum and the minimum of the m-th feature value over all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m),   f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)        (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre.
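A minimal sketch of Eqs. (73)-(75) is given below, assuming the training feature vectors are stacked in an (N, H) numpy array with integer genre labels; the array names and the guard against constant feature dimensions are additions for illustration, not part of the formulation above.

```python
# Sketch of Eqs. (73)-(75): per-genre mean feature vectors and min-max normalization.
import numpy as np

def genre_templates(train_feats, labels, n_classes):
    """train_feats: (N, H) array; labels: (N,) integer array of genre ids."""
    f_min = train_feats.min(axis=0)                      # Eq. (75)
    f_max = train_feats.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)   # guard constant features (assumption)
    templates = []
    for c in range(n_classes):
        f_c = train_feats[labels == c].mean(axis=0)      # Eq. (73)
        templates.append((f_c - f_min) / span)           # Eq. (74)
    return np.vstack(templates), f_min, span             # span/f_min reused for test vectors
```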
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy in a lower-dimensional feature vector space. LDA deals with the
discrimination between various classes rather than the representation of all classes.
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance. In LDA, an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in
order to provide higher discriminability among various music classes.
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T        (76)
where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class
c, C is the total number of music classes, and N_c is the number of training vectors
labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T        (77)
where \bar{x} is the mean vector of all training vectors. The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion J_F, defined as the ratio of between-class scatter to within-class scatter:
J_F(A) = tr\big( (A^T S_W A)^{-1} (A^T S_B A) \big)        (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ^{-1/2}:

x_w = (ΦΛ^{-1/2})^T x        (79)
It can be shown that the whitened within-class scatter matrix
S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors,
becomes an identity matrix I. Thus the whitened between-class scatter matrix
S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A
transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w.
Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors
corresponding to the (C-1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix
A_WLDA is defined as
A_{WLDA} = ΦΛ^{-1/2} Ψ        (80)
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
y = A_{WLDA}^T x        (81)
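The whitened LDA transformation of Eqs. (76)-(81) can be sketched as follows. This is an illustrative reading, not the author's implementation; in particular, the small ridge added to S_W before the eigendecomposition is an assumption for numerical stability and is not part of the derivation above.

```python
# Sketch of Eqs. (76)-(81): whitened LDA.  X is (N, H), y holds integer class labels.
import numpy as np

def whitened_lda(X, y, n_classes):
    H = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((H, H))
    S_B = np.zeros((H, H))
    for c in range(n_classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)                                    # Eq. (76)
        S_B += len(Xc) * np.outer(mc - overall_mean, mc - overall_mean)   # Eq. (77)

    evals, Phi = np.linalg.eigh(S_W + 1e-8 * np.eye(H))   # S_W = Phi Lambda Phi^T (ridge assumed)
    whiten = Phi @ np.diag(evals ** -0.5)                 # Phi Lambda^(-1/2), Eq. (79)
    S_Bw = whiten.T @ S_B @ whiten                        # whitened between-class scatter
    bvals, Psi = np.linalg.eigh(S_Bw)
    top = np.argsort(bvals)[::-1][:n_classes - 1]         # keep the C-1 leading eigenvectors
    return whiten @ Psi[:, top]                           # A_WLDA, Eq. (80)

# usage: Y = X @ whitened_lda(X, y, 6)   # Eq. (81), applied to row-stacked feature vectors
```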
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA
transformed feature vector. In this study, the nearest centroid classifier is used for
music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}        (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)        (83)
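Classification therefore reduces to Eqs. (82)-(83): average the LDA-transformed training vectors of each genre and pick the genre whose centroid is nearest in Euclidean distance. A minimal sketch, assuming the transformed vectors are stacked row-wise:

```python
# Sketch of Eqs. (82)-(83): nearest-centroid classification in the whitened LDA space.
import numpy as np

def centroids(train_Y, labels, n_classes):
    """Per-genre mean of the LDA-transformed training vectors, Eq. (82)."""
    return np.vstack([train_Y[labels == c].mean(axis=0) for c in range(n_classes)])

def classify(test_Y, genre_centroids):
    """Index of the nearest centroid for each test vector, Eq. (83)."""
    d = np.linalg.norm(test_Y[:, None, :] - genre_centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```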
24
Table 23 The range of each Normalized audio spectral evenlope band-pass filter
Filter number Frequency interval (Hz) 0 (0 62] 1 (62 88] 2 (88 125] 3 (125 176] 4 (176 250] 5 (250 353] 6 (353 500] 7 (500 707] 8 (707 1000] 9 (1000 1414] 10 (1414 2000] 11 (2000 2828] 12 (2828 4000] 13 (4000 5656] 14 (5656 8000] 15 (8000 11313] 16 (11313 16000] 17 (16000 22050]
214 Modulation Spectral Analysis
MFCC OSC and NASE capture only short-term frame-based characteristics of
audio signals In order to capture the time-varying behavior of the music signals We
employ modulation spectral analysis on MFCC OSC and NASE to observe the
variations of the sound
2141 Modulation Spectral Contrast of MFCC (MMFCC)
To observe the time-varying behavior of MFCC modulation spectral analysis is
applied on MFCC trajectories Fig 25 shows the flowchart for extracting MMFCC
and the detailed steps will be described below
Step 1 Framing and MFCC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the MFCC coefficients of each frame
Step 2 Modulation Spectrum Analysis
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{OSC}(j, d) = MSP^{OSC}(j, d) − MSV^{OSC}(j, d)    (34)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MOSC is 2×20×8 = 320.
Fig 26 the flowchart for extracting MOSC
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, be the d-th NASE of the i-th frame. The modulation spectrogram is obtained by applying FFT independently on each feature value along the time trajectory within a texture window of length W:

M_t(m, d) = \sum_{n=0}^{W-1} NASE_{t\times(W/2)+n}[d]\, e^{-j2\pi nm/W}, \quad 0 \le m < W,\ 0 \le d < D    (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50% overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
M^{NASE}(m, d) = \frac{1}{T}\sum_{t=1}^{T} \left| M_t(m, d) \right|, \quad 0 \le m < W,\ 0 \le d < D    (36)
where T is the total number of texture windows in the music track
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands (see Table 24). In the study, the number of modulation subbands is 8 (J = 8) and the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:
MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)    (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)    (38)
where Φ_{j,l} and Φ_{j,h} are respectively the low and high modulation frequency indices of the j-th modulation subband, 0 ≤ j < J.
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
MSC^{NASE}(j, d) = MSP^{NASE}(j, d) − MSV^{NASE}(j, d)    (39)

As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore the feature dimension of MASE is 2×19×8 = 304.
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband

Filter number | Modulation frequency index range | Modulation frequency interval (Hz)
0 | [0, 2) | [0, 0.33)
1 | [2, 4) | [0.33, 0.66)
2 | [4, 8) | [0.66, 1.32)
3 | [8, 16) | [1.32, 2.64)
4 | [16, 32) | [2.64, 5.28)
5 | [32, 64) | [5.28, 10.56)
6 | [64, 128) | [10.56, 21.12)
7 | [128, 256) | [21.12, 42.24]
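The Hz boundaries in Table 24 follow from the usual DFT relation: modulation index m corresponds to m·R/W Hz, where R is the frame rate of the feature sequence. The short check below assumes R ≈ 84.5 frames per second (a value inferred from the table itself, not stated explicitly here) and reproduces the listed intervals.

```python
W = 512
R = 84.48            # feature frame rate (frames/s) implied by 0.33 Hz at index 2 -- an assumption
edges = [0, 2, 4, 8, 16, 32, 64, 128, 256]
for j in range(8):
    lo_hz, hi_hz = edges[j] * R / W, edges[j + 1] * R / W
    print(f"subband {j}: [{lo_hz:.2f}, {hi_hz:.2f}) Hz")   # matches the intervals of Table 24
```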
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value across different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u_{MSC-row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{MFCC}(j, l)    (40)

\sigma_{MSC-row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{MFCC}(j, l) - u_{MSC-row}^{MFCC}(l)\right)^2\right)^{1/2}    (41)

u_{MSV-row}^{MFCC}(l) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{MFCC}(j, l)    (42)

\sigma_{MSV-row}^{MFCC}(l) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{MFCC}(j, l) - u_{MSV-row}^{MFCC}(l)\right)^2\right)^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [u_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), u_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), \ldots, u_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), u_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T    (44)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSC^{MFCC}(j, l)    (45)

\sigma_{MSC-col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSC^{MFCC}(j, l) - u_{MSC-col}^{MFCC}(j)\right)^2\right)^{1/2}    (46)

u_{MSV-col}^{MFCC}(j) = \frac{1}{L}\sum_{l=0}^{L-1} MSV^{MFCC}(j, l)    (47)

\sigma_{MSV-col}^{MFCC}(j) = \left(\frac{1}{L}\sum_{l=0}^{L-1}\left(MSV^{MFCC}(j, l) - u_{MSV-col}^{MFCC}(j)\right)^2\right)^{1/2}    (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{MFCC} = [u_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), u_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), \ldots, u_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), u_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T    (49)
If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)
In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
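The aggregation of Eqs. (40)-(50) reduces to taking the mean and standard deviation of the MSC and MSV matrices along each axis and concatenating the results. A minimal sketch is given below (assuming L×J numpy arrays msc and msv as returned by the earlier contrast/valley sketch; names are illustrative).

```python
import numpy as np

def aggregate(msc, msv):
    """msc, msv : (L, J) modulation spectral contrast / valley matrices
    (rows = cepstral/spectral feature index, columns = modulation subband).
    Returns the (4L + 4J)-dimensional aggregated feature vector."""
    # Row-based statistics: mean/std over the J subbands for each feature value (Eqs. 40-43)
    f_row = np.column_stack([msc.mean(axis=1), msc.std(axis=1),
                             msv.mean(axis=1), msv.std(axis=1)]).ravel()   # size 4L, Eq. (44)
    # Column-based statistics: mean/std over the L feature values per subband (Eqs. 45-48)
    f_col = np.column_stack([msc.mean(axis=0), msc.std(axis=0),
                             msv.mean(axis=0), msv.std(axis=0)]).ravel()   # size 4J, Eq. (49)
    return np.concatenate([f_row, f_col])                                  # Eq. (50), size 4L + 4J
```

With L = 20 and J = 8 this yields the 112-dimensional SMMFCC vector; the same routine applies to MOSC (D = 20) and MASE (D = 19) described next.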
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u_{MSC-row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{OSC}(j, d)    (51)

\sigma_{MSC-row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{OSC}(j, d) - u_{MSC-row}^{OSC}(d)\right)^2\right)^{1/2}    (52)

u_{MSV-row}^{OSC}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{OSC}(j, d)    (53)

\sigma_{MSV-row}^{OSC}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{OSC}(j, d) - u_{MSV-row}^{OSC}(d)\right)^2\right)^{1/2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [u_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), u_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), \ldots, u_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), u_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{OSC}(j, d)    (56)

\sigma_{MSC-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{OSC}(j, d) - u_{MSC-col}^{OSC}(j)\right)^2\right)^{1/2}    (57)

u_{MSV-col}^{OSC}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{OSC}(j, d)    (58)

\sigma_{MSV-col}^{OSC}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{OSC}(j, d) - u_{MSV-col}^{OSC}(j)\right)^2\right)^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [u_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots, u_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), u_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

u_{MSC-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSC^{NASE}(j, d)    (62)

\sigma_{MSC-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSC^{NASE}(j, d) - u_{MSC-row}^{NASE}(d)\right)^2\right)^{1/2}    (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J}\sum_{j=0}^{J-1} MSV^{NASE}(j, d)    (64)

\sigma_{MSV-row}^{NASE}(d) = \left(\frac{1}{J}\sum_{j=0}^{J-1}\left(MSV^{NASE}(j, d) - u_{MSV-row}^{NASE}(d)\right)^2\right)^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [u_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots, u_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), u_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSC^{NASE}(j, d)    (67)

\sigma_{MSC-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSC^{NASE}(j, d) - u_{MSC-col}^{NASE}(j)\right)^2\right)^{1/2}    (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D}\sum_{d=0}^{D-1} MSV^{NASE}(j, d)    (69)

\sigma_{MSV-col}^{NASE}(j) = \left(\frac{1}{D}\sum_{d=0}^{D-1}\left(MSV^{NASE}(j, d) - u_{MSV-col}^{NASE}(j)\right)^2\right)^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [u_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots, u_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), u_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 the row-based modulation spectral feature values
Fig 29 the column-based modulation spectral feature values
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
\bar{f}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, \hat{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
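A minimal sketch of this linear (min-max) normalization, assuming the training feature vectors are stacked row-wise in a matrix (variable names are illustrative):

```python
import numpy as np

def fit_minmax(train_features):
    """train_features : (N, M) matrix of all training feature vectors."""
    return train_features.min(axis=0), train_features.max(axis=0)   # f_min(m), f_max(m), Eq. (75)

def normalize(f, f_min, f_max, eps=1e-12):
    """Linearly map each feature value into [0, 1], Eq. (74); eps guards constant features."""
    return (f - f_min) / (f_max - f_min + eps)
```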
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)

where \bar{x} is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
J_F(A) = tr\left((A^T S_W A)^{-1}(A^T S_B A)\right)    (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is integrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ^{-1/2}:
x_w = (ΦΛ^{-1/2})^T x    (79)
It can be shown that the whitened within-class scatter matrix S_{W_w} = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, will become an identity matrix I. Thus the whitened between-class scatter matrix S_{B_w} = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B_w}. Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix AWLDA is defined as
AWLDA = ΦΛ^{-1/2} Ψ    (80)
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
y = AWLDA^T x    (81)
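The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows. This is a simplified numpy illustration, not the author's implementation; the small constant added to the eigenvalues is for numerical stability only and is not part of the original formulation.

```python
import numpy as np

def whitened_lda(X, labels, num_classes):
    """X : (N, H) training matrix, labels : (N,) class indices in [0, C).
    Returns the H x (C-1) whitened LDA transformation matrix A_WLDA."""
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)                      # within-class scatter, Eq. (76)
        diff = (mean_c - mean_all)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)                        # between-class scatter, Eq. (77)
    # Whitening: Sw Phi = Phi Lambda  ->  x_w = (Phi Lambda^{-1/2})^T x, Eq. (79)
    eigval, Phi = np.linalg.eigh(Sw)
    white = Phi @ np.diag(1.0 / np.sqrt(eigval + 1e-10))
    Sb_w = white.T @ Sb @ white                                    # whitened between-class scatter
    eigval_b, Psi = np.linalg.eigh(Sb_w)
    Psi = Psi[:, np.argsort(eigval_b)[::-1][:num_classes - 1]]     # top C-1 eigenvectors
    return white @ Psi                                             # A_WLDA = Phi Lambda^{-1/2} Psi, Eq. (80)

# Projection of a normalized feature vector x, Eq. (81):  y = A_WLDA.T @ x
```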
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denote the whitened LDA
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 ≤ c ≤ C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c}\sum_{n=1}^{N_c} y_{c,n}    (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
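The classification phase then reduces to a nearest-centroid rule in the LDA-transformed space; a minimal sketch (illustrative names) is given below.

```python
import numpy as np

def train_centroids(Y, labels, num_classes):
    """Y : (N, h) whitened-LDA-transformed training vectors."""
    return np.array([Y[labels == c].mean(axis=0) for c in range(num_classes)])   # Eq. (82)

def classify(y, centroids):
    """Return the genre index with minimum Euclidean distance to y, Eq. (83)."""
    return int(np.argmin(np.linalg.norm(centroids - y, axis=1)))
```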
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of JazzBlue, 45/45 tracks of MetalPunk, 101/102 tracks of RockPop, and 122/122 tracks of the World music genre.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
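Equation (84) weights each per-genre accuracy by that genre's share of the test set, which is equivalent to the fraction of all 729 test tracks classified correctly. As a check, plugging in the diagonal counts of the confusion matrix of the best combined feature set (Table 36 (d), shown later) reproduces the reported 85.32%:

```python
# Overall accuracy CA = sum_c P_c * CA_c, Eq. (84), using Table 36 (d) diagonal counts.
test_counts    = {"Classical": 320, "Electronic": 114, "JazzBlue": 26,
                  "MetalPunk": 45, "RockPop": 102, "World": 122}
correct_counts = {"Classical": 300, "Electronic": 95, "JazzBlue": 20,
                  "MetalPunk": 35, "RockPop": 79, "World": 93}
total = sum(test_counts.values())                       # 729 test tracks
ca = sum((test_counts[g] / total) * (correct_counts[g] / test_counts[g])
         for g in test_counts)
print(f"overall CA = {100 * ca:.2f}%")                  # 85.32
```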
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA, %) for row-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC1 | 77.50
SMOSC1 | 79.15
SMASE1 | 77.78
SMMFCC1+SMOSC1+SMASE1 | 84.64
Table 32 Confusion matrices of row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1 (for each case, the first matrix gives track counts and the second the corresponding percentages; columns are the actual genres)

(a) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 275 | 0 | 2 | 0 | 1 | 19
Electronic | 0 | 91 | 0 | 1 | 7 | 6
Jazz | 6 | 0 | 18 | 0 | 0 | 4
MetalPunk | 2 | 3 | 0 | 36 | 20 | 4
PopRock | 4 | 12 | 5 | 8 | 70 | 14
World | 33 | 8 | 1 | 0 | 4 | 75
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 85.94 | 0.00 | 7.69 | 0.00 | 0.98 | 15.57
Electronic | 0.00 | 79.82 | 0.00 | 2.22 | 6.86 | 4.92
Jazz | 1.88 | 0.00 | 69.23 | 0.00 | 0.00 | 3.28
MetalPunk | 0.63 | 2.63 | 0.00 | 80.00 | 19.61 | 3.28
PopRock | 1.25 | 10.53 | 19.23 | 17.78 | 68.63 | 11.48
World | 10.31 | 7.02 | 3.85 | 0.00 | 3.92 | 61.48

(b) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 292 | 1 | 1 | 0 | 2 | 10
Electronic | 1 | 89 | 1 | 2 | 11 | 11
Jazz | 4 | 0 | 19 | 1 | 1 | 6
MetalPunk | 0 | 5 | 0 | 32 | 21 | 3
PopRock | 0 | 13 | 3 | 10 | 61 | 8
World | 23 | 6 | 2 | 0 | 6 | 84
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 91.25 | 0.88 | 3.85 | 0.00 | 1.96 | 8.20
Electronic | 0.31 | 78.07 | 3.85 | 4.44 | 10.78 | 9.02
Jazz | 1.25 | 0.00 | 73.08 | 2.22 | 0.98 | 4.92
MetalPunk | 0.00 | 4.39 | 0.00 | 71.11 | 20.59 | 2.46
PopRock | 0.00 | 11.40 | 11.54 | 22.22 | 59.80 | 6.56
World | 7.19 | 5.26 | 7.69 | 0.00 | 5.88 | 68.85

(c) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 286 | 3 | 1 | 0 | 3 | 18
Electronic | 0 | 87 | 1 | 1 | 9 | 5
Jazz | 5 | 4 | 17 | 0 | 0 | 9
MetalPunk | 0 | 4 | 1 | 36 | 18 | 4
PopRock | 1 | 10 | 3 | 7 | 68 | 13
World | 28 | 6 | 3 | 1 | 4 | 73
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 89.38 | 2.63 | 3.85 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 76.32 | 3.85 | 2.22 | 8.82 | 4.10
Jazz | 1.56 | 3.51 | 65.38 | 0.00 | 0.00 | 7.38
MetalPunk | 0.00 | 3.51 | 3.85 | 80.00 | 17.65 | 3.28
PopRock | 0.31 | 8.77 | 11.54 | 15.56 | 66.67 | 10.66
World | 8.75 | 5.26 | 11.54 | 2.22 | 3.92 | 59.84

(d) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 300 | 0 | 1 | 0 | 0 | 9
Electronic | 0 | 96 | 1 | 1 | 9 | 9
Jazz | 2 | 1 | 21 | 0 | 0 | 1
MetalPunk | 0 | 1 | 0 | 34 | 8 | 1
PopRock | 1 | 9 | 2 | 9 | 80 | 16
World | 17 | 7 | 1 | 1 | 5 | 86
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 93.75 | 0.00 | 3.85 | 0.00 | 0.00 | 7.38
Electronic | 0.00 | 84.21 | 3.85 | 2.22 | 8.82 | 7.38
Jazz | 0.63 | 0.88 | 80.77 | 0.00 | 0.00 | 0.82
MetalPunk | 0.00 | 0.88 | 0.00 | 75.56 | 7.84 | 0.82
PopRock | 0.31 | 7.89 | 7.69 | 20.00 | 78.43 | 13.11
World | 5.31 | 6.14 | 3.85 | 2.22 | 4.90 | 70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector gives the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA, %) for column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC2 | 70.64
SMOSC2 | 68.59
SMASE2 | 71.74
SMMFCC2+SMOSC2+SMASE2 | 78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 272 | 1 | 1 | 0 | 6 | 22
Electronic | 0 | 84 | 0 | 2 | 8 | 4
Jazz | 13 | 1 | 19 | 1 | 2 | 19
MetalPunk | 2 | 7 | 0 | 39 | 30 | 4
PopRock | 0 | 11 | 3 | 3 | 47 | 19
World | 33 | 10 | 3 | 0 | 9 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 85.00 | 0.88 | 3.85 | 0.00 | 5.88 | 18.03
Electronic | 0.00 | 73.68 | 0.00 | 4.44 | 7.84 | 3.28
Jazz | 4.06 | 0.88 | 73.08 | 2.22 | 1.96 | 15.57
MetalPunk | 0.63 | 6.14 | 0.00 | 86.67 | 29.41 | 3.28
PopRock | 0.00 | 9.65 | 11.54 | 6.67 | 46.08 | 15.57
World | 10.31 | 8.77 | 11.54 | 0.00 | 8.82 | 44.26

(b) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 262 | 2 | 0 | 0 | 3 | 33
Electronic | 0 | 83 | 0 | 1 | 9 | 6
Jazz | 17 | 1 | 20 | 0 | 6 | 20
MetalPunk | 1 | 5 | 0 | 33 | 21 | 2
PopRock | 0 | 17 | 4 | 10 | 51 | 10
World | 40 | 6 | 2 | 1 | 12 | 51
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 81.88 | 1.75 | 0.00 | 0.00 | 2.94 | 27.05
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 8.82 | 4.92
Jazz | 5.31 | 0.88 | 76.92 | 0.00 | 5.88 | 16.39
MetalPunk | 0.31 | 4.39 | 0.00 | 73.33 | 20.59 | 1.64
PopRock | 0.00 | 14.91 | 15.38 | 22.22 | 50.00 | 8.20
World | 12.50 | 5.26 | 7.69 | 2.22 | 11.76 | 41.80

(c) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 277 | 0 | 0 | 0 | 2 | 29
Electronic | 0 | 83 | 0 | 1 | 5 | 2
Jazz | 9 | 3 | 17 | 1 | 2 | 15
MetalPunk | 1 | 5 | 1 | 35 | 24 | 7
PopRock | 2 | 13 | 1 | 8 | 57 | 15
World | 31 | 10 | 7 | 0 | 12 | 54
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 86.56 | 0.00 | 0.00 | 0.00 | 1.96 | 23.77
Electronic | 0.00 | 72.81 | 0.00 | 2.22 | 4.90 | 1.64
Jazz | 2.81 | 2.63 | 65.38 | 2.22 | 1.96 | 12.30
MetalPunk | 0.31 | 4.39 | 3.85 | 77.78 | 23.53 | 5.74
PopRock | 0.63 | 11.40 | 3.85 | 17.78 | 55.88 | 12.30
World | 9.69 | 8.77 | 26.92 | 0.00 | 11.76 | 44.26

(d) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 289 | 5 | 0 | 0 | 3 | 18
Electronic | 0 | 89 | 0 | 2 | 4 | 4
Jazz | 2 | 3 | 19 | 0 | 1 | 10
MetalPunk | 2 | 2 | 0 | 38 | 21 | 2
PopRock | 0 | 12 | 5 | 4 | 61 | 11
World | 27 | 3 | 2 | 1 | 12 | 77
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 90.31 | 4.39 | 0.00 | 0.00 | 2.94 | 14.75
Electronic | 0.00 | 78.07 | 0.00 | 4.44 | 3.92 | 3.28
Jazz | 0.63 | 2.63 | 73.08 | 0.00 | 0.98 | 8.20
MetalPunk | 0.63 | 1.75 | 0.00 | 84.44 | 20.59 | 1.64
PopRock | 0.00 | 10.53 | 19.23 | 8.89 | 59.80 | 9.02
World | 8.44 | 2.63 | 7.69 | 2.22 | 11.76 | 63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set | CA (%)
SMMFCC3 | 80.38
SMOSC3 | 81.34
SMASE3 | 81.21
SMMFCC3+SMOSC3+SMASE3 | 85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 300 | 2 | 1 | 0 | 3 | 19
Electronic | 0 | 86 | 0 | 1 | 7 | 5
Jazz | 2 | 0 | 18 | 0 | 0 | 3
MetalPunk | 1 | 4 | 0 | 35 | 18 | 2
PopRock | 1 | 16 | 4 | 8 | 67 | 13
World | 16 | 6 | 3 | 1 | 7 | 80
Total | 320 | 114 | 26 | 45 | 102 | 122

(a) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 93.75 | 1.75 | 3.85 | 0.00 | 2.94 | 15.57
Electronic | 0.00 | 75.44 | 0.00 | 2.22 | 6.86 | 4.10
Jazz | 0.63 | 0.00 | 69.23 | 0.00 | 0.00 | 2.46
MetalPunk | 0.31 | 3.51 | 0.00 | 77.78 | 17.65 | 1.64
PopRock | 0.31 | 14.04 | 15.38 | 17.78 | 65.69 | 10.66
World | 5.00 | 5.26 | 11.54 | 2.22 | 6.86 | 65.57

(b) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 300 | 0 | 0 | 0 | 1 | 13
Electronic | 0 | 90 | 1 | 2 | 9 | 6
Jazz | 0 | 0 | 21 | 0 | 0 | 4
MetalPunk | 0 | 2 | 0 | 31 | 21 | 2
PopRock | 0 | 11 | 3 | 10 | 64 | 10
World | 20 | 11 | 1 | 2 | 7 | 87
Total | 320 | 114 | 26 | 45 | 102 | 122

(b) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 93.75 | 0.00 | 0.00 | 0.00 | 0.98 | 10.66
Electronic | 0.00 | 78.95 | 3.85 | 4.44 | 8.82 | 4.92
Jazz | 0.00 | 0.00 | 80.77 | 0.00 | 0.00 | 3.28
MetalPunk | 0.00 | 1.75 | 0.00 | 68.89 | 20.59 | 1.64
PopRock | 0.00 | 9.65 | 11.54 | 22.22 | 62.75 | 8.20
World | 6.25 | 9.65 | 3.85 | 4.44 | 6.86 | 71.31

(c) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 296 | 2 | 1 | 0 | 0 | 17
Electronic | 1 | 91 | 0 | 1 | 4 | 3
Jazz | 0 | 2 | 19 | 0 | 0 | 5
MetalPunk | 0 | 2 | 1 | 34 | 20 | 8
PopRock | 2 | 13 | 4 | 8 | 71 | 8
World | 21 | 4 | 1 | 2 | 7 | 81
Total | 320 | 114 | 26 | 45 | 102 | 122

(c) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 92.50 | 1.75 | 3.85 | 0.00 | 0.00 | 13.93
Electronic | 0.31 | 79.82 | 0.00 | 2.22 | 3.92 | 2.46
Jazz | 0.00 | 1.75 | 73.08 | 0.00 | 0.00 | 4.10
MetalPunk | 0.00 | 1.75 | 3.85 | 75.56 | 19.61 | 6.56
PopRock | 0.63 | 11.40 | 15.38 | 17.78 | 69.61 | 6.56
World | 6.56 | 3.51 | 3.85 | 4.44 | 6.86 | 66.39

(d) | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 300 | 2 | 0 | 0 | 0 | 8
Electronic | 2 | 95 | 0 | 2 | 7 | 9
Jazz | 1 | 1 | 20 | 0 | 0 | 0
MetalPunk | 0 | 0 | 0 | 35 | 10 | 1
PopRock | 1 | 10 | 3 | 7 | 79 | 11
World | 16 | 6 | 3 | 1 | 6 | 93
Total | 320 | 114 | 26 | 45 | 102 | 122

(d) % | Classic | Electronic | Jazz | MetalPunk | PopRock | World
Classic | 93.75 | 1.75 | 0.00 | 0.00 | 0.00 | 6.56
Electronic | 0.63 | 83.33 | 0.00 | 4.44 | 6.86 | 7.38
Jazz | 0.31 | 0.88 | 76.92 | 0.00 | 0.00 | 0.00
MetalPunk | 0.00 | 0.00 | 0.00 | 77.78 | 9.80 | 0.82
PopRock | 0.31 | 8.77 | 11.54 | 15.56 | 77.45 | 9.02
World | 5.00 | 5.26 | 11.54 | 2.22 | 5.88 | 76.23
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature value

Feature Set | MSCs & MSVs | MSE
SMMFCC1 | 77.50 | 72.02
SMMFCC2 | 70.64 | 69.82
SMMFCC3 | 80.38 | 79.15
SMOSC1 | 79.15 | 77.50
SMOSC2 | 68.59 | 70.51
SMOSC3 | 81.34 | 80.11
SMASE1 | 77.78 | 76.41
SMASE2 | 71.74 | 71.06
SMASE3 | 81.21 | 79.15
SMMFCC1+SMOSC1+SMASE1 | 84.64 | 85.08
SMMFCC2+SMOSC2+SMASE2 | 78.60 | 79.01
SMMFCC3+SMOSC3+SMASE3 | 85.32 | 85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119-139
25
Let be the l-th MFCC feature value of the i-th frame
The modulation spectrogram is obtained by applying FFT independently on
each feature value along the time trajectory within a texture window of
length W
][lMFCCi Ll ltle0
0 0 )()(1
0
2
)2( LlWmelMFCClmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (25)
where Mt(m l) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and l is the MFCC coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
LlWmlmMT
lmMT
tt
MFCC ltleltle= sum=
(26)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
( ))(max)(
lmMljMSP MFCC
ΦmΦ
MFCC
hjlj ltle= (27)
( ))(min)(
lmMljMSV MFCC
ΦmΦ
MFCC
hjlj ltle= (28)
where Φjl and Φjh are respectively the low modulation frequency index and
26
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(29) )( )()( ljMSVljMSPljMSC MFCCMFCCMFCC minus=
As a result all MSCs (or MSVs) will form a LtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 25 the flowchart for extracting MMFCC
2142 Modulation Spectral Contrast of OSC (MOSC)
To observe the time-varying behavior of OSC the same modulation spectrum
analysis is applied to the OSC feature values Fig 26 shows the flowchart for
extracting MOSC and the detailed steps will be described below
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. In each matrix the columns correspond to the true genre and the rows to the predicted genre; for each sub-table the track counts are given first, followed by the column-wise percentages (each column sums to 100%).

(a) SMMFCC1 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275        0        2        0          1       19
Electronic        0       91        0        1          7        6
Jazz              6        0       18        0          0        4
Metal/Punk        2        3        0       36         20        4
Pop/Rock          4       12        5        8         70       14
World            33        8        1        0          4       75
Total           320      114       26       45        102      122

(a) SMMFCC1 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        85.94     0.00     7.69      0.00       0.98    15.57
Electronic      0.00    79.82     0.00      2.22       6.86     4.92
Jazz            1.88     0.00    69.23      0.00       0.00     3.28
Metal/Punk      0.63     2.63     0.00     80.00      19.61     3.28
Pop/Rock        1.25    10.53    19.23     17.78      68.63    11.48
World          10.31     7.02     3.85      0.00       3.92    61.48

(b) SMOSC1 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292        1        1        0          2       10
Electronic        1       89        1        2         11       11
Jazz              4        0       19        1          1        6
Metal/Punk        0        5        0       32         21        3
Pop/Rock          0       13        3       10         61        8
World            23        6        2        0          6       84
Total           320      114       26       45        102      122

(b) SMOSC1 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        91.25     0.88     3.85      0.00       1.96     8.20
Electronic      0.31    78.07     3.85      4.44      10.78     9.02
Jazz            1.25     0.00    73.08      2.22       0.98     4.92
Metal/Punk      0.00     4.39     0.00     71.11      20.59     2.46
Pop/Rock        0.00    11.40    11.54     22.22      59.80     6.56
World           7.19     5.26     7.69      0.00       5.88    68.85

(c) SMASE1 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286        3        1        0          3       18
Electronic        0       87        1        1          9        5
Jazz              5        4       17        0          0        9
Metal/Punk        0        4        1       36         18        4
Pop/Rock          1       10        3        7         68       13
World            28        6        3        1          4       73
Total           320      114       26       45        102      122

(c) SMASE1 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        89.38     2.63     3.85      0.00       2.94    14.75
Electronic      0.00    76.32     3.85      2.22       8.82     4.10
Jazz            1.56     3.51    65.38      0.00       0.00     7.38
Metal/Punk      0.00     3.51     3.85     80.00      17.65     3.28
Pop/Rock        0.31     8.77    11.54     15.56      66.67    10.66
World           8.75     5.26    11.54      2.22       3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300        0        1        0          0        9
Electronic        0       96        1        1          9        9
Jazz              2        1       21        0          0        1
Metal/Punk        0        1        0       34          8        1
Pop/Rock          1        9        2        9         80       16
World            17        7        1        1          5       86
Total           320      114       26       45        102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        93.75     0.00     3.85      0.00       0.00     7.38
Electronic      0.00    84.21     3.85      2.22       8.82     7.38
Jazz            0.63     0.88    80.77      0.00       0.00     0.82
Metal/Punk      0.00     0.88     0.00     75.56       7.84     0.82
Pop/Rock        0.31     7.89     7.69     20.00      78.43    13.11
World           5.31     6.14     3.85      2.22       4.90    70.49
3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3, we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again performs the best. Table 3.4 shows the corresponding confusion matrices.
Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2 (layout as in Table 3.2).

(a) SMMFCC2 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272        1        1        0          6       22
Electronic        0       84        0        2          8        4
Jazz             13        1       19        1          2       19
Metal/Punk        2        7        0       39         30        4
Pop/Rock          0       11        3        3         47       19
World            33       10        3        0          9       54
Total           320      114       26       45        102      122

(a) SMMFCC2 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        85.00     0.88     3.85      0.00       5.88    18.03
Electronic      0.00    73.68     0.00      4.44       7.84     3.28
Jazz            4.06     0.88    73.08      2.22       1.96    15.57
Metal/Punk      0.63     6.14     0.00     86.67      29.41     3.28
Pop/Rock        0.00     9.65    11.54      6.67      46.08    15.57
World          10.31     8.77    11.54      0.00       8.82    44.26

(b) SMOSC2 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262        2        0        0          3       33
Electronic        0       83        0        1          9        6
Jazz             17        1       20        0          6       20
Metal/Punk        1        5        0       33         21        2
Pop/Rock          0       17        4       10         51       10
World            40        6        2        1         12       51
Total           320      114       26       45        102      122

(b) SMOSC2 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        81.88     1.75     0.00      0.00       2.94    27.05
Electronic      0.00    72.81     0.00      2.22       8.82     4.92
Jazz            5.31     0.88    76.92      0.00       5.88    16.39
Metal/Punk      0.31     4.39     0.00     73.33      20.59     1.64
Pop/Rock        0.00    14.91    15.38     22.22      50.00     8.20
World          12.50     5.26     7.69      2.22      11.76    41.80

(c) SMASE2 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277        0        0        0          2       29
Electronic        0       83        0        1          5        2
Jazz              9        3       17        1          2       15
Metal/Punk        1        5        1       35         24        7
Pop/Rock          2       13        1        8         57       15
World            31       10        7        0         12       54
Total           320      114       26       45        102      122

(c) SMASE2 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        86.56     0.00     0.00      0.00       1.96    23.77
Electronic      0.00    72.81     0.00      2.22       4.90     1.64
Jazz            2.81     2.63    65.38      2.22       1.96    12.30
Metal/Punk      0.31     4.39     3.85     77.78      23.53     5.74
Pop/Rock        0.63    11.40     3.85     17.78      55.88    12.30
World           9.69     8.77    26.92      0.00      11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289        5        0        0          3       18
Electronic        0       89        0        2          4        4
Jazz              2        3       19        0          1       10
Metal/Punk        2        2        0       38         21        2
Pop/Rock          0       12        5        4         61       11
World            27        3        2        1         12       77
Total           320      114       26       45        102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        90.31     4.39     0.00      0.00       2.94    14.75
Electronic      0.00    78.07     0.00      4.44       3.92     3.28
Jazz            0.63     2.63    73.08      0.00       0.98     8.20
Metal/Punk      0.63     1.75     0.00     84.44      20.59     1.64
Pop/Rock        0.00    10.53    19.23      8.89      59.80     9.02
World           8.44     2.63     7.69      2.22      11.76    63.11
3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.
Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3 (layout as in Table 3.2).

(a) SMMFCC3 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300        2        1        0          3       19
Electronic        0       86        0        1          7        5
Jazz              2        0       18        0          0        3
Metal/Punk        1        4        0       35         18        2
Pop/Rock          1       16        4        8         67       13
World            16        6        3        1          7       80
Total           320      114       26       45        102      122

(a) SMMFCC3 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        93.75     1.75     3.85      0.00       2.94    15.57
Electronic      0.00    75.44     0.00      2.22       6.86     4.10
Jazz            0.63     0.00    69.23      0.00       0.00     2.46
Metal/Punk      0.31     3.51     0.00     77.78      17.65     1.64
Pop/Rock        0.31    14.04    15.38     17.78      65.69    10.66
World           5.00     5.26    11.54      2.22       6.86    65.57

(b) SMOSC3 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300        0        0        0          1       13
Electronic        0       90        1        2          9        6
Jazz              0        0       21        0          0        4
Metal/Punk        0        2        0       31         21        2
Pop/Rock          0       11        3       10         64       10
World            20       11        1        2          7       87
Total           320      114       26       45        102      122

(b) SMOSC3 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        93.75     0.00     0.00      0.00       0.98    10.66
Electronic      0.00    78.95     3.85      4.44       8.82     4.92
Jazz            0.00     0.00    80.77      0.00       0.00     3.28
Metal/Punk      0.00     1.75     0.00     68.89      20.59     1.64
Pop/Rock        0.00     9.65    11.54     22.22      62.75     8.20
World           6.25     9.65     3.85      4.44       6.86    71.31

(c) SMASE3 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296        2        1        0          0       17
Electronic        1       91        0        1          4        3
Jazz              0        2       19        0          0        5
Metal/Punk        0        2        1       34         20        8
Pop/Rock          2       13        4        8         71        8
World            21        4        1        2          7       81
Total           320      114       26       45        102      122

(c) SMASE3 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        92.50     1.75     3.85      0.00       0.00    13.93
Electronic      0.31    79.82     0.00      2.22       3.92     2.46
Jazz            0.00     1.75    73.08      0.00       0.00     4.10
Metal/Punk      0.00     1.75     3.85     75.56      19.61     6.56
Pop/Rock        0.63    11.40    15.38     17.78      69.61     6.56
World           6.56     3.51     3.85      4.44       6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
             Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300        2        0        0          0        8
Electronic        2       95        0        2          7        9
Jazz              1        1       20        0          0        0
Metal/Punk        0        0        0       35         10        1
Pop/Rock          1       10        3        7         79       11
World            16        6        3        1          6       93
Total           320      114       26       45        102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
             Classic  Electronic  Jazz   Metal/Punk  Pop/Rock  World
Classic        93.75     1.75     0.00      0.00       0.00     6.56
Electronic      0.63    83.33     0.00      4.44       6.86     7.38
Jazz            0.31     0.88    76.92      0.00       0.00     0.00
Metal/Punk      0.00     0.00     0.00     77.78       9.80     0.82
Pop/Rock        0.31     8.77    11.54     15.56      77.45     9.02
World           5.00     5.26    11.54      2.22       5.88    76.23
Conventional methods use the energy of each modulation subband as the feature value. In this work, however, the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband are used as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7, we can see that using MSCs and MSVs gives better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 3.7 Comparison of the averaged classification accuracy (%) when MSCs & MSVs versus the modulation subband energy (MSE) are used as the feature values

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                             77.50      72.02
SMMFCC2                             70.64      69.82
SMMFCC3                             80.38      79.15
SMOSC1                              79.15      77.50
SMOSC2                              68.59      70.51
SMOSC3                              81.34      80.11
SMASE1                              77.78      76.41
SMASE2                              71.74      71.06
SMASE3                              81.21      79.15
SMMFCC1+SMOSC1+SMASE1               84.64      85.08
SMMFCC2+SMOSC2+SMASE2               78.60      79.01
SMMFCC3+SMOSC3+SMASE3               85.32      85.19
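For reference, the sketch below contrasts the two kinds of per-subband descriptors compared in Table 3.7: the contrast/valley pair (MSC is the peak minus the valley, and the valley itself is kept as MSV) versus a conventional subband energy value. It assumes the averaged modulation spectrum of one feature value is available as a NumPy array indexed by modulation frequency and uses the logarithmically spaced subband boundaries given in Chapter 2; the energy definition shown is one common choice and is our assumption, not a formula from the thesis.

```python
import numpy as np

# Modulation frequency index ranges of the eight logarithmically spaced subbands
SUBBANDS = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def subband_descriptors(mod_spectrum):
    """mod_spectrum: averaged modulation spectrum magnitudes for one feature value.
    Returns (MSCs, MSVs, energies), one value per modulation subband."""
    msc, msv, energy = [], [], []
    for lo, hi in SUBBANDS:
        band = mod_spectrum[lo:hi]
        peak, valley = band.max(), band.min()     # modulation spectral peak and valley
        msc.append(peak - valley)                 # modulation spectral contrast
        msv.append(valley)
        energy.append(float(np.sum(band ** 2)))   # assumed energy-based alternative (MSE)
    return np.array(msc), np.array(msv), np.array(energy)
```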
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically-spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, issue 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred, A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo, A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li, M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier, F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis, L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, issue 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun., vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, issue 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
27
Step 1 Framing and OSC Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the OSC coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th OSC of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dOSCi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedOSCdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (30)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the OSC coefficient index In the
study W is 512 which is about 6 seconds with 50 overlap between two
successive texture windows The representative modulation spectrogram of a
music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
OSC ltleltle= sum=
(31)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands In the study
the number of modulation subbands is 8 (J = 8) The frequency interval of
each modulation subband is shown in Table 24 For each feature value the
modulation spectral peak (MSP) and modulation spectral valley (MSV)
within each modulation subband are then evaluated
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
Fig 27 the flowchart for extracting MASE (music signal → framing → NASE extraction → DFT along the time trajectory of each feature value → windowing/average of the modulation spectrum → contrast/valley determination)
Table 24 Frequency interval of each modulation subband
Filter number   Modulation frequency index range   Modulation frequency interval (Hz)
0               [0, 2)                             [0, 0.33)
1               [2, 4)                             [0.33, 0.66)
2               [4, 8)                             [0.66, 1.32)
3               [8, 16)                            [1.32, 2.64)
4               [16, 32)                           [2.64, 5.28)
5               [32, 64)                           [5.28, 10.56)
6               [64, 128)                          [10.56, 21.12)
7               [128, 256)                         [21.12, 42.24]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the feature values.
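As an illustration (not the authors' code), the row- and column-based mean/standard-deviation aggregation detailed in the following subsections can be sketched as:

```python
import numpy as np

def aggregate_msc_msv(msc, msv):
    """Row- and column-based statistics of the D x J MSC/MSV matrices.

    Returns a vector of length 4D + 4J; the concatenation order differs from
    the exact element ordering of the equations below, but the same values
    are collected.
    """
    parts = []
    for mat in (msc, msv):
        parts.append(mat.mean(axis=1))    # row-based means   (D values)
        parts.append(mat.std(axis=1))     # row-based stds    (D values)
    for mat in (msc, msv):
        parts.append(mat.mean(axis=0))    # column-based means (J values)
        parts.append(mat.std(axis=0))     # column-based stds  (J values)
    return np.concatenate(parts)
```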
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
u_{MSC-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j,l)    (40)

\sigma_{MSC-row}^{MFCC}(l) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{MFCC}(j,l) - u_{MSC-row}^{MFCC}(l) \big)^2 \Big)^{1/2}    (41)

u_{MSV-row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j,l)    (42)

\sigma_{MSV-row}^{MFCC}(l) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{MFCC}(j,l) - u_{MSV-row}^{MFCC}(l) \big)^2 \Big)^{1/2}    (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
f_{row}^{MFCC} = [u_{MSC-row}^{MFCC}(0), \sigma_{MSC-row}^{MFCC}(0), u_{MSV-row}^{MFCC}(0), \sigma_{MSV-row}^{MFCC}(0), \ldots,
                  u_{MSC-row}^{MFCC}(L-1), \sigma_{MSC-row}^{MFCC}(L-1), u_{MSV-row}^{MFCC}(L-1), \sigma_{MSV-row}^{MFCC}(L-1)]^T    (44)
Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J)
column of the MSC and MSV matrices can be computed as follows
u_{MSC-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j,l)    (45)

\sigma_{MSC-col}^{MFCC}(j) = \Big( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSC^{MFCC}(j,l) - u_{MSC-col}^{MFCC}(j) \big)^2 \Big)^{1/2}    (46)

u_{MSV-col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j,l)    (47)

\sigma_{MSV-col}^{MFCC}(j) = \Big( \frac{1}{L} \sum_{l=0}^{L-1} \big( MSV^{MFCC}(j,l) - u_{MSV-col}^{MFCC}(j) \big)^2 \Big)^{1/2}    (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
f_{col}^{MFCC} = [u_{MSC-col}^{MFCC}(0), \sigma_{MSC-col}^{MFCC}(0), u_{MSV-col}^{MFCC}(0), \sigma_{MSV-col}^{MFCC}(0), \ldots,
                  u_{MSC-col}^{MFCC}(J-1), \sigma_{MSC-col}^{MFCC}(J-1), u_{MSV-col}^{MFCC}(J-1), \sigma_{MSV-col}^{MFCC}(J-1)]^T    (49)
If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:
f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T    (50)
In summary, the row-based MSCs (or MSVs) is of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of
the MSC and MSV matrices of MOSC can be computed as follows
u_{MSC-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j,d)    (51)

\sigma_{MSC-row}^{OSC}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{OSC}(j,d) - u_{MSC-row}^{OSC}(d) \big)^2 \Big)^{1/2}    (52)

u_{MSV-row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j,d)    (53)

\sigma_{MSV-row}^{OSC}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{OSC}(j,d) - u_{MSV-row}^{OSC}(d) \big)^2 \Big)^{1/2}    (54)
Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [u_{MSC-row}^{OSC}(0), \sigma_{MSC-row}^{OSC}(0), u_{MSV-row}^{OSC}(0), \sigma_{MSV-row}^{OSC}(0), \ldots,
                 u_{MSC-row}^{OSC}(D-1), \sigma_{MSC-row}^{OSC}(D-1), u_{MSV-row}^{OSC}(D-1), \sigma_{MSV-row}^{OSC}(D-1)]^T    (55)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows

u_{MSC-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j,d)    (56)

\sigma_{MSC-col}^{OSC}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{OSC}(j,d) - u_{MSC-col}^{OSC}(j) \big)^2 \Big)^{1/2}    (57)

u_{MSV-col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j,d)    (58)

\sigma_{MSV-col}^{OSC}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{OSC}(j,d) - u_{MSV-col}^{OSC}(j) \big)^2 \Big)^{1/2}    (59)
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [u_{MSC-col}^{OSC}(0), \sigma_{MSC-col}^{OSC}(0), u_{MSV-col}^{OSC}(0), \sigma_{MSV-col}^{OSC}(0), \ldots,
                 u_{MSC-col}^{OSC}(J-1), \sigma_{MSC-col}^{OSC}(J-1), u_{MSV-col}^{OSC}(J-1), \sigma_{MSV-col}^{OSC}(J-1)]^T    (60)

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T    (61)

In summary, the row-based MSCs (or MSVs) is of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall
feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows
u_{MSC-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j,d)    (62)

\sigma_{MSC-row}^{NASE}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSC^{NASE}(j,d) - u_{MSC-row}^{NASE}(d) \big)^2 \Big)^{1/2}    (63)

u_{MSV-row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j,d)    (64)

\sigma_{MSV-row}^{NASE}(d) = \Big( \frac{1}{J} \sum_{j=0}^{J-1} \big( MSV^{NASE}(j,d) - u_{MSV-row}^{NASE}(d) \big)^2 \Big)^{1/2}    (65)
Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [u_{MSC-row}^{NASE}(0), \sigma_{MSC-row}^{NASE}(0), u_{MSV-row}^{NASE}(0), \sigma_{MSV-row}^{NASE}(0), \ldots,
                  u_{MSC-row}^{NASE}(D-1), \sigma_{MSC-row}^{NASE}(D-1), u_{MSV-row}^{NASE}(D-1), \sigma_{MSV-row}^{NASE}(D-1)]^T    (66)

Similarly the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows

u_{MSC-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j,d)    (67)

\sigma_{MSC-col}^{NASE}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSC^{NASE}(j,d) - u_{MSC-col}^{NASE}(j) \big)^2 \Big)^{1/2}    (68)

u_{MSV-col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j,d)    (69)

\sigma_{MSV-col}^{NASE}(j) = \Big( \frac{1}{D} \sum_{d=0}^{D-1} \big( MSV^{NASE}(j,d) - u_{MSV-col}^{NASE}(j) \big)^2 \Big)^{1/2}    (70)
Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [u_{MSC-col}^{NASE}(0), \sigma_{MSC-col}^{NASE}(0), u_{MSV-col}^{NASE}(0), \sigma_{MSV-col}^{NASE}(0), \ldots,
                  u_{MSC-col}^{NASE}(J-1), \sigma_{MSC-col}^{NASE}(J-1), u_{MSV-col}^{NASE}(J-1), \sigma_{MSV-col}^{NASE}(J-1)]^T    (71)

If the row-based modulation spectral feature vector and column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T    (72)

In summary, the row-based MSCs (or MSVs) is of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) is of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 the row-based modulation spectral feature values: for each feature dimension d, the mean μ_row,d and standard deviation σ_row,d are computed across the modulation-frequency subbands of the MSC and MSV matrices

Fig 29 the column-based modulation spectral feature values: for each modulation subband j, the mean μ_col,j and standard deviation σ_col,j are computed across the feature dimensions of the MSC and MSV matrices
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
f_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}    (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, f_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges for variant feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c:
\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C    (74)

where C is the number of classes, f_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:
f_{max}(m) = \max_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m), \qquad f_{min}(m) = \min_{1 \le c \le C,\ 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
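A small sketch of this min-max normalization (hypothetical variable names; the small epsilon guarding against constant feature values is an added safeguard not present in the equations above):

```python
import numpy as np

def fit_min_max(train_features):
    """Per-dimension extrema over all training vectors, cf. Eq. (75).

    train_features : (N, M) matrix stacking the feature vectors of all
                     training music signals of all genres.
    """
    return train_features.min(axis=0), train_features.max(axis=0)

def normalize(f, f_min, f_max, eps=1e-12):
    """Linearly map each feature value into [0, 1], cf. Eq. (74)."""
    return (f - f_min) / (f_max - f_min + eps)
```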
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower dimensional feature vector space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T    (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by
S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T    (77)
where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\big( (A^T S_W A)^{-1} (A^T S_B A) \big)    (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by \Phi\Lambda^{-1/2}:

x_w = (\Phi\Lambda^{-1/2})^T x    (79)

It can be shown that the whitened within-class scatter matrix S_{W,w} = (\Phi\Lambda^{-1/2})^T S_W (\Phi\Lambda^{-1/2}) derived from all the whitened training vectors will become an identity matrix I. Thus the whitened between-class scatter matrix S_{B,w} = (\Phi\Lambda^{-1/2})^T S_B (\Phi\Lambda^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_{B,w}. Assuming that the eigenvalues are sorted in a decreasing order, the eigenvectors corresponding to the (C-1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi\Lambda^{-1/2}\Psi    (80)
A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x    (81)
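The whitened LDA transformation described above can be sketched as follows (a simplified NumPy version; numerical safeguards beyond a simple eigenvalue floor, as well as any preprocessing used in the actual study, are omitted):

```python
import numpy as np

def whitened_lda(X, labels, C):
    """Whitened LDA matrix A_WLDA of Eq. (80), mapping H dims to C-1 dims.

    X      : (N, H) normalized training feature vectors
    labels : (N,) integer class labels in [0, C)
    """
    H = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in range(C):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)        # within-class scatter, Eq. (76)
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)              # between-class scatter, Eq. (77)
    lam, Phi = np.linalg.eigh(Sw)                    # Sw = Phi Lambda Phi^T
    lam = np.maximum(lam, 1e-10)                     # floor to avoid division by zero
    whiten = Phi @ np.diag(1.0 / np.sqrt(lam))       # whitening matrix Phi Lambda^{-1/2}
    Sb_w = whiten.T @ Sb @ whiten                    # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(Sb_w)
    top = np.argsort(lam_b)[::-1][:C - 1]            # eigenvectors of the C-1 largest eigenvalues
    return whiten @ Psi[:, top]                      # A_WLDA = Phi Lambda^{-1/2} Psi, Eq. (80)
```

Each H-dimensional feature vector x is then projected by y = A.T @ x for the returned matrix A, which corresponds to Eq. (81).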
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}    (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)    (83)
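In code, the classification phase therefore reduces to a nearest-centroid rule in the whitened LDA space; a minimal sketch (assuming the transformed training vectors Y and their genre labels are available):

```python
import numpy as np

def genre_centroids(Y, labels, C):
    """Per-genre centroid of the transformed training vectors, cf. Eq. (82)."""
    return np.stack([Y[labels == c].mean(axis=0) for c in range(C)])

def classify(y, centroids):
    """Return the genre index with minimum Euclidean distance, cf. Eq. (83)."""
    return int(np.argmin(np.linalg.norm(centroids - y, axis=1)))
```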
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
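For instance, with the test-set class sizes listed above, the overall accuracy is simply the class-wise accuracies weighted by each genre's share of the 729 test tracks; the per-class accuracies in the small sketch below are placeholders, not results of this study:

```python
import numpy as np

# number of test tracks per genre: Classical, Electronic, Jazz/Blues,
# Metal/Punk, Rock/Pop, World
n_test = np.array([320, 114, 26, 45, 102, 122])
ca_per_class = np.array([0.94, 0.84, 0.81, 0.76, 0.78, 0.70])  # placeholders

P = n_test / n_test.sum()               # probability of appearance P_c
overall_ca = float(P @ ca_per_class)    # weighted overall accuracy, Eq. (84)
```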
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC1                        77.50
SMOSC1                         79.15
SMASE1                         77.78
SMMFCC1+SMOSC1+SMASE1          84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a)         Classic Electronic Jazz MetalPunk PopRock World
Classic       275      0        2      0        1      19
Electronic      0     91        0      1        7       6
Jazz            6      0       18      0        0       4
MetalPunk       2      3        0     36       20       4
PopRock         4     12        5      8       70      14
World          33      8        1      0        4      75
Total         320    114       26     45      102     122

(a) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      85.94   0.00    7.69    0.00    0.98   15.57
Electronic    0.00  79.82    0.00    2.22    6.86    4.92
Jazz          1.88   0.00   69.23    0.00    0.00    3.28
MetalPunk     0.63   2.63    0.00   80.00   19.61    3.28
PopRock       1.25  10.53   19.23   17.78   68.63   11.48
World        10.31   7.02    3.85    0.00    3.92   61.48

(b)         Classic Electronic Jazz MetalPunk PopRock World
Classic       292      1        1      0        2      10
Electronic      1     89        1      2       11      11
Jazz            4      0       19      1        1       6
MetalPunk       0      5        0     32       21       3
PopRock         0     13        3     10       61       8
World          23      6        2      0        6      84
Total         320    114       26     45      102     122

(b) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      91.25   0.88    3.85    0.00    1.96    8.20
Electronic    0.31  78.07    3.85    4.44   10.78    9.02
Jazz          1.25   0.00   73.08    2.22    0.98    4.92
MetalPunk     0.00   4.39    0.00   71.11   20.59    2.46
PopRock       0.00  11.40   11.54   22.22   59.80    6.56
World         7.19   5.26    7.69    0.00    5.88   68.85
(c)         Classic Electronic Jazz MetalPunk PopRock World
Classic       286      3        1      0        3      18
Electronic      0     87        1      1        9       5
Jazz            5      4       17      0        0       9
MetalPunk       0      4        1     36       18       4
PopRock         1     10        3      7       68      13
World          28      6        3      1        4      73
Total         320    114       26     45      102     122

(c) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      89.38   2.63    3.85    0.00    2.94   14.75
Electronic    0.00  76.32    3.85    2.22    8.82    4.10
Jazz          1.56   3.51   65.38    0.00    0.00    7.38
MetalPunk     0.00   3.51    3.85   80.00   17.65    3.28
PopRock       0.31   8.77   11.54   15.56   66.67   10.66
World         8.75   5.26   11.54    2.22    3.92   59.84

(d)         Classic Electronic Jazz MetalPunk PopRock World
Classic       300      0        1      0        0       9
Electronic      0     96        1      1        9       9
Jazz            2      1       21      0        0       1
MetalPunk       0      1        0     34        8       1
PopRock         1      9        2      9       80      16
World          17      7        1      1        5      86
Total         320    114       26     45      102     122

(d) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      93.75   0.00    3.85    0.00    0.00    7.38
Electronic    0.00  84.21    3.85    2.22    8.82    7.38
Jazz          0.63   0.88   80.77    0.00    0.00    0.82
MetalPunk     0.00   0.88    0.00   75.56    7.84    0.82
PopRock       0.31   7.89    7.69   20.00   78.43   13.11
World         5.31   6.14    3.85    2.22    4.90   70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 has better classification accuracy than SMMFCC2 and SMOSC2, which is different from the row-based case. As with the row-based feature vectors, the combined feature vector again gets the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC2                        70.64
SMOSC2                         68.59
SMASE2                         71.74
SMMFCC2+SMOSC2+SMASE2          78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a)         Classic Electronic Jazz MetalPunk PopRock World
Classic       272      1        1      0        6      22
Electronic      0     84        0      2        8       4
Jazz           13      1       19      1        2      19
MetalPunk       2      7        0     39       30       4
PopRock         0     11        3      3       47      19
World          33     10        3      0        9      54
Total         320    114       26     45      102     122

(a) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      85.00   0.88    3.85    0.00    5.88   18.03
Electronic    0.00  73.68    0.00    4.44    7.84    3.28
Jazz          4.06   0.88   73.08    2.22    1.96   15.57
MetalPunk     0.63   6.14    0.00   86.67   29.41    3.28
PopRock       0.00   9.65   11.54    6.67   46.08   15.57
World        10.31   8.77   11.54    0.00    8.82   44.26

(b)         Classic Electronic Jazz MetalPunk PopRock World
Classic       262      2        0      0        3      33
Electronic      0     83        0      1        9       6
Jazz           17      1       20      0        6      20
MetalPunk       1      5        0     33       21       2
PopRock         0     17        4     10       51      10
World          40      6        2      1       12      51
Total         320    114       26     45      102     122

(b) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      81.88   1.75    0.00    0.00    2.94   27.05
Electronic    0.00  72.81    0.00    2.22    8.82    4.92
Jazz          5.31   0.88   76.92    0.00    5.88   16.39
MetalPunk     0.31   4.39    0.00   73.33   20.59    1.64
PopRock       0.00  14.91   15.38   22.22   50.00    8.20
World        12.50   5.26    7.69    2.22   11.76   41.80
(c)         Classic Electronic Jazz MetalPunk PopRock World
Classic       277      0        0      0        2      29
Electronic      0     83        0      1        5       2
Jazz            9      3       17      1        2      15
MetalPunk       1      5        1     35       24       7
PopRock         2     13        1      8       57      15
World          31     10        7      0       12      54
Total         320    114       26     45      102     122

(c) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      86.56   0.00    0.00    0.00    1.96   23.77
Electronic    0.00  72.81    0.00    2.22    4.90    1.64
Jazz          2.81   2.63   65.38    2.22    1.96   12.30
MetalPunk     0.31   4.39    3.85   77.78   23.53    5.74
PopRock       0.63  11.40    3.85   17.78   55.88   12.30
World         9.69   8.77   26.92    0.00   11.76   44.26

(d)         Classic Electronic Jazz MetalPunk PopRock World
Classic       289      5        0      0        3      18
Electronic      0     89        0      2        4       4
Jazz            2      3       19      0        1      10
MetalPunk       2      2        0     38       21       2
PopRock         0     12        5      4       61      11
World          27      3        2      1       12      77
Total         320    114       26     45      102     122

(d) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      90.31   4.39    0.00    0.00    2.94   14.75
Electronic    0.00  78.07    0.00    4.44    3.92    3.28
Jazz          0.63   2.63   73.08    0.00    0.98    8.20
MetalPunk     0.63   1.75    0.00   84.44   20.59    1.64
PopRock       0.00  10.53   19.23    8.89   59.80    9.02
World         8.44   2.63    7.69    2.22   11.76   63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                    CA (%)
SMMFCC3                        80.38
SMOSC3                         81.34
SMASE3                         81.21
SMMFCC3+SMOSC3+SMASE3          85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a)         Classic Electronic Jazz MetalPunk PopRock World
Classic       300      2        1      0        3      19
Electronic      0     86        0      1        7       5
Jazz            2      0       18      0        0       3
MetalPunk       1      4        0     35       18       2
PopRock         1     16        4      8       67      13
World          16      6        3      1        7      80
Total         320    114       26     45      102     122

(a) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      93.75   1.75    3.85    0.00    2.94   15.57
Electronic    0.00  75.44    0.00    2.22    6.86    4.10
Jazz          0.63   0.00   69.23    0.00    0.00    2.46
MetalPunk     0.31   3.51    0.00   77.78   17.65    1.64
PopRock       0.31  14.04   15.38   17.78   65.69   10.66
World         5.00   5.26   11.54    2.22    6.86   65.57
(b)         Classic Electronic Jazz MetalPunk PopRock World
Classic       300      0        0      0        1      13
Electronic      0     90        1      2        9       6
Jazz            0      0       21      0        0       4
MetalPunk       0      2        0     31       21       2
PopRock         0     11        3     10       64      10
World          20     11        1      2        7      87
Total         320    114       26     45      102     122

(b) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      93.75   0.00    0.00    0.00    0.98   10.66
Electronic    0.00  78.95    3.85    4.44    8.82    4.92
Jazz          0.00   0.00   80.77    0.00    0.00    3.28
MetalPunk     0.00   1.75    0.00   68.89   20.59    1.64
PopRock       0.00   9.65   11.54   22.22   62.75    8.20
World         6.25   9.65    3.85    4.44    6.86   71.31
(c)         Classic Electronic Jazz MetalPunk PopRock World
Classic       296      2        1      0        0      17
Electronic      1     91        0      1        4       3
Jazz            0      2       19      0        0       5
MetalPunk       0      2        1     34       20       8
PopRock         2     13        4      8       71       8
World          21      4        1      2        7      81
Total         320    114       26     45      102     122

(c) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      92.50   1.75    3.85    0.00    0.00   13.93
Electronic    0.31  79.82    0.00    2.22    3.92    2.46
Jazz          0.00   1.75   73.08    0.00    0.00    4.10
MetalPunk     0.00   1.75    3.85   75.56   19.61    6.56
PopRock       0.63  11.40   15.38   17.78   69.61    6.56
World         6.56   3.51    3.85    4.44    6.86   66.39
(d)         Classic Electronic Jazz MetalPunk PopRock World
Classic       300      2        0      0        0       8
Electronic      2     95        0      2        7       9
Jazz            1      1       20      0        0       0
MetalPunk       0      0        0     35       10       1
PopRock         1     10        3      7       79      11
World          16      6        3      1        6      93
Total         320    114       26     45      102     122

(d) in %    Classic Electronic Jazz MetalPunk PopRock World
Classic      93.75   1.75    0.00    0.00    0.00    6.56
Electronic    0.63  83.33    0.00    4.44    6.86    7.38
Jazz          0.31   0.88   76.92    0.00    0.00    0.00
MetalPunk     0.00   0.00    0.00   77.78    9.80    0.82
PopRock       0.31   8.77   11.54   15.56   77.45    9.02
World         5.00   5.26   11.54    2.22    5.88   76.23
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
Table 37 Comparison of the averaged classification accuracy (%) of the MSCs & MSVs and the energy (MSE) for each feature value

Feature Set                    MSCs & MSVs    MSE
SMMFCC1                        77.50          72.02
SMMFCC2                        70.64          69.82
SMMFCC3                        80.38          79.15
SMOSC1                         79.15          77.50
SMOSC2                         68.59          70.51
SMOSC3                         81.34          80.11
SMASE1                         77.78          76.41
SMASE2                         71.74          71.06
SMASE3                         81.21          79.15
SMMFCC1+SMOSC1+SMASE1          84.64          85.08
SMMFCC2+SMOSC2+SMASE2          78.60          79.01
SMMFCC3+SMOSC3+SMASE3          85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre
Classification Contest
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia & Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119-139
28
( ))(max)(
dmMdjMSP OSC
ΦmΦ
OSC
hjlj ltle= (32)
( ))(min)(
dmMdjMSV OSC
ΦmΦ
OSC
hjlj ltle= (33)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(34) )( )()( djMSVdjMSPdjMSC OSCOSCOSC minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times20times8 = 320
Fig 26 the flowchart for extracting MOSC
29
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE the same modulation spectrum
analysis is applied to the NASE feature values Fig 27 shows the flowchart for
extracting MASE and the detailed steps will be described below
Step 1 Framing and NASE Extraction
Given an input music signal divide the whole music signal into successive
overlapped frames and extract the NASE coefficients of each frame
Step 2 Modulation Spectrum Analysis
Let be the d-th NASE of the i-th frame The
modulation spectrogram is obtained by applying FFT independently on each
feature value along the time trajectory within a texture window of length W
][dNASEi Dd ltle0
0 0 )()(1
0
2
)2( DdWmedNASEdmMW
n
mWnj
nWtt ltleltle= summinus
=
minus
+times
π (35)
where Mt(m d) is the modulation spectrogram for the t-th texture window m
is the modulation frequency index and d is the NASE coefficient index In
the study W is 512 which is about 6 seconds with 50 overlap between
two successive texture windows The representative modulation spectrogram
of a music track is derived by time averaging the magnitude modulation
spectrograms of all texture windows
0 0 )(1)(1
DdWmdmMT
dmMT
tt
NASE ltleltle= sum=
(36)
where T is the total number of texture windows in the music track
Step 3 ContrastValley Determination
The averaged modulation spectrum of each feature value will be
decomposed into J logarithmically spaced modulation subbands(See Table2
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19
Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3
MetalPunk 1 4 0 35 18 2
PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80
Total 320 114 26 45 102 122

(a) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 3.85 0.00 2.94 15.57
Electronic 0.00 75.44 0.00 2.22 6.86 4.10
Jazz 0.63 0.00 69.23 0.00 0.00 2.46
MetalPunk 0.31 3.51 0.00 77.78 17.65 1.64
PopRock 0.31 14.04 15.38 17.78 65.69 10.66
World 5.00 5.26 11.54 2.22 6.86 65.57

(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13
Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4
MetalPunk 0 2 0 31 21 2
PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87
Total 320 114 26 45 102 122

(b) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 0.00 0.00 0.98 10.66
Electronic 0.00 78.95 3.85 4.44 8.82 4.92
Jazz 0.00 0.00 80.77 0.00 0.00 3.28
MetalPunk 0.00 1.75 0.00 68.89 20.59 1.64
PopRock 0.00 9.65 11.54 22.22 62.75 8.20
World 6.25 9.65 3.85 4.44 6.86 71.31

(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17
Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5
MetalPunk 0 2 1 34 20 8
PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81
Total 320 114 26 45 102 122

(c) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 92.50 1.75 3.85 0.00 0.00 13.93
Electronic 0.31 79.82 0.00 2.22 3.92 2.46
Jazz 0.00 1.75 73.08 0.00 0.00 4.10
MetalPunk 0.00 1.75 3.85 75.56 19.61 6.56
PopRock 0.63 11.40 15.38 17.78 69.61 6.56
World 6.56 3.51 3.85 4.44 6.86 66.39

(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8
Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0
MetalPunk 0 0 0 35 10 1
PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93
Total 320 114 26 45 102 122

(d) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 1.75 0.00 0.00 0.00 6.56
Electronic 0.63 83.33 0.00 4.44 6.86 7.38
Jazz 0.31 0.88 76.92 0.00 0.00 0.00
MetalPunk 0.00 0.00 0.00 77.78 9.80 0.82
PopRock 0.31 8.77 11.54 15.56 77.45 9.02
World 5.00 5.26 11.54 2.22 5.88 76.23
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37, we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature value
Feature Set MSCs & MSVs MSE
SMMFCC1 77.50 72.02
SMMFCC2 70.64 69.82
SMMFCC3 80.38 79.15
SMOSC1 79.15 77.50
SMOSC2 68.59 70.51
SMOSC3 81.34 80.11
SMASE1 77.78 76.41
SMASE2 71.74 71.06
SMASE3 81.21 79.15
SMMFCC1+SMOSC1+SMASE1 84.64 85.08
SMMFCC2+SMOSC2+SMASE2 78.60 79.01
SMMFCC3+SMOSC3+SMASE3 85.32 85.19
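For reference, the conventional baseline in Table 37 (the MSE column) replaces the MSC/MSV pair by a per-subband energy. The exact energy definition is not spelled out in the text; the sketch below assumes the sum of squared magnitudes of the averaged modulation spectrogram within each subband, using the same subband boundaries as Table 24. The function name and array layout are illustrative only.

```python
import numpy as np

# Modulation-frequency index ranges of the 8 subbands (Table 24)
SUBBAND_EDGES = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def modulation_subband_energy(M):
    """Conventional baseline feature (interpreted here as subband energy).

    M: (num_modulation_bins, D) averaged modulation spectrogram of one track.
    Returns a (J, D) matrix with one energy value per subband and feature
    dimension, which replaces the MSC/MSV pair of the proposed method."""
    return np.stack([np.square(M[lo:hi, :]).sum(axis=0) for lo, hi in SUBBAND_EDGES])
```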
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox Features and classifiers for the automatic classification of
musical audio signals Proceedings of the International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch A hierarchical approach to automatic musical
genre classification in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara Music genre classification with taxonomy in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet Representing musical genre a state of the art
Journal of New Music Research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley Beat tracking with a two state model in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook Pitch Histogram in Audio and
Symbolic Music Information Retrieval in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen A computationally efficient multipitch analysis
model IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard A unitary model of pitch perception Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg Robust speech recognition using
the modulation spectrogram Speech Commun Vol 25 No 1 pp 117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton Modulation-scale analysis for
content identification IEEE Transactions on Signal Processing Vol 52 No 10
pp 3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New York: Wiley 2000
[29] C Xu N C Maddage and X Shao Automatic music classification and
summarization IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao Audio signal feature extraction and
classification using local discriminant bases IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp 1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
AdaBoost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Y Freund and R E Schapire A decision-theoretic generalization of on-line
learning and an application to boosting Journal of Computer and System
Sciences 55 (1) (1997) 119-139
2143 Modulation Spectral Contrast of NASE (MASE)
To observe the time-varying behavior of NASE, the same modulation spectrum analysis is applied to the NASE feature values. Fig 27 shows the flowchart for extracting MASE; the detailed steps are described below.
Step 1 Framing and NASE Extraction
Given an input music signal, divide the whole music signal into successive overlapped frames and extract the NASE coefficients of each frame.
Step 2 Modulation Spectrum Analysis
Let NASE_i[d], 0 ≤ d < D, denote the d-th NASE coefficient of the i-th frame. The modulation spectrogram is obtained by applying the FFT independently to each feature value along its time trajectory within a texture window of length W:

M_t(m, d) = \left| \sum_{n=0}^{W-1} NASE_{tW+n}[d] \, e^{-j 2 \pi m n / W} \right|, \quad 0 \le m < W, \; 0 \le d < D   (35)

where M_t(m, d) is the modulation spectrogram for the t-th texture window, m is the modulation frequency index, and d is the NASE coefficient index. In this study, W is 512, which is about 6 seconds, with 50% overlap between two successive texture windows. The representative modulation spectrogram of a music track is derived by time-averaging the magnitude modulation spectrograms of all texture windows:

M^{NASE}(m, d) = \frac{1}{T} \sum_{t=1}^{T} M_t(m, d), \quad 0 \le m < W, \; 0 \le d < D   (36)
where T is the total number of texture windows in the music track
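For illustration, the two steps above can be sketched in NumPy as follows. This is a minimal sketch, assuming the per-frame NASE (or MFCC/OSC) coefficients of one track are already available as a (number of frames) × D array and that the track is at least W frames long; the function name and array layout are illustrative, not part of the proposed system. Only modulation-frequency indices below 256 are needed by the subband analysis in Step 3.

```python
import numpy as np

def modulation_spectrogram(feature_frames, W=512):
    """Average magnitude modulation spectrum of per-frame feature values.

    feature_frames: (num_frames, D) array, e.g. the NASE coefficients of each frame.
    Returns a (W, D) array M[m, d] averaged over all texture windows
    (hop = W // 2, i.e. 50% overlap between successive texture windows)."""
    num_frames, D = feature_frames.shape
    hop = W // 2
    spectra = []
    for start in range(0, num_frames - W + 1, hop):
        segment = feature_frames[start:start + W, :]              # one texture window
        # FFT along the time trajectory of each feature value (Eq. 35)
        spectra.append(np.abs(np.fft.fft(segment, n=W, axis=0)))
    # time-average the magnitude modulation spectrograms (Eq. 36)
    return np.mean(spectra, axis=0)
```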
Step 3 Contrast/Valley Determination
The averaged modulation spectrum of each feature value will be decomposed into J logarithmically spaced modulation subbands (see Table 24). In this study, the number of modulation subbands is 8 (J = 8), and the frequency interval of each modulation subband is shown in Table 24. For each feature value, the modulation spectral peak (MSP) and modulation spectral valley (MSV) within each modulation subband are then evaluated:
MSP^{NASE}(j, d) = \max_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)   (37)

MSV^{NASE}(j, d) = \min_{\Phi_{j,l} \le m < \Phi_{j,h}} M^{NASE}(m, d)   (38)

where \Phi_{j,l} and \Phi_{j,h} are respectively the low modulation frequency index and the high modulation frequency index of the j-th modulation subband, 0 ≤ j < J. The MSPs correspond to the dominant rhythmic components, and the MSVs to the non-rhythmic components, in the modulation subbands. Therefore, the difference between MSP and MSV reflects the modulation spectral contrast distribution:

MSC^{NASE}(j, d) = MSP^{NASE}(j, d) - MSV^{NASE}(j, d)   (39)
As a result, all MSCs (or MSVs) will form a D×J matrix which contains the modulation spectral contrast information. Therefore, the feature dimension of MASE is 2×19×8 = 304.
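A corresponding sketch of Step 3, assuming the averaged modulation spectrogram from the previous step is given as a (modulation bins) × D array and using the subband index ranges of Table 24; variable names are illustrative only.

```python
import numpy as np

# Modulation-frequency index ranges of the J = 8 subbands (Table 24)
SUBBAND_EDGES = [(0, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

def contrast_valley(M):
    """M: (num_modulation_bins, D) averaged modulation spectrogram (Eq. 36).
    Returns the J x D matrices MSC and MSV of Eqs. (37)-(39)."""
    J, D = len(SUBBAND_EDGES), M.shape[1]
    MSC, MSV = np.zeros((J, D)), np.zeros((J, D))
    for j, (lo, hi) in enumerate(SUBBAND_EDGES):
        band = M[lo:hi, :]
        msp = band.max(axis=0)       # modulation spectral peak   (Eq. 37)
        MSV[j] = band.min(axis=0)    # modulation spectral valley (Eq. 38)
        MSC[j] = msp - MSV[j]        # modulation spectral contrast (Eq. 39)
    return MSC, MSV
```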
Fig 27 The flowchart for extracting MASE. The music signal is divided into frames s_1[n], ..., s_I[n]; the NASE coefficients NASE_1[d], ..., NASE_I[d] are extracted; the DFT is applied along the time trajectory of each feature value; the magnitude modulation spectra of the texture windows are windowed and averaged; and the contrast/valley determination yields the MSC and MSV matrices.
Table 24 Frequency interval of each modulation subband
Filter number / Modulation frequency index range / Modulation frequency interval (Hz)
0 / [0, 2) / [0, 0.33)
1 / [2, 4) / [0.33, 0.66)
2 / [4, 8) / [0.66, 1.32)
3 / [8, 16) / [1.32, 2.64)
4 / [16, 32) / [2.64, 5.28)
5 / [32, 64) / [5.28, 10.56)
6 / [64, 128) / [10.56, 21.12)
7 / [128, 256) / [21.12, 42.24]
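The Hz columns of Table 24 follow from the frame rate of the feature trajectories: modulation-frequency index m corresponds to m × (frame rate) / W Hz. A small check, under the assumption of a frame rate of roughly 84.5 frames per second (so that a 512-frame texture window spans about 6 seconds, consistent with the text above):

```python
# Hypothetical check of Table 24: convert modulation-frequency bin indices to Hz.
FRAME_RATE = 84.5   # frames per second (assumed, not stated explicitly in the text)
W = 512             # texture window length in frames
edges = [0, 2, 4, 8, 16, 32, 64, 128, 256]
for j in range(len(edges) - 1):
    lo_hz = edges[j] * FRAME_RATE / W
    hi_hz = edges[j + 1] * FRAME_RATE / W
    print(f"subband {j}: [{lo_hz:.2f}, {hi_hz:.2f}) Hz")
```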
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

\mu_{MSC,row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(j, l)   (40)

\sigma_{MSC,row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC,row}^{MFCC}(l) \right)^2 \right)^{1/2}   (41)

\mu_{MSV,row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(j, l)   (42)

\sigma_{MSV,row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV,row}^{MFCC}(l) \right)^2 \right)^{1/2}   (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

f_{row}^{MFCC} = [\mu_{MSC,row}^{MFCC}(0), \sigma_{MSC,row}^{MFCC}(0), \mu_{MSV,row}^{MFCC}(0), \sigma_{MSV,row}^{MFCC}(0), \ldots, \mu_{MSC,row}^{MFCC}(L-1), \sigma_{MSC,row}^{MFCC}(L-1), \mu_{MSV,row}^{MFCC}(L-1), \sigma_{MSV,row}^{MFCC}(L-1)]^T   (44)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(j, l)   (45)

\sigma_{MSC,col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(j, l) - \mu_{MSC,col}^{MFCC}(j) \right)^2 \right)^{1/2}   (46)

\mu_{MSV,col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(j, l)   (47)

\sigma_{MSV,col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(j, l) - \mu_{MSV,col}^{MFCC}(j) \right)^2 \right)^{1/2}   (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{MFCC} = [\mu_{MSC,col}^{MFCC}(0), \sigma_{MSC,col}^{MFCC}(0), \mu_{MSV,col}^{MFCC}(0), \sigma_{MSV,col}^{MFCC}(0), \ldots, \mu_{MSC,col}^{MFCC}(J-1), \sigma_{MSC,col}^{MFCC}(J-1), \mu_{MSV,col}^{MFCC}(J-1), \sigma_{MSV,col}^{MFCC}(J-1)]^T   (49)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) can be obtained:

f^{MFCC} = [(f_{row}^{MFCC})^T, (f_{col}^{MFCC})^T]^T   (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4L+4J. That is, the overall feature dimension of SMMFCC is 80+32 = 112.
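The row-based and column-based aggregation of Eqs. (40)-(50) amounts to taking means and standard deviations of the MSC and MSV matrices along their two axes. A minimal sketch, reusing the (J, D) layout of the earlier sketch; the entries are simply grouped rather than interleaved as in Eq. (44), which does not affect the later normalization and classification steps.

```python
import numpy as np

def aggregate(MSC, MSV):
    """MSC, MSV: (J, D) matrices of one feature set (MFCC, OSC or NASE).
    Returns the row-based (length 4D), column-based (length 4J) and
    combined (length 4D + 4J) modulation spectral feature vectors."""
    def mean_std(mat, axis):
        return np.concatenate([mat.mean(axis=axis), mat.std(axis=axis)])
    # "row-based": statistics over the J modulation subbands, one pair per feature value
    f_row = np.concatenate([mean_std(MSC, axis=0), mean_std(MSV, axis=0)])
    # "column-based": statistics over the D feature values, one pair per subband
    f_col = np.concatenate([mean_std(MSC, axis=1), mean_std(MSV, axis=1)])
    return f_row, f_col, np.concatenate([f_row, f_col])
```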
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

\mu_{MSC,row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(j, d)   (51)

\sigma_{MSC,row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(j, d) - \mu_{MSC,row}^{OSC}(d) \right)^2 \right)^{1/2}   (52)

\mu_{MSV,row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(j, d)   (53)

\sigma_{MSV,row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(j, d) - \mu_{MSV,row}^{OSC}(d) \right)^2 \right)^{1/2}   (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{OSC} = [\mu_{MSC,row}^{OSC}(0), \sigma_{MSC,row}^{OSC}(0), \mu_{MSV,row}^{OSC}(0), \sigma_{MSV,row}^{OSC}(0), \ldots, \mu_{MSC,row}^{OSC}(D-1), \sigma_{MSC,row}^{OSC}(D-1), \mu_{MSV,row}^{OSC}(D-1), \sigma_{MSV,row}^{OSC}(D-1)]^T   (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(j, d)   (56)

\sigma_{MSC,col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(j, d) - \mu_{MSC,col}^{OSC}(j) \right)^2 \right)^{1/2}   (57)

\mu_{MSV,col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(j, d)   (58)

\sigma_{MSV,col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(j, d) - \mu_{MSV,col}^{OSC}(j) \right)^2 \right)^{1/2}   (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{OSC} = [\mu_{MSC,col}^{OSC}(0), \sigma_{MSC,col}^{OSC}(0), \mu_{MSV,col}^{OSC}(0), \sigma_{MSV,col}^{OSC}(0), \ldots, \mu_{MSC,col}^{OSC}(J-1), \sigma_{MSC,col}^{OSC}(J-1), \mu_{MSV,col}^{OSC}(J-1), \sigma_{MSV,col}^{OSC}(J-1)]^T   (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{OSC} = [(f_{row}^{OSC})^T, (f_{col}^{OSC})^T]^T   (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMOSC is 80+32 = 112.
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:

\mu_{MSC,row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(j, d)   (62)

\sigma_{MSC,row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(j, d) - \mu_{MSC,row}^{NASE}(d) \right)^2 \right)^{1/2}   (63)

\mu_{MSV,row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(j, d)   (64)

\sigma_{MSV,row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(j, d) - \mu_{MSV,row}^{NASE}(d) \right)^2 \right)^{1/2}   (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

f_{row}^{NASE} = [\mu_{MSC,row}^{NASE}(0), \sigma_{MSC,row}^{NASE}(0), \mu_{MSV,row}^{NASE}(0), \sigma_{MSV,row}^{NASE}(0), \ldots, \mu_{MSC,row}^{NASE}(D-1), \sigma_{MSC,row}^{NASE}(D-1), \mu_{MSV,row}^{NASE}(D-1), \sigma_{MSV,row}^{NASE}(D-1)]^T   (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

\mu_{MSC,col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(j, d)   (67)

\sigma_{MSC,col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(j, d) - \mu_{MSC,col}^{NASE}(j) \right)^2 \right)^{1/2}   (68)

\mu_{MSV,col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(j, d)   (69)

\sigma_{MSV,col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(j, d) - \mu_{MSV,col}^{NASE}(j) \right)^2 \right)^{1/2}   (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

f_{col}^{NASE} = [\mu_{MSC,col}^{NASE}(0), \sigma_{MSC,col}^{NASE}(0), \mu_{MSV,col}^{NASE}(0), \sigma_{MSV,col}^{NASE}(0), \ldots, \mu_{MSC,col}^{NASE}(J-1), \sigma_{MSC,col}^{NASE}(J-1), \mu_{MSV,col}^{NASE}(J-1), \sigma_{MSV,col}^{NASE}(J-1)]^T   (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) can be obtained:

f^{NASE} = [(f_{row}^{NASE})^T, (f_{col}^{NASE})^T]^T   (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and the column-based modulation spectral feature vectors results in a feature vector of length 4D+4J. That is, the overall feature dimension of SMASE is 76+32 = 108.
Fig 28 The row-based modulation spectral feature values: for each feature dimension d, the mean \mu_{row,d} and standard deviation \sigma_{row,d} of the MSC (or MSV) values are computed across the modulation-frequency subbands.
Fig 29 The column-based modulation spectral feature values: for each modulation subband j, the mean \mu_{col,j} and standard deviation \sigma_{col,j} of the MSC (or MSV) values are computed across the feature dimensions.
216 Feature Vector Normalization
In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}   (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C   (74)

where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{\max}(m) = \max_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C, \, 1 \le j \le N_c} f_{c,j}(m)   (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
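A compact sketch of Eqs. (73)-(75), assuming the training feature vectors are stacked in an (N, H) array with integer genre labels; the guard against constant feature dimensions is an implementation detail added here, not part of the thesis.

```python
import numpy as np

def representative_vectors(train_features, train_labels, num_classes):
    """train_features: (N, H) feature vectors; train_labels: (N,) genre ids 0..C-1.
    Returns (f_min, f_max) of Eq. (75) and the normalized per-genre
    representative vectors of Eqs. (73)-(74)."""
    f_min = train_features.min(axis=0)                       # Eq. (75)
    f_max = train_features.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)       # guard for constant features
    reps = np.stack([train_features[train_labels == c].mean(axis=0)   # Eq. (73)
                     for c in range(num_classes)])
    return f_min, f_max, (reps - f_min) / span               # Eq. (74)
```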
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.
Let S_W and S_B denote the within-class scatter matrix and the between-class scatter matrix, respectively. The within-class scatter matrix is defined as

S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T   (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T   (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = \mathrm{tr}\left( (A^T S_W A)^{-1} (A^T S_B A) \right)   (78)
From the above equation we can see that LDA tries to find a transformation matrix that maximizes the ratio of between-class scatter to within-class scatter in a lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let \Phi denote the matrix whose columns are the orthonormal eigenvectors of S_W, and \Lambda the diagonal matrix formed by the corresponding eigenvalues. Thus S_W \Phi = \Phi \Lambda. Each training vector x is then whitening-transformed by \Phi \Lambda^{-1/2}:

x_w = (\Phi \Lambda^{-1/2})^T x   (79)

It can be shown that the whitened within-class scatter matrix S_{W,w} = (\Phi \Lambda^{-1/2})^T S_W (\Phi \Lambda^{-1/2}) derived from all the whitened training vectors will become an identity matrix I. Thus the whitened between-class scatter matrix S_{B,w} = (\Phi \Lambda^{-1/2})^T S_B (\Phi \Lambda^{-1/2}) contains all the discriminative information. A transformation matrix \Psi can be determined by finding the eigenvectors of S_{B,w}. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix \Psi. Finally, the optimal whitened LDA transformation matrix A_{WLDA} is defined as

A_{WLDA} = \Phi \Lambda^{-1/2} \Psi   (80)

A_{WLDA} will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_{WLDA}^T x   (81)
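The whitened LDA transformation of Eqs. (76)-(80) can be sketched as follows; the small eigenvalue floor is an added numerical safeguard and not part of the derivation, and the function name is illustrative.

```python
import numpy as np

def whitened_lda(X, labels, num_classes):
    """X: (N, H) normalized training vectors; labels: (N,) genre ids 0..C-1.
    Returns the H x (C-1) matrix A_WLDA of Eq. (80)."""
    H = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((H, H))
    S_b = np.zeros((H, H))
    for c in range(num_classes):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)                             # Eq. (76)
        diff = (mc - overall_mean)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)                       # Eq. (77)
    lam, Phi = np.linalg.eigh(S_w)                                 # S_W Phi = Phi Lambda
    whiten = Phi @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-12)))  # Phi Lambda^{-1/2}
    S_b_w = whiten.T @ S_b @ whiten                                # whitened between-class scatter
    lam_b, Psi = np.linalg.eigh(S_b_w)
    Psi = Psi[:, np.argsort(lam_b)[::-1][:num_classes - 1]]        # top C-1 eigenvectors
    return whiten @ Psi                                            # Eq. (80)
```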
23 Music Genre Classification Phase
In the classification phase, the row-based as well as column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_{WLDA}. Let y denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}   (82)

where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:

s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)   (83)
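The classification phase then reduces to a projection followed by a nearest-centroid search. A minimal sketch, assuming the genre centroids of Eq. (82) have already been computed in the whitened LDA space:

```python
import numpy as np

def classify_track(feature_vector, A_wlda, genre_centroids):
    """feature_vector: (H,) normalized feature vector of the input track.
    genre_centroids: (C, C-1) whitened-LDA-transformed representative vectors (Eq. 82).
    Returns the index of the identified genre (Eqs. 81 and 83)."""
    y = A_wlda.T @ feature_vector                         # Eq. (81)
    distances = np.linalg.norm(genre_centroids - y, axis=1)
    return int(np.argmin(distances))                      # Eq. (83): nearest centroid
```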
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33] was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.
Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c   (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
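A small sketch of Eq. (84); here the confusion matrix is assumed to store true genres in rows and predicted genres in columns, and P_c is estimated from the test-set class sizes, in which case CA reduces to the number of correctly classified tracks divided by the total number of tracks.

```python
import numpy as np

def overall_accuracy(confusion):
    """confusion[true_genre, predicted_genre]: counts on the test set (Eq. 84)."""
    class_totals = confusion.sum(axis=1)
    per_class_ca = np.diag(confusion) / class_totals   # CA_c
    priors = class_totals / confusion.sum()            # P_c
    return float(np.sum(priors * per_class_ca))
```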
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31, we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors
Feature Set CA (%)
SMMFCC1 77.50
SMOSC1 79.15
SMASE1 77.78
SMMFCC1+SMOSC1+SMASE1 84.64
Table 32 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+SMOSC1+SMASE1

(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19
Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4
MetalPunk 2 3 0 36 20 4
PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75
Total 320 114 26 45 102 122

(a) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 85.94 0.00 7.69 0.00 0.98 15.57
Electronic 0.00 79.82 0.00 2.22 6.86 4.92
Jazz 1.88 0.00 69.23 0.00 0.00 3.28
MetalPunk 0.63 2.63 0.00 80.00 19.61 3.28
PopRock 1.25 10.53 19.23 17.78 68.63 11.48
World 10.31 7.02 3.85 0.00 3.92 61.48

(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10
Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6
MetalPunk 0 5 0 32 21 3
PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84
Total 320 114 26 45 102 122

(b) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 91.25 0.88 3.85 0.00 1.96 8.20
Electronic 0.31 78.07 3.85 4.44 10.78 9.02
Jazz 1.25 0.00 73.08 2.22 0.98 4.92
MetalPunk 0.00 4.39 0.00 71.11 20.59 2.46
PopRock 0.00 11.40 11.54 22.22 59.80 6.56
World 7.19 5.26 7.69 0.00 5.88 68.85

(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18
Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9
MetalPunk 0 4 1 36 18 4
PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73
Total 320 114 26 45 102 122

(c) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 89.38 2.63 3.85 0.00 2.94 14.75
Electronic 0.00 76.32 3.85 2.22 8.82 4.10
Jazz 1.56 3.51 65.38 0.00 0.00 7.38
MetalPunk 0.00 3.51 3.85 80.00 17.65 3.28
PopRock 0.31 8.77 11.54 15.56 66.67 10.66
World 8.75 5.26 11.54 2.22 3.92 59.84

(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9
Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1
MetalPunk 0 1 0 34 8 1
PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86
Total 320 114 26 45 102 122

(d) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 93.75 0.00 3.85 0.00 0.00 7.38
Electronic 0.00 84.21 3.85 2.22 8.82 7.38
Jazz 0.63 0.88 80.77 0.00 0.00 0.82
MetalPunk 0.00 0.88 0.00 75.56 7.84 0.82
PopRock 0.31 7.89 7.69 20.00 78.43 13.11
World 5.31 6.14 3.85 2.22 4.90 70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33, we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As with the row-based features, the combined feature vector again gives the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors
Feature Set CA (%)
SMMFCC2 70.64
SMOSC2 68.59
SMASE2 71.74
SMMFCC2+SMOSC2+SMASE2 78.60
Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+SMOSC2+SMASE2

(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22
Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19
MetalPunk 2 7 0 39 30 4
PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54
Total 320 114 26 45 102 122

(a) (%) Classic Electronic Jazz MetalPunk PopRock World
Classic 85.00 0.88 3.85 0.00 5.88 18.03
Electronic 0.00 73.68 0.00 4.44 7.84 3.28
Jazz 4.06 0.88 73.08 2.22 1.96 15.57
MetalPunk 0.63 6.14 0.00 86.67 29.41 3.28
PopRock 0.00 9.65 11.54 6.67 46.08 15.57
World 10.31 8.77 11.54 0.00 8.82 44.26
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
30
In the study the number of modulation subbands is 8 (J = 8) The frequency
interval of each modulation subband is shown in Table 24 For each feature
value the modulation spectral peak (MSP) and modulation spectral valley
(MSV) within each modulation subband are then evaluated
( ))(max)(
dmMdjMSP NASE
ΦmΦ
NASE
hjlj ltle= (37)
( ))(min)(
dmMdjMSV NASE
ΦmΦ
NASE
hjlj ltle= (38)
where Φjl and Φjh are respectively the low modulation frequency index and
high modulation frequency index of the j-th modulation subband 0 le j lt J
The MSPs correspond to the dominant rhythmic components and MSVs the
non-rhythmic components in the modulation subbands Therefore the
difference between MSP and MSV will reflect the modulation spectral
contrast distribution
(39) )( )()( djMSVdjMSPdjMSC NASENASENASE minus=
As a result all MSCs (or MSVs) will form a DtimesJ matrix which contains the
modulation spectral contrast information Therefore the feature dimension of
MMFCC is 2times19times8 = 304
31
WindowingAverage
Modulation Spectrum
ContrastValleyDetermination
DFT
NASE extraction
Framing
M1d[m]
M2d[m]
MTd[m]
M3d[m]
MT-1d[m]
MD[m]
NASEI[d]NASEI-1[d]NASE1[d]NASE2[d]
sI[n]sI-1[n]s1[n] s3[n]s2[n]
Music signal
NASE
M1[m]
M2[m]
M3[m]
MD-1[m]
Fig 27 the flowchart for extracting MASE
Table 24 Frequency interval of each modulation subband
Filter number Modulation frequency index range Modulation frequency interval (Hz)0 [0 2) [0 033) 1 [2 4) [033 066) 2 [4 8) [066 132) 3 [8 16) [132 264) 4 [16 32) [264 528) 5 [32 64) [528 1056) 6 [64 128) [1056 2112) 7 [128 256) [2112 4224]
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectralcepstral
feature value of variant modulation frequency which reflects the beat interval of a
music signal(See Fig 28) Each column of the MSC (or MSV) matrix corresponds to
the same modulation subband of different spectralcepstral feature values(See Fig 29)
To reduce the dimension of the feature space the mean and standard deviation along
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase, the row-based as well as the column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix \mathbf{A}_{WLDA}. Let \mathbf{y} denote the whitened-LDA-transformed feature vector. In this study the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened-LDA-transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:

\bar{\mathbf{y}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{y}_{c,n}    (82)
where \mathbf{y}_{c,n} denotes the whitened-LDA-transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{\mathbf{y}}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to \mathbf{y}:

s = \arg\min_{1 \le c \le C} d(\mathbf{y}, \bar{\mathbf{y}}_c)    (83)
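A short sketch of the classification step in Eqs. (82)-(83), assuming the whitened-LDA-transformed training vectors are already available; the function names are illustrative.

```python
import numpy as np

def class_centroids(Y_train, labels, num_classes):
    """Eq. (82): mean whitened-LDA vector per genre. Y_train: (N, h)."""
    return np.stack([Y_train[labels == c].mean(axis=0) for c in range(num_classes)])

def classify(y, centroids):
    """Eq. (83): nearest centroid in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(centroids - y, axis=1)))
```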
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison. The database consists of 1458 music tracks, in which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.
Since the music tracks per class are not equally distributed, the overall accuracy of correctly classified genres is evaluated as follows:

CA = \sum_{1 \le c \le C} P_c \cdot CA_c    (84)

where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
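As a small illustration of Eq. (84), the overall accuracy is simply the per-class accuracy weighted by each genre's share of the test set (e.g., 320, 114, 26, 45, 102, and 122 tracks in this study); the function name below is illustrative.

```python
import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    """Eq. (84): per-class accuracies weighted by class appearance probability."""
    p = np.asarray(class_counts) / np.sum(class_counts)
    return float(np.sum(p * np.asarray(per_class_acc)))
```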
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA, %) for each row-based modulation spectral feature vector

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) SMMFCC1 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          275           0     2          0        1     19
Electronic         0          91     0          1        7      6
Jazz               6           0    18          0        0      4
MetalPunk          2           3     0         36       20      4
PopRock            4          12     5          8       70     14
World             33           8     1          0        4     75
Total            320         114    26         45      102    122

(a) SMMFCC1 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        85.94        0.00   7.69       0.00     0.98  15.57
Electronic      0.00       79.82   0.00       2.22     6.86   4.92
Jazz            1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk       0.63        2.63   0.00      80.00    19.61   3.28
PopRock         1.25       10.53  19.23      17.78    68.63  11.48
World          10.31        7.02   3.85       0.00     3.92  61.48
(b) SMOSC1 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          292           1     1          0        2     10
Electronic         1          89     1          2       11     11
Jazz               4           0    19          1        1      6
MetalPunk          0           5     0         32       21      3
PopRock            0          13     3         10       61      8
World             23           6     2          0        6     84
Total            320         114    26         45      102    122

(b) SMOSC1 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        91.25        0.88   3.85       0.00     1.96   8.20
Electronic      0.31       78.07   3.85       4.44    10.78   9.02
Jazz            1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk       0.00        4.39   0.00      71.11    20.59   2.46
PopRock         0.00       11.40  11.54      22.22    59.80   6.56
World           7.19        5.26   7.69       0.00     5.88  68.85
(c) SMASE1 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          286           3     1          0        3     18
Electronic         0          87     1          1        9      5
Jazz               5           4    17          0        0      9
MetalPunk          0           4     1         36       18      4
PopRock            1          10     3          7       68     13
World             28           6     3          1        4     73
Total            320         114    26         45      102    122

(c) SMASE1 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        89.38        2.63   3.85       0.00     2.94  14.75
Electronic      0.00       76.32   3.85       2.22     8.82   4.10
Jazz            1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk       0.00        3.51   3.85      80.00    17.65   3.28
PopRock         0.31        8.77  11.54      15.56    66.67  10.66
World           8.75        5.26  11.54       2.22     3.92  59.84
(d) SMMFCC1+SMOSC1+SMASE1 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           0     1          0        0      9
Electronic         0          96     1          1        9      9
Jazz               2           1    21          0        0      1
MetalPunk          0           1     0         34        8      1
PopRock            1           9     2          9       80     16
World             17           7     1          1        5     86
Total            320         114    26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        93.75        0.00   3.85       0.00     0.00   7.38
Electronic      0.00       84.21   3.85       2.22     8.82   7.38
Jazz            0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk       0.00        0.88   0.00      75.56     7.84   0.82
PopRock         0.31        7.89   7.69      20.00    78.43  13.11
World           5.31        6.14   3.85       2.22     4.90  70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again achieves the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA, %) for each column-based modulation spectral feature vector

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) SMMFCC2 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          272           1     1          0        6     22
Electronic         0          84     0          2        8      4
Jazz              13           1    19          1        2     19
MetalPunk          2           7     0         39       30      4
PopRock            0          11     3          3       47     19
World             33          10     3          0        9     54
Total            320         114    26         45      102    122

(a) SMMFCC2 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        85.00        0.88   3.85       0.00     5.88  18.03
Electronic      0.00       73.68   0.00       4.44     7.84   3.28
Jazz            4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk       0.63        6.14   0.00      86.67    29.41   3.28
PopRock         0.00        9.65  11.54       6.67    46.08  15.57
World          10.31        8.77  11.54       0.00     8.82  44.26
(b) SMOSC2 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          262           2     0          0        3     33
Electronic         0          83     0          1        9      6
Jazz              17           1    20          0        6     20
MetalPunk          1           5     0         33       21      2
PopRock            0          17     4         10       51     10
World             40           6     2          1       12     51
Total            320         114    26         45      102    122

(b) SMOSC2 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        81.88        1.75   0.00       0.00     2.94  27.05
Electronic      0.00       72.81   0.00       2.22     8.82   4.92
Jazz            5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk       0.31        4.39   0.00      73.33    20.59   1.64
PopRock         0.00       14.91  15.38      22.22    50.00   8.20
World          12.50        5.26   7.69       2.22    11.76  41.80
(c) SMASE2 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          277           0     0          0        2     29
Electronic         0          83     0          1        5      2
Jazz               9           3    17          1        2     15
MetalPunk          1           5     1         35       24      7
PopRock            2          13     1          8       57     15
World             31          10     7          0       12     54
Total            320         114    26         45      102    122

(c) SMASE2 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        86.56        0.00   0.00       0.00     1.96  23.77
Electronic      0.00       72.81   0.00       2.22     4.90   1.64
Jazz            2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk       0.31        4.39   3.85      77.78    23.53   5.74
PopRock         0.63       11.40   3.85      17.78    55.88  12.30
World           9.69        8.77  26.92       0.00    11.76  44.26
(d) SMMFCC2+SMOSC2+SMASE2 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          289           5     0          0        3     18
Electronic         0          89     0          2        4      4
Jazz               2           3    19          0        1     10
MetalPunk          2           2     0         38       21      2
PopRock            0          12     5          4       61     11
World             27           3     2          1       12     77
Total            320         114    26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        90.31        4.39   0.00       0.00     2.94  14.75
Electronic      0.00       78.07   0.00       4.44     3.92   3.28
Jazz            0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk       0.63        1.75   0.00      84.44    20.59   1.64
PopRock         0.00       10.53  19.23       8.89    59.80   9.02
World           8.44        2.63   7.69       2.22    11.76  63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           2     1          0        3     19
Electronic         0          86     0          1        7      5
Jazz               2           0    18          0        0      3
MetalPunk          1           4     0         35       18      2
PopRock            1          16     4          8       67     13
World             16           6     3          1        7     80
Total            320         114    26         45      102    122

(a) SMMFCC3 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        93.75        1.75   3.85       0.00     2.94  15.57
Electronic      0.00       75.44   0.00       2.22     6.86   4.10
Jazz            0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk       0.31        3.51   0.00      77.78    17.65   1.64
PopRock         0.31       14.04  15.38      17.78    65.69  10.66
World           5.00        5.26  11.54       2.22     6.86  65.57
(b) SMOSC3 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           0     0          0        1     13
Electronic         0          90     1          2        9      6
Jazz               0           0    21          0        0      4
MetalPunk          0           2     0         31       21      2
PopRock            0          11     3         10       64     10
World             20          11     1          2        7     87
Total            320         114    26         45      102    122

(b) SMOSC3 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        93.75        0.00   0.00       0.00     0.98  10.66
Electronic      0.00       78.95   3.85       4.44     8.82   4.92
Jazz            0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk       0.00        1.75   0.00      68.89    20.59   1.64
PopRock         0.00        9.65  11.54      22.22    62.75   8.20
World           6.25        9.65   3.85       4.44     6.86  71.31
(c) SMASE3 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          296           2     1          0        0     17
Electronic         1          91     0          1        4      3
Jazz               0           2    19          0        0      5
MetalPunk          0           2     1         34       20      8
PopRock            2          13     4          8       71      8
World             21           4     1          2        7     81
Total            320         114    26         45      102    122

(c) SMASE3 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        92.50        1.75   3.85       0.00     0.00  13.93
Electronic      0.31       79.82   0.00       2.22     3.92   2.46
Jazz            0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk       0.00        1.75   3.85      75.56    19.61   6.56
PopRock         0.63       11.40  15.38      17.78    69.61   6.56
World           6.56        3.51   3.85       4.44     6.86  66.39
(d) SMMFCC3+SMOSC3+SMASE3 (number of music tracks; columns: actual genre)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic          300           2     0          0        0      8
Electronic         2          95     0          2        7      9
Jazz               1           1    20          0        0      0
MetalPunk          0           0     0         35       10      1
PopRock            1          10     3          7       79     11
World             16           6     3          1        6     93
Total            320         114    26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (classification accuracy, %)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic        93.75        1.75   0.00       0.00     0.00   6.56
Electronic      0.63       83.33   0.00       4.44     6.86   7.38
Jazz            0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk       0.00        0.00   0.00      77.78     9.80   0.82
PopRock         0.31        8.77  11.54      15.56    77.45   9.02
World           5.00        5.26  11.54       2.22     5.88  76.23
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature set

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically-spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet and M. Sandler, The way it sounds: timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia, Vol. 7, Issue 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, Research article: automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia and D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine, Vol. 23, Issue 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, No. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim and K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan and K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, Issue 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of online learning and an application to boosting," Journal of Computer and System Sciences 55 (1) (1997) 119-139.
[Fig. 27 The flowchart for extracting MASE: the music signal s_1[n], ..., s_I[n] is framed, NASE features NASE_1[d], ..., NASE_I[d] are extracted from the frames, the modulation spectrum M_d[m] of each feature dimension d is obtained by a DFT over the texture windows, and the per-window spectra are windowed and averaged before contrast/valley determination.]
Table 24 Frequency interval of each modulation subband

Filter number    Modulation frequency index range    Modulation frequency interval (Hz)
0                [0, 2)                              [0, 0.33)
1                [2, 4)                              [0.33, 0.66)
2                [4, 8)                              [0.66, 1.32)
3                [8, 16)                             [1.32, 2.64)
4                [16, 32)                            [2.64, 5.28)
5                [32, 64)                            [5.28, 10.56)
6                [64, 128)                           [10.56, 21.12)
7                [128, 256)                          [21.12, 42.24]
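A small sketch of how a modulation-frequency index is mapped to one of the eight logarithmically spaced subbands of Table 24; the index resolution (index 2 corresponding to 0.33 Hz) follows the table, and the function name is illustrative.

```python
import numpy as np

# Subband lower edges in modulation-frequency *index* units, as in Table 24.
SUBBAND_EDGES = np.array([0, 2, 4, 8, 16, 32, 64, 128, 256])

def subband_of(m_index):
    """Return the modulation-subband number (0-7) for a modulation frequency index."""
    band = int(np.searchsorted(SUBBAND_EDGES, m_index, side="right") - 1)
    return min(band, len(SUBBAND_EDGES) - 2)

# Example: index 20 corresponds to about 3.3 Hz and falls in subband 4 ([2.64, 5.28) Hz).
assert subband_of(20) == 4
```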
215 Statistical Aggregation of Modulation Spectral Feature Values
Each row of the MSC (or MSV) matrix corresponds to the same spectral/cepstral feature value at different modulation frequencies, which reflects the beat interval of a music signal (see Fig. 28). Each column of the MSC (or MSV) matrix corresponds to the same modulation subband across different spectral/cepstral feature values (see Fig. 29). To reduce the dimension of the feature space, the mean and standard deviation along each row (and each column) of the MSC and MSV matrices will be computed as the feature values.
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 ≤ l < L) row of the MSC and MSV matrices of MMFCC can be computed as follows:

u_{MSC\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{MFCC}(l, j)    (40)

\sigma_{MSC\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{MFCC}(l, j) - u_{MSC\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2}    (41)

u_{MSV\text{-}row}^{MFCC}(l) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{MFCC}(l, j)    (42)

\sigma_{MSV\text{-}row}^{MFCC}(l) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{MFCC}(l, j) - u_{MSV\text{-}row}^{MFCC}(l) \right)^2 \right)^{1/2}    (43)

Thus the row-based modulation spectral feature vector of a music track is of size 4L and can be represented as

\mathbf{f}_{row}^{MFCC} = [u_{MSC\text{-}row}^{MFCC}(0),\, \sigma_{MSC\text{-}row}^{MFCC}(0),\, u_{MSV\text{-}row}^{MFCC}(0),\, \sigma_{MSV\text{-}row}^{MFCC}(0),\, \ldots,\, u_{MSC\text{-}row}^{MFCC}(L-1),\, \sigma_{MSC\text{-}row}^{MFCC}(L-1),\, u_{MSV\text{-}row}^{MFCC}(L-1),\, \sigma_{MSV\text{-}row}^{MFCC}(L-1)]^{\mathrm{T}}    (44)
Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSC^{MFCC}(l, j)    (45)

\sigma_{MSC\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSC^{MFCC}(l, j) - u_{MSC\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2}    (46)

u_{MSV\text{-}col}^{MFCC}(j) = \frac{1}{L} \sum_{l=0}^{L-1} MSV^{MFCC}(l, j)    (47)

\sigma_{MSV\text{-}col}^{MFCC}(j) = \left( \frac{1}{L} \sum_{l=0}^{L-1} \left( MSV^{MFCC}(l, j) - u_{MSV\text{-}col}^{MFCC}(j) \right)^2 \right)^{1/2}    (48)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

\mathbf{f}_{col}^{MFCC} = [u_{MSC\text{-}col}^{MFCC}(0),\, \sigma_{MSC\text{-}col}^{MFCC}(0),\, u_{MSV\text{-}col}^{MFCC}(0),\, \sigma_{MSV\text{-}col}^{MFCC}(0),\, \ldots,\, u_{MSC\text{-}col}^{MFCC}(J-1),\, \sigma_{MSC\text{-}col}^{MFCC}(J-1),\, u_{MSV\text{-}col}^{MFCC}(J-1),\, \sigma_{MSV\text{-}col}^{MFCC}(J-1)]^{\mathrm{T}}    (49)
If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4L+4J) is obtained:

\mathbf{f}^{MFCC} = [(\mathbf{f}_{row}^{MFCC})^{\mathrm{T}},\, (\mathbf{f}_{col}^{MFCC})^{\mathrm{T}}]^{\mathrm{T}}    (50)

In summary, the row-based MSCs (or MSVs) are of size 4L = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4L+4J; that is, the overall feature dimension of SMMFCC is 80+32 = 112.
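The row-based and column-based statistics in Eqs. (40)-(50) are simply means and standard deviations taken along the two axes of the MSC and MSV matrices. A compact NumPy sketch is given below; the array layout and the interleaving order of the output vector are illustrative assumptions, and the same routine applies unchanged to the OSC and NASE matrices of the following subsections.

```python
import numpy as np

def aggregate_modulation_features(msc, msv):
    """msc, msv: (L, J) matrices -- L feature dimensions x J modulation subbands.
    Returns the (4L + 4J)-dimensional combined feature vector of Eq. (50)."""
    def row_col_stats(M):
        # Eqs. (40)-(43): mean/std along each row; Eqs. (45)-(48): along each column.
        return M.mean(axis=1), M.std(axis=1), M.mean(axis=0), M.std(axis=0)

    mscr_u, mscr_s, mscc_u, mscc_s = row_col_stats(msc)
    msvr_u, msvr_s, msvc_u, msvc_s = row_col_stats(msv)
    f_row = np.column_stack([mscr_u, mscr_s, msvr_u, msvr_s]).ravel()  # Eq. (44), 4L values
    f_col = np.column_stack([mscc_u, mscc_s, msvc_u, msvc_s]).ravel()  # Eq. (49), 4J values
    return np.concatenate([f_row, f_col])                              # Eq. (50)
```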
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MOSC can be computed as follows:

u_{MSC\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{OSC}(d, j)    (51)

\sigma_{MSC\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{OSC}(d, j) - u_{MSC\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2}    (52)

u_{MSV\text{-}row}^{OSC}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{OSC}(d, j)    (53)

\sigma_{MSV\text{-}row}^{OSC}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{OSC}(d, j) - u_{MSV\text{-}row}^{OSC}(d) \right)^2 \right)^{1/2}    (54)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

\mathbf{f}_{row}^{OSC} = [u_{MSC\text{-}row}^{OSC}(0),\, \sigma_{MSC\text{-}row}^{OSC}(0),\, u_{MSV\text{-}row}^{OSC}(0),\, \sigma_{MSV\text{-}row}^{OSC}(0),\, \ldots,\, u_{MSC\text{-}row}^{OSC}(D-1),\, \sigma_{MSC\text{-}row}^{OSC}(D-1),\, u_{MSV\text{-}row}^{OSC}(D-1),\, \sigma_{MSV\text{-}row}^{OSC}(D-1)]^{\mathrm{T}}    (55)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{OSC}(d, j)    (56)

\sigma_{MSC\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{OSC}(d, j) - u_{MSC\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2}    (57)

u_{MSV\text{-}col}^{OSC}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{OSC}(d, j)    (58)

\sigma_{MSV\text{-}col}^{OSC}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{OSC}(d, j) - u_{MSV\text{-}col}^{OSC}(j) \right)^2 \right)^{1/2}    (59)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

\mathbf{f}_{col}^{OSC} = [u_{MSC\text{-}col}^{OSC}(0),\, \sigma_{MSC\text{-}col}^{OSC}(0),\, u_{MSV\text{-}col}^{OSC}(0),\, \sigma_{MSV\text{-}col}^{OSC}(0),\, \ldots,\, u_{MSC\text{-}col}^{OSC}(J-1),\, \sigma_{MSC\text{-}col}^{OSC}(J-1),\, u_{MSV\text{-}col}^{OSC}(J-1),\, \sigma_{MSV\text{-}col}^{OSC}(J-1)]^{\mathrm{T}}    (60)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) is obtained:

\mathbf{f}^{OSC} = [(\mathbf{f}_{row}^{OSC})^{\mathrm{T}},\, (\mathbf{f}_{col}^{OSC})^{\mathrm{T}}]^{\mathrm{T}}    (61)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×20 = 80 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMOSC is 80+32 = 112.

2153 Statistical Aggregation of MASE (SMASE)

The modulation spectral feature values derived from the d-th (0 ≤ d < D) row of the MSC and MSV matrices of MASE can be computed as follows:
u_{MSC\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSC^{NASE}(d, j)    (62)

\sigma_{MSC\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSC^{NASE}(d, j) - u_{MSC\text{-}row}^{NASE}(d) \right)^2 \right)^{1/2}    (63)

u_{MSV\text{-}row}^{NASE}(d) = \frac{1}{J} \sum_{j=0}^{J-1} MSV^{NASE}(d, j)    (64)

\sigma_{MSV\text{-}row}^{NASE}(d) = \left( \frac{1}{J} \sum_{j=0}^{J-1} \left( MSV^{NASE}(d, j) - u_{MSV\text{-}row}^{NASE}(d) \right)^2 \right)^{1/2}    (65)

Thus the row-based modulation spectral feature vector of a music track is of size 4D and can be represented as

\mathbf{f}_{row}^{NASE} = [u_{MSC\text{-}row}^{NASE}(0),\, \sigma_{MSC\text{-}row}^{NASE}(0),\, u_{MSV\text{-}row}^{NASE}(0),\, \sigma_{MSV\text{-}row}^{NASE}(0),\, \ldots,\, u_{MSC\text{-}row}^{NASE}(D-1),\, \sigma_{MSC\text{-}row}^{NASE}(D-1),\, u_{MSV\text{-}row}^{NASE}(D-1),\, \sigma_{MSV\text{-}row}^{NASE}(D-1)]^{\mathrm{T}}    (66)

Similarly, the modulation spectral feature values derived from the j-th (0 ≤ j < J) column of the MSC and MSV matrices can be computed as follows:

u_{MSC\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSC^{NASE}(d, j)    (67)

\sigma_{MSC\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSC^{NASE}(d, j) - u_{MSC\text{-}col}^{NASE}(j) \right)^2 \right)^{1/2}    (68)

u_{MSV\text{-}col}^{NASE}(j) = \frac{1}{D} \sum_{d=0}^{D-1} MSV^{NASE}(d, j)    (69)

\sigma_{MSV\text{-}col}^{NASE}(j) = \left( \frac{1}{D} \sum_{d=0}^{D-1} \left( MSV^{NASE}(d, j) - u_{MSV\text{-}col}^{NASE}(j) \right)^2 \right)^{1/2}    (70)

Thus the column-based modulation spectral feature vector of a music track is of size 4J and can be represented as

\mathbf{f}_{col}^{NASE} = [u_{MSC\text{-}col}^{NASE}(0),\, \sigma_{MSC\text{-}col}^{NASE}(0),\, u_{MSV\text{-}col}^{NASE}(0),\, \sigma_{MSV\text{-}col}^{NASE}(0),\, \ldots,\, u_{MSC\text{-}col}^{NASE}(J-1),\, \sigma_{MSC\text{-}col}^{NASE}(J-1),\, u_{MSV\text{-}col}^{NASE}(J-1),\, \sigma_{MSV\text{-}col}^{NASE}(J-1)]^{\mathrm{T}}    (71)

If the row-based modulation spectral feature vector and the column-based modulation spectral feature vector are combined together, a larger feature vector of size (4D+4J) is obtained:

\mathbf{f}^{NASE} = [(\mathbf{f}_{row}^{NASE})^{\mathrm{T}},\, (\mathbf{f}_{col}^{NASE})^{\mathrm{T}}]^{\mathrm{T}}    (72)

In summary, the row-based MSCs (or MSVs) are of size 4D = 4×19 = 76 and the column-based MSCs (or MSVs) are of size 4J = 4×8 = 32. Combining the row-based and column-based modulation spectral feature vectors results in a feature vector of length 4D+4J; that is, the overall feature dimension of SMASE is 76+32 = 108.
[Fig. 28 The row-based modulation spectral feature vector: for each feature dimension d of the MSC/MSV matrices, the mean μ_row,d and standard deviation σ_row,d are computed along the modulation-frequency axis of the texture window.]

[Fig. 29 The column-based modulation spectral feature vector: for each modulation subband j of the MSC/MSV matrices, the mean μ_col,j and standard deviation σ_col,j are computed along the feature-dimension axis.]
216 Feature Vector Normalization
In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:

\bar{\mathbf{f}}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{f}_{c,n}    (73)

where \mathbf{f}_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{\mathbf{f}}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector \hat{\mathbf{f}}_c:

\hat{f}_c(m) = \frac{\bar{f}_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \qquad 1 \le c \le C    (74)
where C is the number of classes, \bar{f}_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{\max}(m) and f_{\min}(m) denote respectively the maximum and minimum of the m-th feature values over all training music signals:

f_{\max}(m) = \max_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1 \le c \le C,\; 1 \le j \le N_c} f_{c,j}(m)    (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
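A short sketch of the training-side normalization in Eqs. (73)-(75), assuming the per-track feature vectors are stacked row-wise in an array; the function name and the small eps guard are illustrative.

```python
import numpy as np

def genre_templates(F, labels, num_classes, eps=1e-12):
    """F: (N, M) feature vectors of all training tracks, labels in {0..C-1}.
    Returns the normalized representative vectors (Eqs. 73-75) and the min/max used,
    so the same scaling can be reapplied to test tracks in the classification phase."""
    f_min, f_max = F.min(axis=0), F.max(axis=0)                                   # Eq. (75)
    reps = np.stack([F[labels == c].mean(axis=0) for c in range(num_classes)])    # Eq. (73)
    reps_hat = (reps - f_min) / (f_max - f_min + eps)                             # Eq. (74)
    return reps_hat, f_min, f_max
```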
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
32
each row (and each column) of the MSC and MSV matrices will be computed as the
feature values
2151 Statistical Aggregation of MMFCC (SMMFCC)
The modulation spectral feature values derived from the l-th (0 le l lt L) row of
the MSC and MSV matrices of MMFCC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSC ljMSC
Jlu (40)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSC
MFCCMFCCrowMSC luljMSC
Jlσ (41)
)(1)(1
0summinus
=minus =
J
j
MFCCMFCCrowMSV ljMSV
Jlu (42)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
MFCCrowMSV
MFCCMFCCrowMSV luljMSV
Jlσ (43)
Thus the row-based modulation spectral feature vector of a music track is of size 4L
and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
LLuLLu
uuMFCC
rowMSVMFCC
rowMSVMFCC
rowMSCMFCC
rowMSC
MFCCrowMSV
MFCCrowMSV
MFCCrowMSC
MFCCrowMSC
MFCCrow
σσ
σσ Lf (44)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSC ljMSC
Lju (45)
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSC
MFCCMFCCcolMSC juljMSC
Ljσ (46)
)(1)(1
0summinus
=minus =
L
l
MFCCMFCCcolMSV ljMSV
Lju (47)
33
))()((1)(211
0
2 ⎟⎠
⎞⎜⎝
⎛minus= sum
minus
=minusminus
L
l
MFCCcolMSV
MFCCMFCCcolMSV juljMSV
Ljσ (48)
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
)]1( )1( )1( )1(
)0( )0( )0( )0([Tminusminusminusminus
=
minusminusminusminus
minusminusminusminus
JJuJJu
uuMFCC
colMSVMFCC
colMSVMFCC
colMSCMFCC
colMSC
MFCCcolMSV
MFCCcolMSV
MFCCcolMSC
MFCCcolMSC
MFCCcol
σσ
σσ Lf (49)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
size (4D+4J) can be obtained
f MFCC= [( )MFCCrowf T ( )MFCC
colf T]T (50)
In summary the row-based MSCs (or MSVs) is of size 4L = 4times20 = 80 and the
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMMFCC is 80+32 = 112
2152 Statistical Aggregation of MOSC (SMOSC)
The modulation spectral feature values derived from the d-th (0 le d lt D) row of
the MSC and MSV matrices of MOSC can be computed as follows
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSC djMSC
Jdu (51)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSC
OSCOSCrowMSC dudjMSC
Jdσ (52)
)(1)(1
0summinus
=minus =
J
j
OSCOSCrowMSV djMSV
Jdu (53)
))()((1)(21
1
0
2⎟⎟⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minusminus
J
j
OSCrowMSV
OSCOSCrowMSV dudjMSV
Jdσ (54)
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
[Fig 28 The row-based modulation spectral feature vector: within a texture window, the mean μ_d^row and standard deviation σ_d^row of the MSC and MSV values are computed for each feature dimension d = 1, ..., D across the modulation-frequency (subband) axis j = 1, ..., J.]

[Fig 29 The column-based modulation spectral feature vector: within a texture window, the mean μ_j^col and standard deviation σ_j^col of the MSC and MSV values are computed for each modulation subband j = 1, ..., J across the feature-dimension axis d = 1, ..., D.]
216 Feature Vector Normalization
In the training phase, the representative feature vector for a specific music genre is derived by averaging the feature vectors of the whole set of training music signals of the same genre:
\bar{f}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} f_{c,n}  (73)

where f_{c,n} denotes the feature vector of the n-th music signal belonging to the c-th music genre, \bar{f}_c is the representative feature vector for the c-th music genre, and N_c is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may be different, a linear normalization is applied to get the normalized feature vector \hat{f}_c:

\hat{f}_c(m) = \frac{f_c(m) - f_{min}(m)}{f_{max}(m) - f_{min}(m)}, \quad 1 \le c \le C  (74)

where C is the number of classes, f_c(m) denotes the m-th feature value of the c-th representative feature vector, and f_{max}(m) and f_{min}(m) denote respectively the maximum and minimum of the m-th feature values of all training music signals:

f_{max}(m) = \max_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m), \quad f_{min}(m) = \min_{1 \le c \le C,\, 1 \le j \le N_c} f_{c,j}(m)  (75)

where f_{c,j}(m) denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
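As an illustration of Eqs. (73)-(75), the short sketch below (assumed NumPy code with hypothetical function names, not the thesis implementation) computes the per-genre representative vectors and applies the linear min-max normalization.

```python
import numpy as np

def genre_centroids(feats, labels, num_classes):
    """Eq. (73): average the training feature vectors belonging to each genre."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

def linear_normalize(x, f_min, f_max):
    """Eqs. (74)-(75): min-max normalize each feature dimension using the
    minimum and maximum observed over all training music signals."""
    scale = np.where(f_max > f_min, f_max - f_min, 1.0)  # guard against constant features
    return (x - f_min) / scale

# Toy example: 10 training vectors of dimension 5, C = 2 genres.
feats = np.random.default_rng(1).random((10, 5))
labels = np.array([0, 1] * 5)
f_min, f_max = feats.min(axis=0), feats.max(axis=0)
normalized_centroids = linear_normalize(genre_centroids(feats, labels, 2), f_min, f_max)
```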
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification accuracy in a lower-dimensional feature space. LDA deals with the discrimination between various classes rather than the representation of all classes. The objective of LDA is to minimize the within-class distance while maximizing the between-class distance. In LDA, an optimal transformation matrix that maps an H-dimensional feature space to an h-dimensional space (h ≤ H) has to be found in order to provide higher discriminability among various music classes.
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix, respectively. The within-class scatter matrix is defined as
S_W = \sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{c,n} - \bar{x}_c)(x_{c,n} - \bar{x}_c)^T  (76)

where x_{c,n} is the n-th feature vector labeled as class c, \bar{x}_c is the mean vector of class c, C is the total number of music classes, and N_c is the number of training vectors labeled as class c. The between-class scatter matrix is given by

S_B = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T  (77)

where \bar{x} is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion J_F, defined as the ratio of between-class scatter to within-class scatter:

J_F(A) = tr\left( (A^T S_W A)^{-1} (A^T S_B A) \right)  (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with the LDA transformation such that the multivariate normal distribution of the set of training vectors becomes a spherical one [23]. First, the eigenvalues and corresponding eigenvectors of S_W are calculated. Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues. Thus S_WΦ = ΦΛ. Each training vector x is then whitening transformed by ΦΛ^{-1/2}:

x_w = (ΦΛ^{-1/2})^T x  (79)

It can be shown that the whitened within-class scatter matrix S_W^w = (ΦΛ^{-1/2})^T S_W (ΦΛ^{-1/2}), derived from all the whitened training vectors, becomes an identity matrix I. Thus the whitened between-class scatter matrix S_B^w = (ΦΛ^{-1/2})^T S_B (ΦΛ^{-1/2}) contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of S_B^w. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C−1) largest eigenvalues will form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix A_WLDA is defined as

A_WLDA = ΦΛ^{-1/2} Ψ  (80)

A_WLDA will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let x denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

y = A_WLDA^T x  (81)
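The whitening-plus-LDA procedure of Eqs. (76)-(81) can be sketched as follows (an illustrative NumPy sketch under the equations above; the small eigenvalue floor and the variable names are assumptions, not part of the thesis).

```python
import numpy as np

def whitened_lda(feats, labels, num_classes):
    """Compute the whitened LDA transformation matrix A_WLDA (Eq. 80)."""
    dim = feats.shape[1]
    mean_all = feats.mean(axis=0)
    s_w = np.zeros((dim, dim))      # within-class scatter, Eq. (76)
    s_b = np.zeros((dim, dim))      # between-class scatter, Eq. (77)
    for c in range(num_classes):
        x_c = feats[labels == c]
        diff = x_c - x_c.mean(axis=0)
        s_w += diff.T @ diff
        d = (x_c.mean(axis=0) - mean_all)[:, None]
        s_b += x_c.shape[0] * (d @ d.T)

    # Whitening: S_W = Phi Lambda Phi^T, whitening matrix Phi Lambda^{-1/2}.
    eigval, phi = np.linalg.eigh(s_w)
    whiten = phi / np.sqrt(np.maximum(eigval, 1e-12))   # floor tiny eigenvalues
    # Eigenvectors of the whitened between-class scatter; keep the C-1 largest.
    eigval_b, psi = np.linalg.eigh(whiten.T @ s_b @ whiten)
    psi = psi[:, np.argsort(eigval_b)[::-1][:num_classes - 1]]
    return whiten @ psi                                  # A_WLDA

# Eq. (81): y = A_WLDA^T x, i.e. for row vectors y = x @ A_WLDA.
```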
23 Music Genre Classification Phase
In the classification phase, the row-based as well as the column-based modulation spectral feature vectors are first extracted from each input music track. The same linear normalization process is applied to each feature value. The normalized feature vector is then transformed into a lower-dimensional feature vector by using the whitened LDA transformation matrix A_WLDA. Let y denote the whitened LDA
transformed feature vector. In this study, the nearest centroid classifier is used for music genre classification. For the c-th (1 ≤ c ≤ C) music genre, the centroid of the whitened LDA transformed feature vectors of all training music tracks labeled as the c-th music genre is regarded as its representative feature vector:
\bar{y}_c = \frac{1}{N_c} \sum_{n=1}^{N_c} y_{c,n}  (82)
where y_{c,n} denotes the whitened LDA transformed feature vector of the n-th music track labeled as the c-th music genre, \bar{y}_c is the representative feature vector of the c-th music genre, and N_c is the number of training music tracks labeled as the c-th music genre. The distance between two feature vectors is measured by the Euclidean distance. Thus, the subject code s that denotes the identified music genre is determined by finding the representative feature vector that has the minimum Euclidean distance to y:
s = \arg\min_{1 \le c \le C} d(y, \bar{y}_c)  (83)
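A minimal nearest-centroid classifier corresponding to Eqs. (82)-(83) is sketched below (illustrative only; names are assumptions).

```python
import numpy as np

def classify_nearest_centroid(y, centroids):
    """Eq. (83): return the genre index whose representative (mean) whitened-LDA
    vector has the minimum Euclidean distance to the transformed input y."""
    return int(np.argmin(np.linalg.norm(centroids - y, axis=1)))

# Hypothetical usage: centroids[c] is the Eq. (82) mean of the whitened-LDA
# transformed training vectors of genre c, and y = A_WLDA.T @ x for a test track.
```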
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison. The database consists of 1458 music tracks, of which 729 music tracks are used for training and the other 729 tracks for testing. The audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this study, each MP3 audio file is first converted into raw digital audio before classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blue, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World music.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
CA = \sum_{1 \le c \le C} P_c \cdot CA_c  (84)
where P_c is the probability of appearance of the c-th music genre and CA_c is the classification accuracy for the c-th music genre.
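For example, Eq. (84) can be evaluated with a few lines (an illustrative sketch; the per-genre test-track counts below are those of the ISMIR2004 test set used in this study):

```python
import numpy as np

def overall_accuracy(per_genre_accuracy, test_counts):
    """Eq. (84): weight each per-genre accuracy CA_c by the genre's share P_c."""
    p = np.asarray(test_counts, dtype=float) / np.sum(test_counts)
    return float(np.sum(p * np.asarray(per_genre_accuracy)))

# Test-set sizes: Classical, Electronic, Jazz/Blue, Metal/Punk, Rock/Pop, World.
test_counts = [320, 114, 26, 45, 102, 122]
```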
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 31 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 32 shows the corresponding confusion matrices.
Table 31 Averaged classification accuracy (CA) for the row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64
Table 32 Confusion matrices of the row-based modulation spectral feature vectors:
(a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1

(a) SMMFCC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       275        0         2       0         1       19
Electronic      0       91         0       1         7        6
Jazz            6        0        18       0         0        4
MetalPunk       2        3         0      36        20        4
PopRock         4       12         5       8        70       14
World          33        8         1       0         4       75
Total         320      114        26      45       102      122

(a) SMMFCC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.94      0.00      7.69     0.00      0.98    15.57
Electronic    0.00     79.82      0.00     2.22      6.86     4.92
Jazz          1.88      0.00     69.23     0.00      0.00     3.28
MetalPunk     0.63      2.63      0.00    80.00     19.61     3.28
PopRock       1.25     10.53     19.23    17.78     68.63    11.48
World        10.31      7.02      3.85     0.00      3.92    61.48

(b) SMOSC1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       292        1         1       0         2       10
Electronic      1       89         1       2        11       11
Jazz            4        0        19       1         1        6
MetalPunk       0        5         0      32        21        3
PopRock         0       13         3      10        61        8
World          23        6         2       0         6       84
Total         320      114        26      45       102      122

(b) SMOSC1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      91.25      0.88      3.85     0.00      1.96     8.20
Electronic    0.31     78.07      3.85     4.44     10.78     9.02
Jazz          1.25      0.00     73.08     2.22      0.98     4.92
MetalPunk     0.00      4.39      0.00    71.11     20.59     2.46
PopRock       0.00     11.40     11.54    22.22     59.80     6.56
World         7.19      5.26      7.69     0.00      5.88    68.85

(c) SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       286        3         1       0         3       18
Electronic      0       87         1       1         9        5
Jazz            5        4        17       0         0        9
MetalPunk       0        4         1      36        18        4
PopRock         1       10         3       7        68       13
World          28        6         3       1         4       73
Total         320      114        26      45       102      122

(c) SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      89.38      2.63      3.85     0.00      2.94    14.75
Electronic    0.00     76.32      3.85     2.22      8.82     4.10
Jazz          1.56      3.51     65.38     0.00      0.00     7.38
MetalPunk     0.00      3.51      3.85    80.00     17.65     3.28
PopRock       0.31      8.77     11.54    15.56     66.67    10.66
World         8.75      5.26     11.54     2.22      3.92    59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        0         1       0         0        9
Electronic      0       96         1       1         9        9
Jazz            2        1        21       0         0        1
MetalPunk       0        1         0      34         8        1
PopRock         1        9         2       9        80       16
World          17        7         1       1         5       86
Total         320      114        26      45       102      122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      0.00      3.85     0.00      0.00     7.38
Electronic    0.00     84.21      3.85     2.22      8.82     7.38
Jazz          0.63      0.88     80.77     0.00      0.00     0.82
MetalPunk     0.00      0.88      0.00    75.56      7.84     0.82
PopRock       0.31      7.89      7.69    20.00     78.43    13.11
World         5.31      6.14      3.85     2.22      4.90    70.49
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 33 we can see that SMASE2 gives better classification accuracy than SMMFCC2 and SMOSC2, which is different from the row-based case. As in the row-based case, the combined feature vector again gets the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 34 Confusion matrices of the column-based modulation spectral feature vectors:
(a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2

(a) SMMFCC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       272        1         1       0         6       22
Electronic      0       84         0       2         8        4
Jazz           13        1        19       1         2       19
MetalPunk       2        7         0      39        30        4
PopRock         0       11         3       3        47       19
World          33       10         3       0         9       54
Total         320      114        26      45       102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      85.00      0.88      3.85     0.00      5.88    18.03
Electronic    0.00     73.68      0.00     4.44      7.84     3.28
Jazz          4.06      0.88     73.08     2.22      1.96    15.57
MetalPunk     0.63      6.14      0.00    86.67     29.41     3.28
PopRock       0.00      9.65     11.54     6.67     46.08    15.57
World        10.31      8.77     11.54     0.00      8.82    44.26

(b) SMOSC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       262        2         0       0         3       33
Electronic      0       83         0       1         9        6
Jazz           17        1        20       0         6       20
MetalPunk       1        5         0      33        21        2
PopRock         0       17         4      10        51       10
World          40        6         2       1        12       51
Total         320      114        26      45       102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      81.88      1.75      0.00     0.00      2.94    27.05
Electronic    0.00     72.81      0.00     2.22      8.82     4.92
Jazz          5.31      0.88     76.92     0.00      5.88    16.39
MetalPunk     0.31      4.39      0.00    73.33     20.59     1.64
PopRock       0.00     14.91     15.38    22.22     50.00     8.20
World        12.50      5.26      7.69     2.22     11.76    41.80

(c) SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       277        0         0       0         2       29
Electronic      0       83         0       1         5        2
Jazz            9        3        17       1         2       15
MetalPunk       1        5         1      35        24        7
PopRock         2       13         1       8        57       15
World          31       10         7       0        12       54
Total         320      114        26      45       102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      86.56      0.00      0.00     0.00      1.96    23.77
Electronic    0.00     72.81      0.00     2.22      4.90     1.64
Jazz          2.81      2.63     65.38     2.22      1.96    12.30
MetalPunk     0.31      4.39      3.85    77.78     23.53     5.74
PopRock       0.63     11.40      3.85    17.78     55.88    12.30
World         9.69      8.77     26.92     0.00     11.76    44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       289        5         0       0         3       18
Electronic      0       89         0       2         4        4
Jazz            2        3        19       0         1       10
MetalPunk       2        2         0      38        21        2
PopRock         0       12         5       4        61       11
World          27        3         2       1        12       77
Total         320      114        26      45       102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      90.31      4.39      0.00     0.00      2.94    14.75
Electronic    0.00     78.07      0.00     4.44      3.92     3.28
Jazz          0.63      2.63     73.08     0.00      0.98     8.20
MetalPunk     0.63      1.75      0.00    84.44     20.59     1.64
PopRock       0.00     10.53     19.23     8.89     59.80     9.02
World         8.44      2.63      7.69     2.22     11.76    63.11
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector gets a better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
Table 35 Averaged classification accuracy (CA) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors:
(a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        2         1       0         3       19
Electronic      0       86         0       1         7        5
Jazz            2        0        18       0         0        3
MetalPunk       1        4         0      35        18        2
PopRock         1       16         4       8        67       13
World          16        6         3       1         7       80
Total         320      114        26      45       102      122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      1.75      3.85     0.00      2.94    15.57
Electronic    0.00     75.44      0.00     2.22      6.86     4.10
Jazz          0.63      0.00     69.23     0.00      0.00     2.46
MetalPunk     0.31      3.51      0.00    77.78     17.65     1.64
PopRock       0.31     14.04     15.38    17.78     65.69    10.66
World         5.00      5.26     11.54     2.22      6.86    65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        0         0       0         1       13
Electronic      0       90         1       2         9        6
Jazz            0        0        21       0         0        4
MetalPunk       0        2         0      31        21        2
PopRock         0       11         3      10        64       10
World          20       11         1       2         7       87
Total         320      114        26      45       102      122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      0.00      0.00     0.00      0.98    10.66
Electronic    0.00     78.95      3.85     4.44      8.82     4.92
Jazz          0.00      0.00     80.77     0.00      0.00     3.28
MetalPunk     0.00      1.75      0.00    68.89     20.59     1.64
PopRock       0.00      9.65     11.54    22.22     62.75     8.20
World         6.25      9.65      3.85     4.44      6.86    71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       296        2         1       0         0       17
Electronic      1       91         0       1         4        3
Jazz            0        2        19       0         0        5
MetalPunk       0        2         1      34        20        8
PopRock         2       13         4       8        71        8
World          21        4         1       2         7       81
Total         320      114        26      45       102      122

(c) SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      92.50      1.75      3.85     0.00      0.00    13.93
Electronic    0.31     79.82      0.00     2.22      3.92     2.46
Jazz          0.00      1.75     73.08     0.00      0.00     4.10
MetalPunk     0.00      1.75      3.85    75.56     19.61     6.56
PopRock       0.63     11.40     15.38    17.78     69.61     6.56
World         6.56      3.51      3.85     4.44      6.86    66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic       300        2         0       0         0        8
Electronic      2       95         0       2         7        9
Jazz            1        1        20       0         0        0
MetalPunk       0        0         0      35        10        1
PopRock         1       10         3       7        79       11
World          16        6         3       1         6       93
Total         320      114        26      45       102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz   MetalPunk  PopRock  World
Classic      93.75      1.75      0.00     0.00      0.00     6.56
Electronic    0.63     83.33      0.00     4.44      6.86     7.38
Jazz          0.31      0.88     76.92     0.00      0.00     0.00
MetalPunk     0.00      0.00      0.00    77.78      9.80     0.82
PopRock       0.31      8.77     11.54    15.56     77.45     9.02
World         5.00      5.26     11.54     2.22      5.88    76.23
Conventional methods use the energy of each modulation subband as the feature value. However, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 shows the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
Table 37 Comparison of the averaged classification accuracy (%) of the MSCs & MSVs and of the energy (MSE) for each feature set

Feature Set                   MSCs & MSVs    MSE
SMMFCC1                          77.50      72.02
SMMFCC2                          70.64      69.82
SMMFCC3                          80.38      79.15
SMOSC1                           79.15      77.50
SMOSC2                           68.59      70.51
SMOSC3                           81.34      80.11
SMASE1                           77.78      76.41
SMASE2                           71.74      71.06
SMASE3                           81.21      79.15
SMMFCC1+SMOSC1+SMASE1            84.64      85.08
SMMFCC2+SMOSC2+SMASE2            78.60      79.01
SMMFCC3+SMOSC3+SMASE3            85.32      85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. The long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically-spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. If the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre: a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New York: Wiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119-139
34
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD OSCrowMSV
OSCrowMSVrow σ
(55)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuOSCMSC
OSCrowMSC
OSCrowMSV
OSCrowMSV
OSCrowMSC
OSCrowMSC
OSCrow
σ
σσ Lf
)(1 1
0)( sum
minus
=minuscolMSC djMSCju (56) =
D
d
OSCOSC
D
))( 2 ⎟⎠
minus minusOSC
colMSC ju (57) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
OSCOSCcolMSV djMSV
Dju (58)
))() 2 ⎟⎠
minus minusOSC
colMSV ju (59) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
OSCOSCcolMSV djMSV
Djσ
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ OSCcolMSV
OSCcolMSV
OSCcolMSC σσ
(60)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times20 = 80 and the
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuOSC
colMSC
OSCcolMSV
OSCcolMSV
OSCcolMSC
OSCcolMSC
OSCcol σσ Lf
size (4D+4J) can be obtained
f OSC= [( OSCrowf )T ( OSC
colf )T]T (61)
In summary the row-base
column-based MSCs (or MSVs) is of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
Fig. 2.8 Row-based statistical aggregation of the modulation spectral matrices: for each feature dimension d, the entries MSC(j, d) and MSV(j, d) across all modulation subbands j = 1, ..., J of the texture window are summarized by a mean and a standard deviation.

Fig. 2.9 Column-based statistical aggregation of the modulation spectral matrices: for each modulation subband j, the entries across all feature dimensions d = 1, ..., D are summarized by a mean and a standard deviation.
2.1.6 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
$\bar{\mathbf{f}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{f}_{c,n}$ (73)

where $\mathbf{f}_{c,n}$ denotes the feature vector of the n-th music signal belonging to the c-th music genre, $\bar{\mathbf{f}}_c$ is the representative feature vector for the c-th music genre, and $N_c$ is the number of training music signals belonging to the c-th music genre. Since the dynamic ranges of different feature values may differ, a linear normalization is applied to get the normalized feature vector $\hat{\mathbf{f}}_c$:

$\hat{f}_c(m) = \dfrac{f_c(m) - f_{\min}(m)}{f_{\max}(m) - f_{\min}(m)}, \quad 1 \le c \le C$ (74)

where C is the number of classes, $\hat{f}_c(m)$ denotes the m-th feature value of the c-th representative feature vector, and $f_{\max}(m)$ and $f_{\min}(m)$ denote respectively the maximum and minimum of the m-th feature values of all training music signals:

$f_{\max}(m) = \max_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m), \qquad f_{\min}(m) = \min_{1\le c\le C,\ 1\le j\le N_c} f_{c,j}(m)$ (75)

where $f_{c,j}(m)$ denotes the m-th feature value of the j-th training music piece belonging to the c-th music genre.
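As an illustration of Eqs. (74)-(75), the short sketch below (an assumed example, not the thesis code) performs the per-dimension min-max normalization on an array train_feats that is taken to hold the feature vectors of all N training music signals, one M-dimensional vector per row.

```python
import numpy as np

def fit_min_max(train_feats):
    # per-dimension minimum and maximum over all training signals, Eq. (75)
    return train_feats.min(axis=0), train_feats.max(axis=0)

def normalize(feats, f_min, f_max):
    # linear mapping of each feature value into [0, 1], Eq. (74);
    # a small epsilon guards against constant feature dimensions
    return (feats - f_min) / (f_max - f_min + 1e-12)
```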
2.2 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximizing the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
$\mathbf{S}_W = \sum_{c=1}^{C}\sum_{n=1}^{N_c}(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)(\mathbf{x}_{c,n} - \bar{\mathbf{x}}_c)^{T}$ (76)

where $\mathbf{x}_{c,n}$ is the n-th feature vector labeled as class c, $\bar{\mathbf{x}}_c$ is the mean vector of class c, C is the total number of music classes, and $N_c$ is the number of training vectors labeled as class c. The between-class scatter matrix is given by

$\mathbf{S}_B = \sum_{c=1}^{C} N_c(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^{T}$ (77)
where $\bar{\mathbf{x}}$ is the mean vector of all training vectors. The most widely used transformation matrix is a linear mapping that maximizes the so-called Fisher criterion $J_F$, defined as the ratio of between-class scatter to within-class scatter:

$J_F(\mathbf{A}) = \mathrm{tr}\left((\mathbf{A}^{T}\mathbf{S}_W\mathbf{A})^{-1}(\mathbf{A}^{T}\mathbf{S}_B\mathbf{A})\right)$ (78)
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space. In this study, a whitening procedure is integrated with the LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues; thus $\mathbf{S}_W\Phi = \Phi\Lambda$. Each training vector $\mathbf{x}$ is then whitening-transformed by $\Phi\Lambda^{-1/2}$:

$\mathbf{x}_w = (\Phi\Lambda^{-1/2})^{T}\mathbf{x}$ (79)
It can be shown that the whitened within-class scatter matrix $\mathbf{S}_{W_w} = (\Phi\Lambda^{-1/2})^{T}\mathbf{S}_W(\Phi\Lambda^{-1/2})$ derived from all the whitened training vectors becomes an identity matrix I. Thus, the whitened between-class scatter matrix $\mathbf{S}_{B_w} = (\Phi\Lambda^{-1/2})^{T}\mathbf{S}_B(\Phi\Lambda^{-1/2})$ contains all the discriminative information. A transformation matrix Ψ can be determined by finding the eigenvectors of $\mathbf{S}_{B_w}$. Assuming that the eigenvalues are sorted in decreasing order, the eigenvectors corresponding to the (C–1) largest eigenvalues form the column vectors of the transformation matrix Ψ. Finally, the optimal whitened LDA transformation matrix $\mathbf{A}_{WLDA}$ is defined as

$\mathbf{A}_{WLDA} = \Phi\Lambda^{-1/2}\Psi$ (80)
$\mathbf{A}_{WLDA}$ will be employed to transform each H-dimensional feature vector into a lower h-dimensional vector. Let $\mathbf{x}$ denote the H-dimensional feature vector; the reduced h-dimensional feature vector can be computed by

$\mathbf{y} = \mathbf{A}_{WLDA}^{T}\mathbf{x}$ (81)
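A compact sketch of the whitened LDA procedure of Eqs. (76)-(81) is given below. It is only an assumed implementation: X is taken to be an (N, H) matrix of training feature vectors with integer class labels y, SciPy's symmetric eigendecomposition is used, and a small regularization of the eigenvalues (not discussed in the text) guards against a singular within-class scatter matrix.

```python
import numpy as np
from scipy.linalg import eigh

def whitened_lda(X, y):
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    H = X.shape[1]
    Sw = np.zeros((H, H))
    Sb = np.zeros((H, H))
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)                            # Eq. (76)
        Sb += len(Xc) * np.outer(mean_c - mean_all, mean_c - mean_all)   # Eq. (77)
    lam, Phi = eigh(Sw)                       # Sw = Phi diag(lam) Phi^T
    lam = np.maximum(lam, 1e-10)              # guard against singular Sw (assumption)
    W = Phi @ np.diag(1.0 / np.sqrt(lam))     # whitening matrix Phi Lambda^{-1/2}
    Sb_w = W.T @ Sb @ W                       # whitened between-class scatter
    _, Psi = eigh(Sb_w)
    Psi = Psi[:, ::-1][:, :len(classes) - 1]  # eigenvectors of the (C-1) largest eigenvalues
    return W @ Psi                            # A_WLDA, Eq. (80)

# y_low = X @ whitened_lda(X_train, labels)   # Eq. (81): project onto C-1 dimensions
```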
2.3 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix $\mathbf{A}_{WLDA}$. Let $\mathbf{y}$ denote the whitened LDA transformed feature vector. In this study, the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
$\bar{\mathbf{y}}_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{y}_{c,n}$ (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
$s = \arg\min_{1\le c\le C} d(\mathbf{y},\ \bar{\mathbf{y}}_c)$ (83)
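The nearest-centroid rule of Eqs. (82)-(83) can be sketched as follows, assuming Y_train contains the whitened-LDA-transformed training vectors and labels their genre indices; the names are illustrative, not taken from the thesis.

```python
import numpy as np

def fit_centroids(Y_train, labels):
    # one representative (mean) vector per genre, Eq. (82)
    classes = np.unique(labels)
    return classes, np.stack([Y_train[labels == c].mean(axis=0) for c in classes])

def classify(y, classes, centroids):
    # Euclidean distance to each genre centroid, decision by Eq. (83)
    d = np.linalg.norm(centroids - y, axis=1)
    return classes[np.argmin(d)]
```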
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 44.1 kHz, 128 kbps, 16 bits per sample, stereo MP3 files. In this
study each MP3 audio file is first converted into raw digital audio before
classification. These music tracks are classified into six classes (that is, C = 6): Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. In summary, the music tracks used for training/testing include 320/320 tracks of Classical, 115/114 tracks of Electronic, 26/26 tracks of Jazz/Blues, 45/45 tracks of Metal/Punk, 101/102 tracks of Rock/Pop, and 122/122 tracks of World.
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
$CA = \sum_{1\le c\le C} P_c \cdot CA_c$ (84)
where $P_c$ is the probability of appearance of the c-th music genre and $CA_c$ is the classification accuracy for the c-th music genre.
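For completeness, a small sketch of Eq. (84); the class counts below correspond to the test-set sizes listed above, while the per-class accuracies passed in the example are placeholders only.

```python
import numpy as np

def overall_accuracy(per_class_acc, class_counts):
    # Eq. (84): weight each class accuracy CA_c by its probability of appearance P_c
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    return float(np.sum(p * np.asarray(per_class_acc)))

# test-set class sizes: Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, World
counts = [320, 114, 26, 45, 102, 122]
ca = overall_accuracy([0.94, 0.84, 0.81, 0.76, 0.78, 0.70], counts)  # placeholder accuracies
```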
3.1 Comparison of row-based modulation spectral feature vectors
Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.
Table 3.1 Averaged classification accuracy (CA, %) for the row-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC1                         77.50
SMOSC1                          79.15
SMASE1                          77.78
SMMFCC1+SMOSC1+SMASE1           84.64
Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature set the matrix is listed first in number of tracks and then in percent; each column corresponds to an actual genre (the column sums equal the number of test tracks per genre) and each row to the classified genre.

(a) SMMFCC1 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         275           0      2          0        1     19
Electronic        0          91      0          1        7      6
Jazz              6           0     18          0        0      4
MetalPunk         2           3      0         36       20      4
PopRock           4          12      5          8       70     14
World            33           8      1          0        4     75
Total           320         114     26         45      102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.94        0.00   7.69       0.00     0.98  15.57
Electronic     0.00       79.82   0.00       2.22     6.86   4.92
Jazz           1.88        0.00  69.23       0.00     0.00   3.28
MetalPunk      0.63        2.63   0.00      80.00    19.61   3.28
PopRock        1.25       10.53  19.23      17.78    68.63  11.48
World         10.31        7.02   3.85       0.00     3.92  61.48

(b) SMOSC1 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         292           1      1          0        2     10
Electronic        1          89      1          2       11     11
Jazz              4           0     19          1        1      6
MetalPunk         0           5      0         32       21      3
PopRock           0          13      3         10       61      8
World            23           6      2          0        6     84
Total           320         114     26         45      102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       91.25        0.88   3.85       0.00     1.96   8.20
Electronic     0.31       78.07   3.85       4.44    10.78   9.02
Jazz           1.25        0.00  73.08       2.22     0.98   4.92
MetalPunk      0.00        4.39   0.00      71.11    20.59   2.46
PopRock        0.00       11.40  11.54      22.22    59.80   6.56
World          7.19        5.26   7.69       0.00     5.88  68.85

(c) SMASE1 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         286           3      1          0        3     18
Electronic        0          87      1          1        9      5
Jazz              5           4     17          0        0      9
MetalPunk         0           4      1         36       18      4
PopRock           1          10      3          7       68     13
World            28           6      3          1        4     73
Total           320         114     26         45      102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       89.38        2.63   3.85       0.00     2.94  14.75
Electronic     0.00       76.32   3.85       2.22     8.82   4.10
Jazz           1.56        3.51  65.38       0.00     0.00   7.38
MetalPunk      0.00        3.51   3.85      80.00    17.65   3.28
PopRock        0.31        8.77  11.54      15.56    66.67  10.66
World          8.75        5.26  11.54       2.22     3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           0      1          0        0      9
Electronic        0          96      1          1        9      9
Jazz              2           1     21          0        0      1
MetalPunk         0           1      0         34        8      1
PopRock           1           9      2          9       80     16
World            17           7      1          1        5     86
Total           320         114     26         45      102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   3.85       0.00     0.00   7.38
Electronic     0.00       84.21   3.85       2.22     8.82   7.38
Jazz           0.63        0.88  80.77       0.00     0.00   0.82
MetalPunk      0.00        0.88   0.00      75.56     7.84   0.82
PopRock        0.31        7.89   7.69      20.00    78.43  13.11
World          5.31        6.14   3.85       2.22     4.90  70.49
3.2 Comparison of column-based modulation spectral feature vectors
Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As before, the combined feature vector again performs the best. Table 3.4 shows the corresponding confusion matrices.
Table 3.3 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC2                         70.64
SMOSC2                          68.59
SMASE2                          71.74
SMMFCC2+SMOSC2+SMASE2           78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. For each feature set the matrix is listed first in number of tracks and then in percent.

(a) SMMFCC2 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         272           1      1          0        6     22
Electronic        0          84      0          2        8      4
Jazz             13           1     19          1        2     19
MetalPunk         2           7      0         39       30      4
PopRock           0          11      3          3       47     19
World            33          10      3          0        9     54
Total           320         114     26         45      102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       85.00        0.88   3.85       0.00     5.88  18.03
Electronic     0.00       73.68   0.00       4.44     7.84   3.28
Jazz           4.06        0.88  73.08       2.22     1.96  15.57
MetalPunk      0.63        6.14   0.00      86.67    29.41   3.28
PopRock        0.00        9.65  11.54       6.67    46.08  15.57
World         10.31        8.77  11.54       0.00     8.82  44.26

(b) SMOSC2 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         262           2      0          0        3     33
Electronic        0          83      0          1        9      6
Jazz             17           1     20          0        6     20
MetalPunk         1           5      0         33       21      2
PopRock           0          17      4         10       51     10
World            40           6      2          1       12     51
Total           320         114     26         45      102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       81.88        1.75   0.00       0.00     2.94  27.05
Electronic     0.00       72.81   0.00       2.22     8.82   4.92
Jazz           5.31        0.88  76.92       0.00     5.88  16.39
MetalPunk      0.31        4.39   0.00      73.33    20.59   1.64
PopRock        0.00       14.91  15.38      22.22    50.00   8.20
World         12.50        5.26   7.69       2.22    11.76  41.80

(c) SMASE2 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         277           0      0          0        2     29
Electronic        0          83      0          1        5      2
Jazz              9           3     17          1        2     15
MetalPunk         1           5      1         35       24      7
PopRock           2          13      1          8       57     15
World            31          10      7          0       12     54
Total           320         114     26         45      102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       86.56        0.00   0.00       0.00     1.96  23.77
Electronic     0.00       72.81   0.00       2.22     4.90   1.64
Jazz           2.81        2.63  65.38       2.22     1.96  12.30
MetalPunk      0.31        4.39   3.85      77.78    23.53   5.74
PopRock        0.63       11.40   3.85      17.78    55.88  12.30
World          9.69        8.77  26.92       0.00    11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         289           5      0          0        3     18
Electronic        0          89      0          2        4      4
Jazz              2           3     19          0        1     10
MetalPunk         2           2      0         38       21      2
PopRock           0          12      5          4       61     11
World            27           3      2          1       12     77
Total           320         114     26         45      102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       90.31        4.39   0.00       0.00     2.94  14.75
Electronic     0.00       78.07   0.00       4.44     3.92   3.28
Jazz           0.63        2.63  73.08       0.00     0.98   8.20
MetalPunk      0.63        1.75   0.00      84.44    20.59   1.64
PopRock        0.00       10.53  19.23       8.89    59.80   9.02
World          8.44        2.63   7.69       2.22    11.76  63.11
3.3 Combination of row-based and column-based modulation spectral feature vectors
Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.
Table 3.5 Averaged classification accuracy (CA, %) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                     CA (%)
SMMFCC3                         80.38
SMOSC3                          81.34
SMASE3                          81.21
SMMFCC3+SMOSC3+SMASE3           85.32
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. For each feature set the matrix is listed first in number of tracks and then in percent.

(a) SMMFCC3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      1          0        3     19
Electronic        0          86      0          1        7      5
Jazz              2           0     18          0        0      3
MetalPunk         1           4      0         35       18      2
PopRock           1          16      4          8       67     13
World            16           6      3          1        7     80
Total           320         114     26         45      102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   3.85       0.00     2.94  15.57
Electronic     0.00       75.44   0.00       2.22     6.86   4.10
Jazz           0.63        0.00  69.23       0.00     0.00   2.46
MetalPunk      0.31        3.51   0.00      77.78    17.65   1.64
PopRock        0.31       14.04  15.38      17.78    65.69  10.66
World          5.00        5.26  11.54       2.22     6.86  65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           0      0          0        1     13
Electronic        0          90      1          2        9      6
Jazz              0           0     21          0        0      4
MetalPunk         0           2      0         31       21      2
PopRock           0          11      3         10       64     10
World            20          11      1          2        7     87
Total           320         114     26         45      102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        0.00   0.00       0.00     0.98  10.66
Electronic     0.00       78.95   3.85       4.44     8.82   4.92
Jazz           0.00        0.00  80.77       0.00     0.00   3.28
MetalPunk      0.00        1.75   0.00      68.89    20.59   1.64
PopRock        0.00        9.65  11.54      22.22    62.75   8.20
World          6.25        9.65   3.85       4.44     6.86  71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         296           2      1          0        0     17
Electronic        1          91      0          1        4      3
Jazz              0           2     19          0        0      5
MetalPunk         0           2      1         34       20      8
PopRock           2          13      4          8       71      8
World            21           4      1          2        7     81
Total           320         114     26         45      102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50        1.75   3.85       0.00     0.00  13.93
Electronic     0.31       79.82   0.00       2.22     3.92   2.46
Jazz           0.00        1.75  73.08       0.00     0.00   4.10
MetalPunk      0.00        1.75   3.85      75.56    19.61   6.56
PopRock        0.63       11.40  15.38      17.78    69.61   6.56
World          6.56        3.51   3.85       4.44     6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300           2      0          0        0      8
Electronic        2          95      0          2        7      9
Jazz              1           1     20          0        0      0
MetalPunk         0           0      0         35       10      1
PopRock           1          10      3          7       79     11
World            16           6      3          1        6     93
Total           320         114     26         45      102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75        1.75   0.00       0.00     0.00   6.56
Electronic     0.63       83.33   0.00       4.44     6.86   7.38
Jazz           0.31        0.88  76.92       0.00     0.00   0.00
MetalPunk      0.00        0.00   0.00      77.78     9.80   0.82
PopRock        0.31        8.77  11.54      15.56    77.45   9.02
World          5.00        5.26  11.54       2.22     5.88  76.23
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 shows the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
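The difference between the two kinds of feature values compared in Table 3.7 can be sketched roughly as follows. The peak/valley computation here is only meant to convey the idea of contrast-style features (the precise MSC/MSV definitions are those given in Chapter 2), and modspec is assumed to hold the portion of one feature dimension's modulation spectrum that falls in one logarithmically-spaced modulation subband.

```python
import numpy as np

def subband_energy(modspec):
    # conventional feature value: total energy of the modulation subband
    return float(np.sum(modspec ** 2))

def subband_contrast_valley(modspec):
    # contrast/valley-style feature values: strength of the peak relative to
    # the valley inside the subband, taken here in the log domain (illustrative)
    peak = np.log(np.max(modspec) + 1e-12)
    valley = np.log(np.min(modspec) + 1e-12)
    return peak - valley, valley   # (MSC-like, MSV-like)
```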
Table 3.7 Comparison of the averaged classification accuracy (%) obtained with MSCs & MSVs versus with the modulation subband energy (MSE) as the feature values

Feature Set                     MSCs & MSVs     MSE
SMMFCC1                               77.50   72.02
SMMFCC2                               70.64   69.82
SMMFCC3                               80.38   79.15
SMOSC1                                79.15   77.50
SMOSC2                                68.59   70.51
SMOSC3                                81.34   80.11
SMASE1                                77.78   76.41
SMASE2                                71.74   71.06
SMASE3                                81.21   79.15
SMMFCC1+SMOSC1+SMASE1                 84.64   85.08
SMMFCC2+SMOSC2+SMASE2                 78.60   79.01
SMMFCC3+SMOSC3+SMASE3                 85.32   85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value. For each spectral/cepstral feature set, a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, where all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined together, the classification accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox "Features and classifiers for the automatic classification of
musical audio signals" Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch "A hierarchical approach to automatic musical
genre classification" in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara "Music genre classification with taxonomy" in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet "Representing musical genre: a state of the art"
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley "Beat tracking with a two state model" in
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook "Pitch Histogram in Audio and
Symbolic Music Information Retrieval" in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen "A computationally efficient multipitch analysis
model" IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L O'Mard "A unitary model of pitch perception" Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg "Robust speech recognition using
the modulation spectrogram" Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton "Modulation-scale analysis for
content identification" IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao "Automatic music classification and
summarization" IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao "Audio signal feature extraction and
classification using local discriminant bases" IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236-1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergstra N Casagrande D Erhan D Eck B Kégl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 'A decision-theoretic generalization of
online learning and an application to boosting' Journal of Computer and System
Sciences 55(1) 119-139
35
rived from the d-th (0 le d lt D) row of
ows
feature dimension of SMOSC is 80+32 = 112
2153 Statistical Aggregation of MASE (SMASE)
The modulation spectral feature values de
the MSC and MSV matrices of MASE can be computed as foll
)(1)(1
0summinus
=minusrowMSC =
J
j
NASENASE djMSCJ
du (62)
( 2⎟⎟minus NAS
wMSCu (63) )))((1)(21
1
0 ⎠
⎞⎜⎜⎝
⎛= sum
minus
=minusminus
J
j
Ero
NASENASErowMSC ddjMSC
Jdσ
)(1)(1
0summinus
=minus =
J
j
NASENASErowMSV djMSV
Jdu (64)
))() 2⎟⎟minus
NASErowMSV du (65) ((1)(
211
0 ⎠
⎞⎜⎜⎝
⎛minus= sum
minus
=minus
J
j
NASENASErowMSV djMSV
Jdσ
Thus the row-based modulation spectral feature vector of a music track is of size 4D
and can be represented as
)1 Tminusminusminus minusminusminus DDuD NASErowMSV
NASErowMSVrow σ
(66)
Similarly the modulation spectral feature values derived from the j-th (0 le j lt J)
column of the MSC and MSV matrices can be computed as follows
)]1( )1( )1( (
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Du
uuNASEMSC
NASErowMSC
NASErowMSV
NASErowMSV
NASErowMSC
NASErowMSC
NASErow
σ
σσ Lf
)(1)(1
0summinus
=minuscolMSC =
D
d
NASENASE djMSCD
ju (67)
))( 2 ⎟⎠
minus minusNASE
colMSC ju (68) )((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSC ljMSC
Djσ
)(1)(1
0summinus
=minus =
D
d
NASENASEcolMSV djMSV
Dju (69)
))() 2 ⎟⎠
minus minusNASE
colMSV ju (70) ((1)(211
0
⎞⎜⎝
⎛= sum
minus
=minus
D
d
NASENASEcolMSV djMSV
Djσ
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
36
Thus the column-based modulation spectral feature vector of a music track is of size
4J and can be represented as
Tminusminusminus minusminusminus JJuJ NASEcolMSV
NASEcolMSV
NASEcolMSC σσ
(71)
If the row-based modulation spectral feature vector and column-based
modulation spectral feature vector are combined together a larger feature vector of
d MSCs (or MSVs) is of size 4D = 4times19 = 76 and the
SC r M is
)]1( )1( )1( )1(
)0( )0( )0( )0([
minus
=
minus
minusminusminusminus
Ju
uuNASE
colMSC
NASEcolMSV
NASEcolMSV
NASEcolMSC
NASEcolMSC
NASEcol σσ Lf
size (4D+4J) can be obtained
f NASE= [( NASErowf )T ( NASE
colf )T]T (72)
In summary the row-base
column-based M s (o SVs) of size 4J = 4times8 = 32 By combining the
row-based modulation spectral feature vector and column-based modulation spectral
feature vector will result in a feature vector of length 4L+4J That is the overall
feature dimension of SMASE is 76+32 = 108
37
MSC(1 2) MSV(1 2)
MSC(2 2)MSV(2 2)
MSC(J 2)MSV(J 2)
MSC(2 D) MSV(2 D)
row
row
2
2
σ
μ
Fig 28 the row-based modulation spectral
Fig 29 the column-based modulation spectral
MSC(1D) MSV(1D)
MSC(1 1) MSV(1 1)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 1)MSV(J 1)
rowD
rowD
σ
μ
row
row
1
1
σ
μ
Modulation Frequency
Texture Window Feature
Dimension
MSC(1D) MSV(1D)
MSC(1 2) MSV(1 2)
MSC(1 1) MSV(1 1)
MSC(2 D) MSV(2 D)
MSC(2 2)MSV(2 2)
MSC(2 1)MSV(2 1)
MSC(J D) MSV(J D)
MSC(J 2) MSV(J 2)
MSC(J 1) MSV(J 1)
Modulation Frequency
Feature Dimension
Texture Window
col
col
1
1
σ
μcol
col
2
2
σ
μ
colJ
colJ
σ
μ
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
3.1 Comparison of row-based modulation spectral feature vectors

Table 3.1 shows the average classification accuracy for each row-based modulation spectral feature vector. In this table, SMMFCC1, SMOSC1, and SMASE1 denote respectively the row-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.1 we can see that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1, and the combined feature vector performs the best. Table 3.2 shows the corresponding confusion matrices.
Table 3.1 Averaged classification accuracy (CA) for row-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC1                          77.50
SMOSC1                           79.15
SMASE1                           77.78
SMMFCC1+SMOSC1+SMASE1            84.64
Table 3.2 Confusion matrices of the row-based modulation spectral feature vectors: (a) SMMFCC1, (b) SMOSC1, (c) SMASE1, (d) SMMFCC1+SMOSC1+SMASE1. For each feature vector the classification counts are given first, followed by the percentages of the tracks of each true genre (rows: classified genre; columns: true genre).

(a) SMMFCC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         275           0     2           0         1     19
Electronic        0          91     0           1         7      6
Jazz              6           0    18           0         0      4
Metal/Punk        2           3     0          36        20      4
Pop/Rock          4          12     5           8        70     14
World            33           8     1           0         4     75
Total           320         114    26          45       102    122

(a) SMMFCC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.94        0.00   7.69        0.00      0.98  15.57
Electronic     0.00       79.82   0.00        2.22      6.86   4.92
Jazz           1.88        0.00  69.23        0.00      0.00   3.28
Metal/Punk     0.63        2.63   0.00       80.00     19.61   3.28
Pop/Rock       1.25       10.53  19.23       17.78     68.63  11.48
World         10.31        7.02   3.85        0.00      3.92  61.48

(b) SMOSC1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         292           1     1           0         2     10
Electronic        1          89     1           2        11     11
Jazz              4           0    19           1         1      6
Metal/Punk        0           5     0          32        21      3
Pop/Rock          0          13     3          10        61      8
World            23           6     2           0         6     84
Total           320         114    26          45       102    122

(b) SMOSC1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       91.25        0.88   3.85        0.00      1.96   8.20
Electronic     0.31       78.07   3.85        4.44     10.78   9.02
Jazz           1.25        0.00  73.08        2.22      0.98   4.92
Metal/Punk     0.00        4.39   0.00       71.11     20.59   2.46
Pop/Rock       0.00       11.40  11.54       22.22     59.80   6.56
World          7.19        5.26   7.69        0.00      5.88  68.85

(c) SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         286           3     1           0         3     18
Electronic        0          87     1           1         9      5
Jazz              5           4    17           0         0      9
Metal/Punk        0           4     1          36        18      4
Pop/Rock          1          10     3           7        68     13
World            28           6     3           1         4     73
Total           320         114    26          45       102    122

(c) SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       89.38        2.63   3.85        0.00      2.94  14.75
Electronic     0.00       76.32   3.85        2.22      8.82   4.10
Jazz           1.56        3.51  65.38        0.00      0.00   7.38
Metal/Punk     0.00        3.51   3.85       80.00     17.65   3.28
Pop/Rock       0.31        8.77  11.54       15.56     66.67  10.66
World          8.75        5.26  11.54        2.22      3.92  59.84

(d) SMMFCC1+SMOSC1+SMASE1 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     1           0         0      9
Electronic        0          96     1           1         9      9
Jazz              2           1    21           0         0      1
Metal/Punk        0           1     0          34         8      1
Pop/Rock          1           9     2           9        80     16
World            17           7     1           1         5     86
Total           320         114    26          45       102    122

(d) SMMFCC1+SMOSC1+SMASE1 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   3.85        0.00      0.00   7.38
Electronic     0.00       84.21   3.85        2.22      8.82   7.38
Jazz           0.63        0.88  80.77        0.00      0.00   0.82
Metal/Punk     0.00        0.88   0.00       75.56      7.84   0.82
Pop/Rock       0.31        7.89   7.69       20.00     78.43  13.11
World          5.31        6.14   3.85        2.22      4.90  70.49
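In each confusion table the percentage matrix is the count matrix normalized by the number of test tracks of each true genre (the Total row); for example, 275/320 = 85.94% for Classical in (a). A small NumPy sketch of this normalization, using the counts of Table 3.2(a) (the variable names are ours):

```python
import numpy as np

# Rows: classified genre; columns: true genre (Table 3.2(a), counts).
counts = np.array([[275,  0,  2,  0,  1, 19],
                   [  0, 91,  0,  1,  7,  6],
                   [  6,  0, 18,  0,  0,  4],
                   [  2,  3,  0, 36, 20,  4],
                   [  4, 12,  5,  8, 70, 14],
                   [ 33,  8,  1,  0,  4, 75]])
totals = counts.sum(axis=0)                      # 320, 114, 26, 45, 102, 122
percentages = 100.0 * counts / totals[None, :]   # column-normalized, as in Table 3.2(a) (%)
```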
3.2 Comparison of column-based modulation spectral feature vectors

Table 3.3 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote respectively the column-based modulation spectral feature vectors derived from modulation spectral analysis of MFCC, OSC, and NASE. From Table 3.3 we can see that SMASE2 provides better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case. As in the row-based case, however, the combined feature vector again performs the best. Table 3.4 shows the corresponding confusion matrices.

Table 3.3 Averaged classification accuracy (CA) for column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 3.4 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2 (same layout as Table 3.2).

(a) SMMFCC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         272           1     1           0         6     22
Electronic        0          84     0           2         8      4
Jazz             13           1    19           1         2     19
Metal/Punk        2           7     0          39        30      4
Pop/Rock          0          11     3           3        47     19
World            33          10     3           0         9     54
Total           320         114    26          45       102    122

(a) SMMFCC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       85.00        0.88   3.85        0.00      5.88  18.03
Electronic     0.00       73.68   0.00        4.44      7.84   3.28
Jazz           4.06        0.88  73.08        2.22      1.96  15.57
Metal/Punk     0.63        6.14   0.00       86.67     29.41   3.28
Pop/Rock       0.00        9.65  11.54        6.67     46.08  15.57
World         10.31        8.77  11.54        0.00      8.82  44.26

(b) SMOSC2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         262           2     0           0         3     33
Electronic        0          83     0           1         9      6
Jazz             17           1    20           0         6     20
Metal/Punk        1           5     0          33        21      2
Pop/Rock          0          17     4          10        51     10
World            40           6     2           1        12     51
Total           320         114    26          45       102    122

(b) SMOSC2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       81.88        1.75   0.00        0.00      2.94  27.05
Electronic     0.00       72.81   0.00        2.22      8.82   4.92
Jazz           5.31        0.88  76.92        0.00      5.88  16.39
Metal/Punk     0.31        4.39   0.00       73.33     20.59   1.64
Pop/Rock       0.00       14.91  15.38       22.22     50.00   8.20
World         12.50        5.26   7.69        2.22     11.76  41.80

(c) SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         277           0     0           0         2     29
Electronic        0          83     0           1         5      2
Jazz              9           3    17           1         2     15
Metal/Punk        1           5     1          35        24      7
Pop/Rock          2          13     1           8        57     15
World            31          10     7           0        12     54
Total           320         114    26          45       102    122

(c) SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       86.56        0.00   0.00        0.00      1.96  23.77
Electronic     0.00       72.81   0.00        2.22      4.90   1.64
Jazz           2.81        2.63  65.38        2.22      1.96  12.30
Metal/Punk     0.31        4.39   3.85       77.78     23.53   5.74
Pop/Rock       0.63       11.40   3.85       17.78     55.88  12.30
World          9.69        8.77  26.92        0.00     11.76  44.26

(d) SMMFCC2+SMOSC2+SMASE2 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         289           5     0           0         3     18
Electronic        0          89     0           2         4      4
Jazz              2           3    19           0         1     10
Metal/Punk        2           2     0          38        21      2
Pop/Rock          0          12     5           4        61     11
World            27           3     2           1        12     77
Total           320         114    26          45       102    122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       90.31        4.39   0.00        0.00      2.94  14.75
Electronic     0.00       78.07   0.00        4.44      3.92   3.28
Jazz           0.63        2.63  73.08        0.00      0.98   8.20
Metal/Punk     0.63        1.75   0.00       84.44     20.59   1.64
Pop/Rock       0.00       10.53  19.23        8.89     59.80   9.02
World          8.44        2.63   7.69        2.22     11.76  63.11
3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 3.5 shows the average classification accuracy of the combination of row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote respectively the combined feature vectors of MFCC, OSC, and NASE. Comparing this table with Table 3.1 and Table 3.3, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a classification accuracy of 85.32%. Table 3.6 shows the corresponding confusion matrices.

Table 3.5 Averaged classification accuracy (CA) for the combination of the row-based and column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3 (same layout as Table 3.2).

(a) SMMFCC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     1           0         3     19
Electronic        0          86     0           1         7      5
Jazz              2           0    18           0         0      3
Metal/Punk        1           4     0          35        18      2
Pop/Rock          1          16     4           8        67     13
World            16           6     3           1         7     80
Total           320         114    26          45       102    122

(a) SMMFCC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   3.85        0.00      2.94  15.57
Electronic     0.00       75.44   0.00        2.22      6.86   4.10
Jazz           0.63        0.00  69.23        0.00      0.00   2.46
Metal/Punk     0.31        3.51   0.00       77.78     17.65   1.64
Pop/Rock       0.31       14.04  15.38       17.78     65.69  10.66
World          5.00        5.26  11.54        2.22      6.86  65.57

(b) SMOSC3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           0     0           0         1     13
Electronic        0          90     1           2         9      6
Jazz              0           0    21           0         0      4
Metal/Punk        0           2     0          31        21      2
Pop/Rock          0          11     3          10        64     10
World            20          11     1           2         7     87
Total           320         114    26          45       102    122

(b) SMOSC3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        0.00   0.00        0.00      0.98  10.66
Electronic     0.00       78.95   3.85        4.44      8.82   4.92
Jazz           0.00        0.00  80.77        0.00      0.00   3.28
Metal/Punk     0.00        1.75   0.00       68.89     20.59   1.64
Pop/Rock       0.00        9.65  11.54       22.22     62.75   8.20
World          6.25        9.65   3.85        4.44      6.86  71.31

(c) SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         296           2     1           0         0     17
Electronic        1          91     0           1         4      3
Jazz              0           2    19           0         0      5
Metal/Punk        0           2     1          34        20      8
Pop/Rock          2          13     4           8        71      8
World            21           4     1           2         7     81
Total           320         114    26          45       102    122

(c) SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       92.50        1.75   3.85        0.00      0.00  13.93
Electronic     0.31       79.82   0.00        2.22      3.92   2.46
Jazz           0.00        1.75  73.08        0.00      0.00   4.10
Metal/Punk     0.00        1.75   3.85       75.56     19.61   6.56
Pop/Rock       0.63       11.40  15.38       17.78     69.61   6.56
World          6.56        3.51   3.85        4.44      6.86  66.39

(d) SMMFCC3+SMOSC3+SMASE3 (counts)
            Classic  Electronic  Jazz  Metal/Punk  Pop/Rock  World
Classic         300           2     0           0         0      8
Electronic        2          95     0           2         7      9
Jazz              1           1    20           0         0      0
Metal/Punk        0           0     0          35        10      1
Pop/Rock          1          10     3           7        79     11
World            16           6     3           1         6     93
Total           320         114    26          45       102    122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic   Jazz  Metal/Punk  Pop/Rock  World
Classic       93.75        1.75   0.00        0.00      0.00   6.56
Electronic     0.63       83.33   0.00        4.44      6.86   7.38
Jazz           0.31        0.88  76.92        0.00      0.00   0.00
Metal/Punk     0.00        0.00   0.00       77.78      9.80   0.82
Pop/Rock       0.31        8.77  11.54       15.56     77.45   9.02
World          5.00        5.26  11.54        2.22      5.88  76.23
Conventional methods use the energy of each modulation subband as the feature value. In this work, the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband are used instead. Table 3.7 compares the classification results of the two approaches. From Table 3.7 we can see that using MSCs and MSVs gives better performance than the conventional modulation subband energy when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote respectively the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.

Table 3.7 Comparison of the averaged classification accuracy (%) of the MSC & MSV features and the modulation subband energy (MSE) for each feature set

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                          77.50          72.02
SMMFCC2                          70.64          69.82
SMMFCC3                          80.38          79.15
SMOSC1                           79.15          77.50
SMOSC2                           68.59          70.51
SMOSC3                           81.34          80.11
SMASE1                           77.78          76.41
SMASE2                           71.74          71.06
SMASE3                           81.21          79.15
SMMFCC1+SMOSC1+SMASE1            84.64          85.08
SMMFCC2+SMOSC2+SMASE2            78.60          79.01
SMMFCC3+SMOSC3+SMASE3            85.32          85.19
Chapter 4
Conclusion

A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy is 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References

[1] G. Tzanetakis, P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, "A comparative study on content-based music genre classification," Proceedings of the ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, "Music type classification by spectral contrast feature," Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, "Features for audio and music classification," Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech and Language Processing, 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. of the 6th Int. Conf. on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Research article: automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, 14 (5) (2004) 716-725.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-665-8, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," Proceedings of the Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65 (2-3) (2006) 473-484.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of online learning and an application to boosting," Journal of Computer and System Sciences, 55 (1) (1997) 119-139.
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
38
216 Feature Vector Normalization
In the training phase the representative feature vector for a specific music genre
is derived by averaging the feature vectors for the whole set of training music signals
of the same genre
sum=
=cN
nnc
cc N 1
1 ff (73)
where denotes the feature vector of the n-th music signal belonging to the c-th
music genre
ncf
cf is the representative feature vector for the c-th music genre and Nc
is the number of training music signals belonging to the c-th music genre Since the
dynamic ranges for variant feature values may be different a linear normalization is
applied to get the normalized feature vector cf
)()()()()(ˆ
minmax
min
mfmfmfmfmf c
c minusminus
= Cc lele1 (74)
where C is the number of classes denotes the m-th feature value of the c-th
representative feature vector and denote respectively the
maximum and minimum of the m-th feature values of all training music signals
)(ˆ mfc
)(max mf )(min mf
(75) )(min)(
)(max)(
11min
11max
mfmf
mfmf
cjNjCc
cjNjCc
c
c
lelelele
lelelele
=
=
where denotes the m-th feature value of the j-th training music piece
belonging to the c-th music genre
)(mfcj
22 Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) [26] aims at improving the classification
39
accuracy at a lower dimensional feature vector space LDA deals with the
discrimination between various classes rather than the representation of all classes
The objective of LDA is to minimize the within-class distance while maximize the
between-class distance In LDA an optimal transformation matrix that maps an
H-dimensional feature space to an h-dimensional space (h le H) has to be found in
order to provide higher discriminability among various music classes
Let SW and SB denote the within-class scatter matrix and between-class scatter
matrix respectively The within-class scatter matrix is defined as
)()(1
T
1sumsum
= =
minusminus=C
ccnc
N
ncnc
c
xxxxSW (76)
where xcn is the n-th feature vector labeled as class c cx is the mean vector of class
c C is the total number of music classes and Nc is the number of training vectors
labeled as class c The between-class scatter matrix is given by
))((1
Tsum=
minusminus=C
ccccN xxxxSB (77)
where x is the mean vector of all training vectors The most widely used
transformation matrix is a linear mapping that maximizes the so-called Fisher
criterion JF defined as the ratio of between-class scatter to within-class scatter
(78) ))()(()( T1T ASAASAA BWminus= trJ F
From the above equation we can see that LDA tries to find a transformation matrix
that maximizes the ratio of between-class scatter to within-class scatter in a
lower-dimensional space In this study a whitening procedure is intergrated with LDA
transformation such that the multivariate normal distribution of the set of training
vectors becomes a spherical one [23] First the eigenvalues and corresponding
eigenvectors of SW are calculated Let Φ denote the matrix whose columns are the
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
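For completeness, a compact sketch of the final classification stage is given below: the combined feature vector of a test track is projected with the whitened LDA matrix and assigned to the genre with the nearest centroid, in the spirit of Eqs. (81) and (83). All names and array shapes are illustrative placeholders rather than the exact implementation used in the experiments.

```python
import numpy as np

def classify_track(feature_vector, A_wlda, centroids):
    """feature_vector: combined modulation spectral feature vector of one test track (H-dim).
    A_wlda: whitened LDA transformation matrix (H x h).
    centroids: per-genre centroids of the transformed training vectors (C x h).
    Returns the index of the predicted genre."""
    y = A_wlda.T @ feature_vector                        # dimensionality reduction, y = A^T x
    distances = np.linalg.norm(centroids - y, axis=1)    # Euclidean distance to each centroid
    return int(np.argmin(distances))                     # nearest-centroid decision

# Toy usage with random placeholders (C = 6 genres, H = 120, h = 5).
rng = np.random.default_rng(0)
A_wlda = rng.standard_normal((120, 5))
centroids = rng.standard_normal((6, 5))
x = rng.standard_normal(120)
predicted_genre = classify_track(x, A_wlda, centroids)
```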
References
[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proc. ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proc. IEEE Int. Conf. on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proc. Int. Conf. on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proc. 4th Int. Conf. on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, vol. 14, no. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proc. 6th Int. Conf. on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proc. 5th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Trans. on Multimedia, vol. 7, no. 6, pp. 1028-1035, Dec. 2005.
[13] J. Jose Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. 6th Int. Conf. on Digital Audio Effects, Sept. 2003, pp. 8-11.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, March 2005, pp. 197-200.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, vol. 32, no. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, and J. C. Sethares, "Beat tracking of musical performance using low-level audio feature," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 12, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histogram in audio and symbolic music information retrieval," in Proc. IRCAM, 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, Nov. 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, vol. 102, no. 3, pp. 1811-1820, Sept. 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 133-141, Mar. 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Trans. on Signal Processing, vol. 52, no. 10, pp. 3023-3035, Oct. 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), July 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, May 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proc. Workshop in Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
40
orthonormal eigenvectors of SW and Λ the diagonal matrix formed by the
corresponding eigenvalues Thus SWΦ = ΦΛ Each training vector x is then
whitening transformed by ΦΛ-12
(79) )( T21 xΦΛx minus=w
It can be shown that the whitened within-class scatter matrix
derived from all the whitened training vectors will
become an identity matrix I Thus the whitened between-class scatter matrix
contains all the discriminative information A
transformation matrix Ψ can be determined by finding the eigenvectors of
Assuming that the eigenvalues are sorted in a decreasing order the eigenvectors
corresponding to the (Cndash1) largest eigenvalues will form the column vectors of the
transformation matrix Ψ Finally the optimal whitened LDA transformation matrix
A
)()( 21T21 minusminus= ΦΛSΦΛS WWw
)()( 21T21 minusminus= ΦΛSΦΛS BBw
wBS
WLDA is defined as
(80) WLDA ΨΦΛA 21minus=
AWLDA will be employed to transform each H-dimensional feature vector to be a lower
h-dimensional vector Let x denote the H-dimensional feature vector the reduced
h-dimensional feature vector can be computed by
(81) T xAy WLDA=
23 Music Genre Classification Phase
In the classification phase the row-based as well as column-based modulation
spectral feature vectors are first extracted from each input music track The same
linear normalization process is applied to each feature value The normalized feature
vector is then transformed to be a lower-dimensional feature vector by using the
whitened LDA transformation matrix AWLDA Let y denotes the whitened LDA
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
41
transformed feature vector In this study the nearest centroid classifier is used for
music genre classification For the c-th (1 le c le C) music genre the centroid of
whitened LDA transformed feature vectors of all training music tracks labeled as the
c-th music genre is regarded as its representative feature vector
sum=
=cN
nnc
cc N 1
1 yy (82)
where ycn denotes the whitened LDA transformed feature vector of the n-th music
track labeled as the c-th music genre cy is the representative feature vector of the
c-th music genre and Nc is the number of training music tracks labeled as the c-th
music genre The distance between two feature vectors is measured by Euclidean
distance Thus the subject code s that denotes the identified music genre is
determined by finding the representative feature vector that has minimum Euclidean
distance to y
) (minarg1
cCc
ds yylele
= (83)
Chapter 3
Experimental Results
The music database employed in the ISMIR2004 Audio Description Contest [33]
was used for performance comparison The database consists of 1458 music tracks in
which 729 music tracks are used for training and the other 729 tracks for testing The
audio file format is 441 kHz 128 kbps 16 bits per sample stereo MP3 files In this
study each MP3 audio file is first converted into raw digital audio before
classification These music tracks are classified into six classes (that is C = 6)
Classical Electronic JazzBlue MetalPunk RockPop and World In summary the
42
music tracks used for trainingtesting include 320320 tracks of Classical 115114
tracks of Electronic 2626 tracks of JazzBlue 4545 tracks of MetalPunk 101102
tracks of RockPop and 122122 tracks of World music genre
Since the music tracks per class are not equally distributed the overall accuracy
of correctly classified genres is evaluated as follows
(84) 1sumlele
sdot=Cc
cc CAPCA
where Pc is the probability of appearance of the c-th music genre CAc is the
classification accuracy for the c-th music genre
31 Comparison of row-based modulation spectral feature vector
Table 31 shows the average classification accuracy for each row-based
modulation spectral feature vector In this table SMMFCC1 SMOSC1 and SMASE1
denote respectively the row-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMOSC1 provides better classification accuracy than SMMFCC1 and SMASE1
and the combined feature vector performs the best Table 32 show the corresponding
confusion matrices
Table 31 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC1 7750 SMOSC1 7915 SMASE1 7778
SMMFCC1+SMOSC1+SMASE1 8464
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 3.6 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3

(a) SMMFCC3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300         2       1         0        3      19
Electronic        0        86       0         1        7       5
Jazz              2         0      18         0        0       3
MetalPunk         1         4       0        35       18       2
PopRock           1        16       4         8       67      13
World            16         6       3         1        7      80
Total           320       114      26        45      102     122

(a) SMMFCC3 (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75      1.75     3.85      0.00     2.94   15.57
Electronic     0.00     75.44     0.00      2.22     6.86    4.10
Jazz           0.63      0.00    69.23      0.00     0.00    2.46
MetalPunk      0.31      3.51     0.00     77.78    17.65    1.64
PopRock        0.31     14.04    15.38     17.78    65.69   10.66
World          5.00      5.26    11.54      2.22     6.86   65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300         0       0         0        1      13
Electronic        0        90       1         2        9       6
Jazz              0         0      21         0        0       4
MetalPunk         0         2       0        31       21       2
PopRock           0        11       3        10       64      10
World            20        11       1         2        7      87
Total           320       114      26        45      102     122

(b) SMOSC3 (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75      0.00     0.00      0.00     0.98   10.66
Electronic     0.00     78.95     3.85      4.44     8.82    4.92
Jazz           0.00      0.00    80.77      0.00     0.00    3.28
MetalPunk      0.00      1.75     0.00     68.89    20.59    1.64
PopRock        0.00      9.65    11.54     22.22    62.75    8.20
World          6.25      9.65     3.85      4.44     6.86   71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         296         2       1         0        0      17
Electronic        1        91       0         1        4       3
Jazz              0         2      19         0        0       5
MetalPunk         0         2       1        34       20       8
PopRock           2        13       4         8       71       8
World            21         4       1         2        7      81
Total           320       114      26        45      102     122

(c) SMASE3 (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       92.50      1.75     3.85      0.00     0.00   13.93
Electronic     0.31     79.82     0.00      2.22     3.92    2.46
Jazz           0.00      1.75    73.08      0.00     0.00    4.10
MetalPunk      0.00      1.75     3.85     75.56    19.61    6.56
PopRock        0.63     11.40    15.38     17.78    69.61    6.56
World          6.56      3.51     3.85      4.44     6.86   66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic         300         2       0         0        0       8
Electronic        2        95       0         2        7       9
Jazz              1         1      20         0        0       0
MetalPunk         0         0       0        35       10       1
PopRock           1        10       3         7       79      11
World            16         6       3         1        6      93
Total           320       114      26        45      102     122

(d) SMMFCC3+SMOSC3+SMASE3 (in %)
            Classic  Electronic   Jazz  MetalPunk  PopRock  World
Classic       93.75      1.75     0.00      0.00     0.00    6.56
Electronic     0.63     83.33     0.00      4.44     6.86    7.38
Jazz           0.31      0.88    76.92      0.00     0.00    0.00
MetalPunk      0.00      0.00     0.00     77.78     9.80    0.82
PopRock        0.31      8.77    11.54     15.56    77.45    9.02
World          5.00      5.26    11.54      2.22     5.88   76.23
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 3.7 compares the classification results of these two approaches. From Table 3.7 we can see that using MSCs and MSVs yields better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC.
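To make the difference between the two summaries concrete, the sketch below derives a contrast, a valley, and an energy value from each logarithmically spaced modulation subband of a single modulation spectrum. It is an illustration only: the exact MSC and MSV definitions and the subband edges follow Chapter 2, so the plain maximum/minimum measures and the example edges used here are assumptions.

import numpy as np

def subband_features(mod_spec, mod_freqs, edges):
    # mod_spec : magnitude modulation spectrum of one feature-value trajectory
    # mod_freqs: modulation frequency (Hz) of each bin
    # edges    : logarithmically spaced subband edges, e.g. [0, 1, 2, 4, 8, 16, 32] Hz
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = mod_spec[(mod_freqs >= lo) & (mod_freqs < hi)]
        peak = np.log(band.max() + 1e-12)        # assumed peak measure
        valley = np.log(band.min() + 1e-12)      # assumed valley measure (MSV)
        contrast = peak - valley                 # assumed contrast measure (MSC)
        energy = float(np.sum(band ** 2))        # conventional subband energy
        out.append((contrast, valley, energy))
    return out

# Toy trajectory: one feature value over 512 frames containing a 4 Hz modulation.
frame_rate, n = 86.0, 512
t = np.arange(n) / frame_rate
traj = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)
spec = np.abs(np.fft.rfft(traj - traj.mean()))
freqs = np.fft.rfftfreq(n, d=1.0 / frame_rate)
bands = [(0, 1), (1, 2), (2, 4), (4, 8), (8, 16), (16, 32)]
for (lo, hi), (msc, msv, mse) in zip(bands, subband_features(spec, freqs, [0, 1, 2, 4, 8, 16, 32])):
    print(f"{lo:>2}-{hi:<2} Hz: MSC={msc:6.2f}  MSV={msv:7.2f}  energy={mse:10.2f}")

In this toy case the 2-4 Hz subband containing the modulation shows a much larger peak-to-valley contrast than the other subbands, which is the kind of structure the energy summary alone does not expose.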
Table 3.7 Comparison of the averaged classification accuracy (%) obtained with MSCs & MSVs and with the subband energy (MSE) as feature values

Feature Set                MSCs & MSVs    MSE
SMMFCC1                       77.50      72.02
SMMFCC2                       70.64      69.82
SMMFCC3                       80.38      79.15
SMOSC1                        79.15      77.50
SMOSC2                        68.59      70.51
SMOSC3                        81.34      80.11
SMASE1                        77.78      76.41
SMASE2                        71.74      71.06
SMASE3                        81.21      79.15
SMMFCC1+SMOSC1+SMASE1         84.64      85.08
SMMFCC2+SMOSC2+SMASE2         78.60      79.01
SMMFCC3+SMOSC3+SMASE3         85.32      85.19
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
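As a schematic illustration of the processing chain summarised above, the sketch below forms a modulation spectrogram from frame-level features by Fourier analysis along the frame (time) axis. It is only a sketch of the idea: the frame rate, segment length, and the statistics actually aggregated in the experiments follow the earlier chapters, so the concrete numbers used here are placeholders.

import numpy as np

def modulation_spectrogram(frame_features):
    # frame_features: (n_frames, n_dims) array of per-frame feature values
    #                 (e.g. MFCC, OSC or NASE coefficients of one track segment).
    # Each feature dimension's trajectory is Fourier-analysed along the frame
    # axis, giving one modulation spectrum per feature dimension.
    x = frame_features - frame_features.mean(axis=0)     # remove the per-dimension DC offset
    return np.abs(np.fft.rfft(x, axis=0)).T              # shape: (n_dims, n_mod_bins)

# Placeholder input: 20-dimensional frame features over ~10 s at 86 frames/s.
rng = np.random.default_rng(0)
frames = rng.standard_normal((860, 20))
ms = modulation_spectrogram(frames)                       # (20, 431) modulation spectrogram
mod_freqs = np.fft.rfftfreq(frames.shape[0], d=1.0 / 86.0)
# Row-based and column-based feature vectors are then obtained by summarising this
# matrix along its rows (feature dimensions) or its columns (modulation subbands),
# e.g. with the MSC/MSV statistics illustrated earlier, before LDA and classification.
print(ms.shape, mod_freqs[:3])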
References

[1] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Trans. on Speech and Audio Processing 10 (3) (2002) 293-302.
[2] T. Li, M. Ogihara, Q. Li, A comparative study on content-based music genre classification, Proceedings of the ACM Conf. on Research and Development in Information Retrieval, 2003, pp. 282-289.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, L. H. Cai, Music type classification by spectral contrast feature, Proceedings of the IEEE International Conference on Multimedia & Expo, vol. 1, 2002, pp. 113-116.
[4] K. West, S. Cox, Features and classifiers for the automatic classification of musical audio signals, Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Trans. on Multimedia 7 (2) (2005) 308-315.
[6] M. F. McKinney, J. Breebaart, Features for audio and music classification, Proceedings of the 4th International Conference on Music Information Retrieval, 2003, pp. 151-158.
[7] J. J. Aucouturier, F. Pachet, Representing music genres: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[8] U. Bağci, E. Erzin, Automatic classification of musical genres using inter-genre similarity, IEEE Signal Processing Letters 14 (8) (2007) 512-524.
[9] A. Meng, P. Ahrendt, J. Larsen, L. K. Hansen, Temporal feature integration for music genre classification, IEEE Trans. on Audio, Speech, and Language Processing 15 (5) (2007) 1654-1664.
[10] T. Lidy, A. Rauber, Evaluation of feature extractors and psycho-acoustic transformations for music genre classification, Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 34-41.
[11] M. Grimaldi, P. Cunningham, A. Kokaram, A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques, Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 102-108.
[12] J. J. Aucouturier, F. Pachet, M. Sandler, "The way it sounds": timbre models for analysis and retrieval of music signals, IEEE Transactions on Multimedia 7 (6) (2005) 1028-1035.
[13] J. Jose Burred, A. Lerch, A hierarchical approach to automatic musical genre classification, Proc. of the 6th Int. Conf. on Digital Audio Effects, September 2003, pp. 8-11.
[14] J. G. A. Barbedo, A. Lopes, Automatic genre classification of musical signals, EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, June 2006.
[15] T. Li, M. Ogihara, Music genre classification with taxonomy, Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, March 2005, pp. 197-200.
[16] J. J. Aucouturier, F. Pachet, Representing musical genre: a state of the art, Journal of New Music Research 32 (1) (2003) 83-93.
[17] H. G. Kim, N. Moreau, T. Sikora, Audio classification based on MPEG-7 spectral basis representation, IEEE Trans. on Circuits and Systems for Video Technology 14 (5) (2004) 716-725.
[18] M. E. P. Davies, M. D. Plumbley, Beat tracking with a two state model, Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Robin, J. C. Sethares, Beat tracking of musical performance using low-level audio feature, IEEE Trans. on Speech and Audio Processing 13 (12) (2005) 275-285.
[20] G. Tzanetakis, A. Ermolinskyi, P. Cook, Pitch histogram in audio and symbolic music information retrieval, Proc. IRCAM, 2002.
[21] T. Tolonen, M. Karjalainen, A computationally efficient multipitch analysis model, IEEE Transactions on Speech and Audio Processing 8 (6) (2000) 708-716.
[22] R. Meddis, L. O'Mard, A unitary model of pitch perception, Journal of the Acoustical Society of America 102 (3) (1997) 1811-1820.
[23] N. Scaringella, G. Zoia, D. Mlynek, Automatic genre classification of music content: a survey, IEEE Signal Processing Magazine 23 (2) (2006) 133-141.
[24] B. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication 25 (1) (1998) 117-132.
[25] S. Sukittanon, L. E. Atlas, J. W. Pitton, Modulation-scale analysis for content identification, IEEE Transactions on Signal Processing 52 (10) (2004) 3023-3035.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, K. W. Eom, A tempo feature via modulation spectrum analysis and its application to music emotion classification, 2006 IEEE International Conference on Multimedia and Expo (ICME), July 2006, pp. 1085-1088.
[27] H. G. Kim, N. Moreau, T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2000.
[29] C. Xu, N. C. Maddage, X. Shao, Automatic music classification and summarization, IEEE Transactions on Speech and Audio Processing 13 (3) (2005) 441-450.
[30] S. Esmaili, S. Krishnan, K. Raahemifar, Content based audio classification and retrieval using joint time-frequency analysis, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, May 2004, pp. V-665-668.
[31] K. Umapathy, S. Krishnan, R. K. Rao, Audio signal feature extraction and classification using local discriminant bases, IEEE Transactions on Audio, Speech, and Language Processing 15 (4) (2007) 1236-1246.
[32] M. Grimaldi, P. Cunningham, A. Kokaram, An evaluation of alternative feature selection strategies and ensemble techniques for classifying music, Proceedings of the Workshop on Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, B. Kégl, Aggregate features and AdaBoost for music classification, Machine Learning 65 (2-3) (2006) 473-484.
[34] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119-139.
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
43
Table 32 Confusion matrices of row-based modulation spectral feature vector (a) SMMFCC1 (b) SMOSC1 (c) SMASE1 (d) SMMFCC1+ SMOSC1+ SMASE1
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 275 0 2 0 1 19 Electronic 0 91 0 1 7 6
Jazz 6 0 18 0 0 4 MetalPunk 2 3 0 36 20 4 PopRock 4 12 5 8 70 14
World 33 8 1 0 4 75 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 8594 000 769 000 098 1557 Electronic 000 7982 000 222 686 492
Jazz 188 000 6923 000 000 328 MetalPunk 063 263 000 8000 1961 328 PopRock 125 1053 1923 1778 6863 1148
World 1031 702 385 000 392 6148
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 292 1 1 0 2 10 Electronic 1 89 1 2 11 11
Jazz 4 0 19 1 1 6 MetalPunk 0 5 0 32 21 3 PopRock 0 13 3 10 61 8
World 23 6 2 0 6 84 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9125 088 385 000 196 820 Electronic 031 7807 385 444 1078 902
Jazz 125 000 7308 222 098 492 MetalPunk 000 439 000 7111 2059 246 PopRock 000 1140 1154 2222 5980 656
World 719 526 769 000 588 6885
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
44
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 286 3 1 0 3 18 Electronic 0 87 1 1 9 5
Jazz 5 4 17 0 0 9 MetalPunk 0 4 1 36 18 4 PopRock 1 10 3 7 68 13
World 28 6 3 1 4 73 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 8938 263 385 000 294 1475 Electronic 000 7632 385 222 882 410
Jazz 156 351 6538 000 000 738 MetalPunk 000 351 385 8000 1765 328 PopRock 031 877 1154 1556 6667 1066
World 875 526 1154 222 392 5984
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 1 0 0 9 Electronic 0 96 1 1 9 9
Jazz 2 1 21 0 0 1 MetalPunk 0 1 0 34 8 1 PopRock 1 9 2 9 80 16
World 17 7 1 1 5 86 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 385 000 000 738 Electronic 000 8421 385 222 882 738
Jazz 063 088 8077 000 000 082 MetalPunk 000 088 000 7556 784 082 PopRock 031 789 769 2000 7843 1311
World 531 614 385 222 490 7049
45
32 Comparison of column-based modulation spectral feature vector
Table 33 shows the average classification accuracy for each column-based
modulation spectral feature vector In this table SMMFCC2 SMOSC2 and SMASE2
denote respectively the column-based modulation spectral feature vector derived form
modulation spectral analysis of MFCC OSC and NASE From table 31 we can see
that SMASE2 has the better classification accuracy than SMMFCC2 and SMASE2
which is different from the row-based With the same result the combined feature
vector also get the best performance Table 34 show the corresponding confusion
matrices
Table 33 Averaged classification accuracy (CA ) for row-based modulation
Feature Set CA
SMMFCC2 7064 SMOSC2 6859 SMASE2 7174
SMMFCC2+SMOSC2+SMASE2 7860
Table 34 Confusion matrices of column-based modulation spectral feature vector (a) SMMFCC2 (b) SMOSC2 (c) SMASE2 (d) SMMFCC2+ SMOSC2+ SMASE2
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 272 1 1 0 6 22 Electronic 0 84 0 2 8 4
Jazz 13 1 19 1 2 19 MetalPunk 2 7 0 39 30 4 PopRock 0 11 3 3 47 19
World 33 10 3 0 9 54 Total 320 114 26 45 102 122
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
45
3.2 Comparison of column-based modulation spectral feature vectors

Table 33 shows the average classification accuracy for each column-based modulation spectral feature vector. In this table, SMMFCC2, SMOSC2, and SMASE2 denote the column-based modulation spectral feature vectors derived from the modulation spectral analysis of MFCC, OSC, and NASE, respectively. From Table 33 we can see that SMASE2 achieves better classification accuracy than SMMFCC2 and SMOSC2, which differs from the row-based case, where SMOSC1 performed best. As in the row-based case, however, the combined feature vector again gives the best performance. Table 34 shows the corresponding confusion matrices.
Table 33 Averaged classification accuracy (CA, %) for the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC2                          70.64
SMOSC2                           68.59
SMASE2                           71.74
SMMFCC2+SMOSC2+SMASE2            78.60
Table 34 Confusion matrices of the column-based modulation spectral feature vectors: (a) SMMFCC2, (b) SMOSC2, (c) SMASE2, (d) SMMFCC2+SMOSC2+SMASE2. Columns correspond to the true genre and rows to the predicted genre; the first matrix of each pair gives the number of tracks, the second the percentage of the column total.

(a) SMMFCC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        272        1        1       0         6       22
Electronic       0       84        0       2         8        4
Jazz            13        1       19       1         2       19
MetalPunk        2        7        0      39        30        4
PopRock          0       11        3       3        47       19
World           33       10        3       0         9       54
Total          320      114       26      45       102      122

(a) SMMFCC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      85.00      0.88     3.85     0.00      5.88   18.03
Electronic    0.00     73.68     0.00     4.44      7.84    3.28
Jazz          4.06      0.88    73.08     2.22      1.96   15.57
MetalPunk     0.63      6.14     0.00    86.67     29.41    3.28
PopRock       0.00      9.65    11.54     6.67     46.08   15.57
World        10.31      8.77    11.54     0.00      8.82   44.26

(b) SMOSC2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        262        2        0       0         3       33
Electronic       0       83        0       1         9        6
Jazz            17        1       20       0         6       20
MetalPunk        1        5        0      33        21        2
PopRock          0       17        4      10        51       10
World           40        6        2       1        12       51
Total          320      114       26      45       102      122

(b) SMOSC2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      81.88      1.75     0.00     0.00      2.94   27.05
Electronic    0.00     72.81     0.00     2.22      8.82    4.92
Jazz          5.31      0.88    76.92     0.00      5.88   16.39
MetalPunk     0.31      4.39     0.00    73.33     20.59    1.64
PopRock       0.00     14.91    15.38    22.22     50.00    8.20
World        12.50      5.26     7.69     2.22     11.76   41.80

(c) SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        277        0        0       0         2       29
Electronic       0       83        0       1         5        2
Jazz             9        3       17       1         2       15
MetalPunk        1        5        1      35        24        7
PopRock          2       13        1       8        57       15
World           31       10        7       0        12       54
Total          320      114       26      45       102      122

(c) SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      86.56      0.00     0.00     0.00      1.96   23.77
Electronic    0.00     72.81     0.00     2.22      4.90    1.64
Jazz          2.81      2.63    65.38     2.22      1.96   12.30
MetalPunk     0.31      4.39     3.85    77.78     23.53    5.74
PopRock       0.63     11.40     3.85    17.78     55.88   12.30
World         9.69      8.77    26.92     0.00     11.76   44.26

(d) SMMFCC2+SMOSC2+SMASE2 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        289        5        0       0         3       18
Electronic       0       89        0       2         4        4
Jazz             2        3       19       0         1       10
MetalPunk        2        2        0      38        21        2
PopRock          0       12        5       4        61       11
World           27        3        2       1        12       77
Total          320      114       26      45       102      122

(d) SMMFCC2+SMOSC2+SMASE2 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      90.31      4.39     0.00     0.00      2.94   14.75
Electronic    0.00     78.07     0.00     4.44      3.92    3.28
Jazz          0.63      2.63    73.08     0.00      0.98    8.20
MetalPunk     0.63      1.75     0.00    84.44     20.59    1.64
PopRock       0.00     10.53    19.23     8.89     59.80    9.02
World         8.44      2.63     7.69     2.22     11.76   63.11
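For reference, the averaged classification accuracy and the percentage form of each confusion matrix can be reproduced directly from the raw counts. The following is a minimal illustrative sketch of that computation; the variable names and the use of NumPy are our own and not part of the original experiments.

```python
import numpy as np

genres = ["Classic", "Electronic", "Jazz", "MetalPunk", "PopRock", "World"]

# Raw confusion matrix of SMMFCC2 from Table 34(a):
# rows = predicted genre, columns = true genre.
counts = np.array([
    [272,  1,  1,  0,  6, 22],
    [  0, 84,  0,  2,  8,  4],
    [ 13,  1, 19,  1,  2, 19],
    [  2,  7,  0, 39, 30,  4],
    [  0, 11,  3,  3, 47, 19],
    [ 33, 10,  3,  0,  9, 54],
])

# Averaged classification accuracy (CA): correctly classified tracks
# (the diagonal) divided by the total number of tracks.
ca = np.trace(counts) / counts.sum()
print(f"CA = {100 * ca:.2f}%")          # 70.64%, as in Table 33

# Percentage form: normalize each column by its total, i.e. by the
# number of tracks that truly belong to that genre.
percent = 100 * counts / counts.sum(axis=0, keepdims=True)
print(np.round(percent, 2))             # matches the second matrix of Table 34(a)
```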
3.3 Combination of row-based and column-based modulation spectral feature vectors

Table 35 shows the average classification accuracy of the combination of the row-based and column-based modulation spectral feature vectors. SMMFCC3, SMOSC3, and SMASE3 denote the combined feature vectors of MFCC, OSC, and NASE, respectively. Comparing this table with Table 31 and Table 33, we can see that the combined feature vector achieves better classification performance than each individual row-based or column-based feature vector. In particular, the proposed method (SMMFCC3+SMOSC3+SMASE3) achieves a classification accuracy of 85.32%. Table 36 shows the corresponding confusion matrices.
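As a concrete illustration, the sketch below shows one straightforward way to form the combined vectors. It assumes (our assumption, not stated explicitly in this section) that "combination" means concatenating the row-based and column-based vectors of a feature set, and likewise concatenating the vectors of the three feature sets; all dimensions are placeholders.

```python
import numpy as np

def combine(row_based: np.ndarray, column_based: np.ndarray) -> np.ndarray:
    """Form the combined modulation spectral feature vector, assuming the
    combination is a simple concatenation of the two vectors."""
    return np.concatenate([row_based, column_based])

# Hypothetical per-track feature vectors (dimensions are placeholders).
sm_mfcc1, sm_mfcc2 = np.random.rand(120), np.random.rand(80)
sm_osc1,  sm_osc2  = np.random.rand(120), np.random.rand(80)
sm_ase1,  sm_ase2  = np.random.rand(120), np.random.rand(80)

sm_mfcc3 = combine(sm_mfcc1, sm_mfcc2)   # SMMFCC3
sm_osc3  = combine(sm_osc1,  sm_osc2)    # SMOSC3
sm_ase3  = combine(sm_ase1,  sm_ase2)    # SMASE3

# The best-performing configuration combines all three feature sets.
feature_vector = np.concatenate([sm_mfcc3, sm_osc3, sm_ase3])
```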
Table 35 Averaged classification accuracy (CA, %) for the combination of the row-based and the column-based modulation spectral feature vectors

Feature Set                      CA (%)
SMMFCC3                          80.38
SMOSC3                           81.34
SMASE3                           81.21
SMMFCC3+SMOSC3+SMASE3            85.32
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vectors: (a) SMMFCC3, (b) SMOSC3, (c) SMASE3, (d) SMMFCC3+SMOSC3+SMASE3. Columns correspond to the true genre and rows to the predicted genre; the first matrix of each pair gives the number of tracks, the second the percentage of the column total.

(a) SMMFCC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        1       0         3       19
Electronic       0       86        0       1         7        5
Jazz             2        0       18       0         0        3
MetalPunk        1        4        0      35        18        2
PopRock          1       16        4       8        67       13
World           16        6        3       1         7       80
Total          320      114       26      45       102      122

(a) SMMFCC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     3.85     0.00      2.94   15.57
Electronic    0.00     75.44     0.00     2.22      6.86    4.10
Jazz          0.63      0.00    69.23     0.00      0.00    2.46
MetalPunk     0.31      3.51     0.00    77.78     17.65    1.64
PopRock       0.31     14.04    15.38    17.78     65.69   10.66
World         5.00      5.26    11.54     2.22      6.86   65.57

(b) SMOSC3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        0        0       0         1       13
Electronic       0       90        1       2         9        6
Jazz             0        0       21       0         0        4
MetalPunk        0        2        0      31        21        2
PopRock          0       11        3      10        64       10
World           20       11        1       2         7       87
Total          320      114       26      45       102      122

(b) SMOSC3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      0.00     0.00     0.00      0.98   10.66
Electronic    0.00     78.95     3.85     4.44      8.82    4.92
Jazz          0.00      0.00    80.77     0.00      0.00    3.28
MetalPunk     0.00      1.75     0.00    68.89     20.59    1.64
PopRock       0.00      9.65    11.54    22.22     62.75    8.20
World         6.25      9.65     3.85     4.44      6.86   71.31

(c) SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        296        2        1       0         0       17
Electronic       1       91        0       1         4        3
Jazz             0        2       19       0         0        5
MetalPunk        0        2        1      34        20        8
PopRock          2       13        4       8        71        8
World           21        4        1       2         7       81
Total          320      114       26      45       102      122

(c) SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      92.50      1.75     3.85     0.00      0.00   13.93
Electronic    0.31     79.82     0.00     2.22      3.92    2.46
Jazz          0.00      1.75    73.08     0.00      0.00    4.10
MetalPunk     0.00      1.75     3.85    75.56     19.61    6.56
PopRock       0.63     11.40    15.38    17.78     69.61    6.56
World         6.56      3.51     3.85     4.44      6.86   66.39

(d) SMMFCC3+SMOSC3+SMASE3 (number of tracks)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic        300        2        0       0         0        8
Electronic       2       95        0       2         7        9
Jazz             1        1       20       0         0        0
MetalPunk        0        0        0      35        10        1
PopRock          1       10        3       7        79       11
World           16        6        3       1         6       93
Total          320      114       26      45       102      122

(d) SMMFCC3+SMOSC3+SMASE3 (%)
            Classic  Electronic  Jazz  MetalPunk  PopRock  World
Classic      93.75      1.75     0.00     0.00      0.00    6.56
Electronic    0.63     83.33     0.00     4.44      6.86    7.38
Jazz          0.31      0.88    76.92     0.00      0.00    0.00
MetalPunk     0.00      0.00     0.00    77.78      9.80    0.82
PopRock       0.31      8.77    11.54    15.56     77.45    9.02
World         5.00      5.26    11.54     2.22      5.88   76.23
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband as the feature values. Table 37 compares the classification results of these two approaches. From Table 37 we can see that using MSCs and MSVs gives better performance than the conventional energy-based method when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote the row-based, column-based, and combined feature vectors derived from the modulation spectral analysis of MFCC, respectively.
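To make the distinction concrete, the sketch below computes MSC and MSV values from the modulation spectrum of one feature trajectory and, for comparison, the conventional subband energy. It assumes, analogously to octave-based spectral contrast, that the valley is the minimum magnitude within a logarithmically spaced modulation subband and the contrast is the difference between the subband peak and valley; the subband edges and array names are illustrative only, not the exact settings used in the experiments.

```python
import numpy as np

def msc_msv(mod_spectrum: np.ndarray, subband_edges: list[tuple[int, int]]):
    """Compute modulation spectral contrasts (MSCs) and valleys (MSVs).

    mod_spectrum : magnitude modulation spectrum of one feature dimension,
                   indexed by modulation-frequency bin.
    subband_edges: (low, high) bin ranges of the logarithmically spaced
                   modulation subbands (illustrative values below).
    """
    mscs, msvs = [], []
    for lo, hi in subband_edges:
        band = mod_spectrum[lo:hi]
        peak, valley = band.max(), band.min()
        mscs.append(peak - valley)   # modulation spectral contrast
        msvs.append(valley)          # modulation spectral valley
    return np.array(mscs), np.array(msvs)

def subband_energy(mod_spectrum, subband_edges):
    """Conventional feature: the energy of each modulation subband."""
    return np.array([np.sum(mod_spectrum[lo:hi] ** 2)
                     for lo, hi in subband_edges])

# Example with hypothetical subband edges (bins) and a random spectrum.
edges = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64)]
spectrum = np.abs(np.random.randn(64))
mscs, msvs = msc_msv(spectrum, edges)
energies = subband_energy(spectrum, edges)
```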
Table 37 Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the modulation subband energy (MSE) as the feature values

Feature Set                      MSCs & MSVs    MSE
SMMFCC1                              77.50     72.02
SMMFCC2                              70.64     69.82
SMMFCC3                              80.38     79.15
SMOSC1                               79.15     77.50
SMOSC2                               68.59     70.51
SMOSC3                               81.34     80.11
SMASE1                               77.78     76.41
SMASE2                               71.74     71.06
SMASE3                               81.21     79.15
SMMFCC1+SMOSC1+SMASE1                84.64     85.08
SMMFCC2+SMOSC2+SMASE2                78.60     79.01
SMMFCC3+SMOSC3+SMASE3                85.32     85.19
Chapter 4
Conclusion
A novel feature set derived from the modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband. Statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
46
(a) Classic Electronic Jazz MetalPunk PopRock World Classic 8500 088 385 000 588 1803
Electronic 000 7368 000 444 784 328 Jazz 406 088 7308 222 196 1557
MetalPunk 063 614 000 8667 2941 328 PopRock 000 965 1154 667 4608 1557
World 1031 877 1154 000 882 4426
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 262 2 0 0 3 33 Electronic 0 83 0 1 9 6
Jazz 17 1 20 0 6 20 MetalPunk 1 5 0 33 21 2 PopRock 0 17 4 10 51 10
World 40 6 2 1 12 51 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 8188 175 000 000 294 2705 Electronic 000 7281 000 222 882 492
Jazz 531 088 7692 000 588 1639 MetalPunk 031 439 000 7333 2059 164 PopRock 000 1491 1538 2222 5000 820
World 1250 526 769 222 1176 4180
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 277 0 0 0 2 29 Electronic 0 83 0 1 5 2
Jazz 9 3 17 1 2 15 MetalPunk 1 5 1 35 24 7 PopRock 2 13 1 8 57 15
World 31 10 7 0 12 54 Total 320 114 26 45 102 122
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
47
(c) Classic Electronic Jazz MetalPunk PopRock World Classic 8656 000 000 000 196 2377
Electronic 000 7281 000 222 490 164 Jazz 281 263 6538 222 196 1230
MetalPunk 031 439 385 7778 2353 574 PopRock 063 1140 385 1778 5588 1230
World 969 877 2692 000 1176 4426
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 289 5 0 0 3 18 Electronic 0 89 0 2 4 4
Jazz 2 3 19 0 1 10 MetalPunk 2 2 0 38 21 2 PopRock 0 12 5 4 61 11
World 27 3 2 1 12 77 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9031 439 000 000 294 1475 Electronic 000 7807 000 444 392 328
Jazz 063 263 7308 000 098 820 MetalPunk 063 175 000 8444 2059 164 PopRock 000 1053 1923 889 5980 902
World 844 263 769 222 1176 6311
33 Combination of row-based and column-based modulation
spectral feature vectors
Table 35 shows the average classification accuracy of the combination of
row-based and column-based modulation spectral feature vectors SMMFCC3
SMOSC3 and SMASE3 denote respectively the combined feature vectors of MFCC
OSC and NASE Comparing this table with Table31 and Table33 we can see that
the combined feature vector will get a better classification performance than each
individual row-based or column-based feature vector Especially the proposed
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
48
method (SMMFCC3+SMOSC3+SMASE3) achieves a high classification accuracy of
8532 Table 36 shows the corresponding confusion matrices
Table 35 Averaged classification accuracy (CA ) for the combination of the row-based and the-column based modulation
Feature Set CA
SMMFCC3 8038 SMOSC3 8134 SMASE3 8121
SMMFCC3+SMOSC3+SMASE3 8532
Table 36 Confusion matrices of the combination of row-based and column-based modulation spectral feature vector
(a) SMMFCC3 (b) SMOSC3 (c) SMASE3 (d) SMMFCC3+ SMOSC3+ SMASE3 (a) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 1 0 3 19 Electronic 0 86 0 1 7 5
Jazz 2 0 18 0 0 3 MetalPunk 1 4 0 35 18 2 PopRock 1 16 4 8 67 13
World 16 6 3 1 7 80 Total 320 114 26 45 102 122
(a) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 385 000 294 1557 Electronic 000 7544 000 222 686 410
Jazz 063 000 6923 000 000 246 MetalPunk 031 351 000 7778 1765 164 PopRock 031 1404 1538 1778 6569 1066
World 500 526 1154 222 686 6557
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
[10] T Lidy A Rauber Evaluation of feature extractors and psycho-acoustic
transformations for music genre classification Proceedings of the 6th
International Conference on Music Information Retrieval 2005 pp 34-41
[11] M Grimaldi P Cunningham A Kokaram A wavelet packet representation of
audio signals for music genre classification using different ensemble and feature
selection techniques Proceedings of the 5th ACM SIGMM International
Workshop on Multimedia Information Retrieval 2003 pp102-108
[12] J J Aucouturier F Pachet and M Sandler The way it sounds timbre
models for analysis and retrieval of music signals IEEE Transactions on
Multimedia Vol 7 Issue 6 pp1028 - 1035 Dec 2005
[13] J Jose Burred and A Lerch ldquoA hierarchical approach to automatic musical
genre classificationrdquo in Proc of the 6th Int Conf on Digital Audio Effects pp
8-11 September 2003
[14] J G A Barbedo and A Lopes Research article automatic genre classification
of musical signals EURASIP Journal on Advances in Signal Processing Vol
2007 pp1-12 June 2006
[15] T Li and M Ogihara ldquoMusic genre classification with taxonomyrdquo in Proc of
IEEE Int Conf on Acoustics Speech and Signal Processing Vol 5 pp 197-200
March 2005
[16] J J Aucouturier and F Pachet ldquoRepresenting musical genre a state of the artrdquo
Journal of new musical research Vol 32 No 1 pp 83-93 2003
[17] H G Kim N Moreau T Sikora Audio classification based on MPEG-7 spectral
basis representation IEEE Trans On Circuits and Systems for Video Technology
14 (5) (2004) 716-725
[18] M E P Davies and M D Plumbley ldquoBeat tracking with a two state modelrdquo in
54
Proc Int Conf on Acoustic Speech and Signal Processing (ICASSP) 2005
[19] W A Sethares R D Robin J C Sethares Beat tracking of musical
performance using low-level audio feature IEEE Trans on Speech and Audio
Processing 13 (12) (2005) 275-285
[20] G Tzanetakis A Ermolinskyi and P Cook ldquoPitch Histogram in Audio and
Symbolic Music Information Retrievalrdquo in Proc IRCAM 2002
[21] T Tolonen and M Karjalainen ldquoA computationally efficient multipitch analysis
modelrdquo IEEE Transactions on Speech and Audio Processing Vol 8 No 6 pp
708-716 November 2000
[22] R Meddis and L OrsquoMard ldquoA unitary model of pitch perceptionrdquo Acoustical
Society of America Vol 102 No 3 pp 1811-1820 September 1997
[23] N Scaringella G Zoia and D Mlynek Automatic genre classification of music
content a survey IEEE Signal Processing Magazine Vol 23 Issue 2 pp133 -
141 Mar 2006
[24] B Kingsbury N Morgan and S Greenberg ldquoRobust speech recognition using
the modulation spectrogramldquo Speech Commun Vol 25 No 1 pp117-132
1998
[25] S Sukittanon L E Atlas and J W Pitton ldquoModulation-scale analysis for
content identificationrdquo IEEE Transactions on signal processing Vol 52 No 10
pp3023-3035 October 2004
[26] Y Y Shi X Zhu H G Kim and K W Eom A tempo feature via modulation
spectrum analysis and its application to music emotion classification in 2006
IEEE International Conference on Multimedia and Expo (ICME) pp1085-1088
July 2006
[27] H G Kim N Moreau T Sikora MPEG-7 Audio and Beyond audio content
55
indexing and retrieval Wiley 2005
[28] R Duda P Hart and D Stork Pattern Classification New YorkWiley 2000
[29] C Xu N C Maddage and X Shao ldquoAutomatic music classification and
summarizationrdquo IEEE Transactions on Speech and Audio Processing Vol 13
No 3 pp 441-450 May 2005
[30] S Esmaili S Krishnan and K Raahemifar Content based audio classification
and retrieval using joint time-frequency analysis in 2004 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) Vol 5 ppV
- 665-8 May 2004
[31] K Umapathy S Krishnan and R K Rao ldquoAudio signal feature extraction and
classification using local discriminant basesrdquo IEEE Transactions on Audio
Speech and Language Processing Vol 15 Issue 4 pp1236 ndash 1246 May 2007
[32] M Grimaldi P Cunningham A Kokaram An evaluation of alternative feature
selection strategies and ensemble techniques for classifying music Proceedings
of Workshop in Multimedia Discovery and Mining 2003
[33] J Bergatra N Casagrande D Erhan D Eck B Keacutegl Aggregate features and
Adaboost for music classification Machine Learning 65 (2-3) (2006) 473-484
[34] Freund Y and R E Schapire 1997 lsquoA decision-theoretic generalization of
online learning and an application to boostingrsquo Journal of Computer and System
Sciences 55(1) 119ndash139
49
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 0 0 0 1 13 Electronic 0 90 1 2 9 6
Jazz 0 0 21 0 0 4 MetalPunk 0 2 0 31 21 2 PopRock 0 11 3 10 64 10
World 20 11 1 2 7 87 Total 320 114 26 45 102 122
(b) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 000 000 000 098 1066 Electronic 000 7895 385 444 882 492
Jazz 000 000 8077 000 000 328 MetalPunk 000 175 000 6889 2059 164 PopRock 000 965 1154 2222 6275 820
World 625 965 385 444 686 7131
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 296 2 1 0 0 17 Electronic 1 91 0 1 4 3
Jazz 0 2 19 0 0 5 MetalPunk 0 2 1 34 20 8 PopRock 2 13 4 8 71 8
World 21 4 1 2 7 81 Total 320 114 26 45 102 122
(c) Classic Electronic Jazz MetalPunk PopRock World
Classic 9250 175 385 000 000 1393 Electronic 031 7982 000 222 392 246
Jazz 000 175 7308 000 000 410 MetalPunk 000 175 385 7556 1961 656 PopRock 063 1140 1538 1778 6961 656
World 656 351 385 444 686 6639
50
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 300 2 0 0 0 8 Electronic 2 95 0 2 7 9
Jazz 1 1 20 0 0 0 MetalPunk 0 0 0 35 10 1 PopRock 1 10 3 7 79 11
World 16 6 3 1 6 93 Total 320 114 26 45 102 122
(d) Classic Electronic Jazz MetalPunk PopRock World
Classic 9375 175 000 000 000 656 Electronic 063 8333 000 444 686 738
Jazz 031 088 7692 000 000 000 MetalPunk 000 000 000 7778 980 082 PopRock 031 877 1154 1556 7745 902
World 500 526 1154 222 588 7623
Conventional methods use the energy of each modulation subband as the
feature value However we use the modulation spectral contrasts (MSCs) and
modulation spectral valleys (MSVs) computed from each modulation subband as
the feature value Table 37 shows the classification results of these two
approaches From Table 37 we can see that the using MSCs and MSVs have
better performance than the conventional method when row-based and
column-based modulation spectral feature vectors are combined In this table
SMMFCC1 SMMFCC2 and SMMFCC3 denote respectively the row-based
column-based and combined feature vectors derived from modulation spectral
analysis of MFCC
51
Table 37 Comparison of the averaged classification accuracy of the MSCampMSV and the energy for each feature value
Feature Set MSCsamp MSVs MSE SMMFCC1 7750 7202 SMMFCC2 7064 6982 SMMFCC3 8038 7915 SMOSC1 7915 7750 SMOSC2 6859 7051 SMOSC3 8134 8011 SMASE1 7778 7641 SMASE2 7174 7106 SMASE3 8121 7915
SMMFCC1+SMOSC1+SMASE1 8464 8508 SMMFCC2+SMOSC2+SMASE2 7860 7901 SMMFCC3+SMOSC3+SMASE3 8532 8519
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral
(OSC and NASE) and cepstral (MFCC) features is proposed for music genre
classification The long-term modulation spectrum analysis is employed to capture the
time-varying behavior of each feature value For each spectralcepstral feature set a
modulation spectrogram will be generated by collecting the modulation spectrum of
all corresponding feature values Modulation spectral contrast (MSC) and modulation
spectral valley (MSV) are then computed from each logarithmically-spaced
modulation subband Statistical aggregations of all MSCs and MSVs are computed to
generate effective and compact discriminating features The music database employed
in the ISMIR2004 Audio Description Contest where all music tracks are classified
into six classes was used for performance comparison If the modulation spectral
features of MFCC OSC and NASE are combined together the classification
accuracy is 8532 which is better than the winner of the ISMIR2004 Music Genre
Classification Contest
52
References
[1] G Tzanetakis P Cook Musical genre classification of audio signals IEEE
Trans on Speech and Audio Processing 10 (3) (2002) 293-302
[2] T Li M Ogihara Q Li A Comparative study on content-based music genre
classification Proceedings of ACM Conf on Research and Development in
Information Retrieval 2003 pp 282-289
[3] D N Jiang L Lu H J Zhang J H Tao L H Cai Music type classification
by spectral contrast feature Proceedings of the IEEE International Conference
on Multimedia amp Expo vol 1 2002 pp 113-116
[4] K West and S Cox ldquoFeatures and classifiers for the automatic classification of
musical audio signalsrdquo Proceedings of International Conference on Music
Information Retrieval 2004
[5] K Umapathy S Krishnan S Jimaa Multigroup classification of audio signals
using time-frequency parameters IEEE Trans on Multimedia 7 (2) (2005)
308-315
[6] M F McKinney J Breebaart Features for audio and music classification
Proceedings of the 4th International Conference on Music Information Retrieval
2003 pp 151-158
[7] J J Aucouturier F Pachet Representing music genres a state of the art Journal
of New Music Research 32 (1) (2003) 83-93
[8] U Bağci E Erzin Automatic classification of musical genres using inter-genre
similarity IEEE Signal Processing Letters 14 (8) (2007) 512-524
[9] A Meng P Ahrendt J Larsen L K Hansen Temporal feature integration for
music genre classification IEEE Trans on Audio Speech and Language
Processing 15 (5) (2007) 1654-1664
53
(d)
             Classic  Electronic  Jazz  MetalPunk  PopRock  World
 Classic        300        2        0       0         0       8
 Electronic       2       95        0       2         7       9
 Jazz             1        1       20       0         0       0
 MetalPunk        0        0        0      35        10       1
 PopRock          1       10        3       7        79      11
 World           16        6        3       1         6      93
 Total          320      114       26      45       102     122

(d) In percent (%)
             Classic  Electronic  Jazz   MetalPunk  PopRock  World
 Classic      93.75      1.75     0.00      0.00      0.00    6.56
 Electronic    0.63     83.33     0.00      4.44      6.86    7.38
 Jazz          0.31      0.88    76.92      0.00      0.00    0.00
 MetalPunk     0.00      0.00     0.00     77.78      9.80    0.82
 PopRock       0.31      8.77    11.54     15.56     77.45    9.02
 World         5.00      5.26    11.54      2.22      5.88   76.23
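The second panel is consistent with normalizing each column of the count panel by its per-class total in the last row (for example, 300/320 = 93.75%). The following minimal numpy check, with the counts copied from the table above (the code itself is illustrative and not part of the original experiments), reproduces the percentage panel:

    import numpy as np

    # Confusion-matrix counts copied from the table above
    # (rows: classified-as genre, columns: actual genre).
    counts = np.array([
        [300,  2,  0,  0,  0,  8],   # Classic
        [  2, 95,  0,  2,  7,  9],   # Electronic
        [  1,  1, 20,  0,  0,  0],   # Jazz
        [  0,  0,  0, 35, 10,  1],   # MetalPunk
        [  1, 10,  3,  7, 79, 11],   # PopRock
        [ 16,  6,  3,  1,  6, 93],   # World
    ])

    totals = counts.sum(axis=0)          # per-genre track counts: [320 114 26 45 102 122]
    percent = 100.0 * counts / totals    # column-wise normalization
    print(np.round(percent, 2))          # e.g. percent[0, 0] -> 93.75, matching the table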
Conventional methods use the energy of each modulation subband as the feature value. In contrast, we use the modulation spectral contrasts (MSCs) and modulation spectral valleys (MSVs) computed from each modulation subband. Table 37 compares the classification results of these two approaches. From Table 37, we can see that using MSCs and MSVs gives better performance than the conventional energy-based features when the row-based and column-based modulation spectral feature vectors are combined. In this table, SMMFCC1, SMMFCC2, and SMMFCC3 denote, respectively, the row-based, column-based, and combined feature vectors derived from modulation spectral analysis of MFCC; the SMOSC and SMASE sets are defined analogously for OSC and NASE.
Table 37  Comparison of the averaged classification accuracy (%) using MSCs & MSVs versus the subband energy (MSE) as the feature values

 Feature Set                  MSCs & MSVs    MSE
 SMMFCC1                         77.50      72.02
 SMMFCC2                         70.64      69.82
 SMMFCC3                         80.38      79.15
 SMOSC1                          79.15      77.50
 SMOSC2                          68.59      70.51
 SMOSC3                          81.34      80.11
 SMASE1                          77.78      76.41
 SMASE2                          71.74      71.06
 SMASE3                          81.21      79.15
 SMMFCC1+SMOSC1+SMASE1           84.64      85.08
 SMMFCC2+SMOSC2+SMASE2           78.60      79.01
 SMMFCC3+SMOSC3+SMASE3           85.32      85.19
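As a rough illustration of how such per-subband contrast features can be extracted, the sketch below assumes a precomputed magnitude modulation spectrogram with one modulation spectrum per feature dimension, and takes the MSC of a modulation subband as the difference between its peak and its valley and the MSV as the valley itself. The subband edges, the simple max/min peak and valley estimators, and all variable names are illustrative assumptions rather than the exact formulation used in this work:

    import numpy as np

    def msc_msv(mod_spec, mod_freqs, band_edges):
        # mod_spec   : (D, K) magnitude modulation spectrogram
        #              (D feature dimensions, K modulation-frequency bins)
        # mod_freqs  : (K,) modulation frequency of each bin, in Hz
        # band_edges : list of (low, high) modulation subband boundaries, in Hz
        # Returns (D, B) matrices of MSCs and MSVs, where B is the number of subbands.
        mscs, msvs = [], []
        for lo, hi in band_edges:
            band = mod_spec[:, (mod_freqs >= lo) & (mod_freqs < hi)]
            peak = band.max(axis=1)      # strongest modulation component in the subband
            valley = band.min(axis=1)    # weakest modulation component in the subband
            mscs.append(peak - valley)   # modulation spectral contrast
            msvs.append(valley)          # modulation spectral valley
        return np.stack(mscs, axis=1), np.stack(msvs, axis=1)

    # Logarithmically spaced modulation subbands (illustrative edges, in Hz).
    edges = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 4.0), (4.0, 8.0), (8.0, 16.0)]

    # Example usage, assuming feats is a (D, T) array of per-frame feature values
    # (e.g. MFCC, OSC, or NASE trajectories) and hop_seconds is the frame hop:
    # mod_spec  = np.abs(np.fft.rfft(feats, axis=1))              # modulation spectrogram
    # mod_freqs = np.fft.rfftfreq(feats.shape[1], d=hop_seconds)  # modulation frequencies
    # mscs, msvs = msc_msv(mod_spec, mod_freqs, edges)

The row-based and column-based feature vectors compared above (e.g. SMMFCC1 and SMMFCC2) would then presumably be obtained by statistically aggregating the resulting (D, B) matrices along one axis or the other, for example with means and standard deviations.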
Chapter 4
Conclusion
A novel feature set derived from modulation spectral analysis of spectral (OSC and NASE) and cepstral (MFCC) features is proposed for music genre classification. Long-term modulation spectrum analysis is employed to capture the time-varying behavior of each feature value. For each spectral/cepstral feature set, a modulation spectrogram is generated by collecting the modulation spectra of all corresponding feature values. Modulation spectral contrast (MSC) and modulation spectral valley (MSV) are then computed from each logarithmically spaced modulation subband, and statistical aggregations of all MSCs and MSVs are computed to generate effective and compact discriminating features. The music database employed in the ISMIR2004 Audio Description Contest, in which all music tracks are classified into six classes, was used for performance comparison. When the modulation spectral features of MFCC, OSC, and NASE are combined, the classification accuracy reaches 85.32%, which is better than that of the winner of the ISMIR2004 Music Genre Classification Contest.
References
[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 3, pp. 293-302, 2002.
[2] T. Li, M. Ogihara, and Q. Li, "A comparative study on content-based music genre classification," in Proceedings of the ACM Conference on Research and Development in Information Retrieval, pp. 282-289, 2003.
[3] D. N. Jiang, L. Lu, H. J. Zhang, J. H. Tao, and L. H. Cai, "Music type classification by spectral contrast feature," in Proceedings of the IEEE International Conference on Multimedia & Expo, Vol. 1, pp. 113-116, 2002.
[4] K. West and S. Cox, "Features and classifiers for the automatic classification of musical audio signals," in Proceedings of the International Conference on Music Information Retrieval, 2004.
[5] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Transactions on Multimedia, Vol. 7, No. 2, pp. 308-315, 2005.
[6] M. F. McKinney and J. Breebaart, "Features for audio and music classification," in Proceedings of the 4th International Conference on Music Information Retrieval, pp. 151-158, 2003.
[7] J. J. Aucouturier and F. Pachet, "Representing music genres: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[8] U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Processing Letters, Vol. 14, No. 8, pp. 512-524, 2007.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 5, pp. 1654-1664, 2007.
[10] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proceedings of the 6th International Conference on Music Information Retrieval, pp. 34-41, 2005.
[11] M. Grimaldi, P. Cunningham, and A. Kokaram, "A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 102-108, 2003.
[12] J. J. Aucouturier, F. Pachet, and M. Sandler, "The way it sounds: timbre models for analysis and retrieval of music signals," IEEE Transactions on Multimedia, Vol. 7, No. 6, pp. 1028-1035, December 2005.
[13] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proceedings of the 6th International Conference on Digital Audio Effects, pp. 8-11, September 2003.
[14] J. G. A. Barbedo and A. Lopes, "Automatic genre classification of musical signals," EURASIP Journal on Advances in Signal Processing, Vol. 2007, pp. 1-12, June 2006.
[15] T. Li and M. Ogihara, "Music genre classification with taxonomy," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp. 197-200, March 2005.
[16] J. J. Aucouturier and F. Pachet, "Representing musical genre: a state of the art," Journal of New Music Research, Vol. 32, No. 1, pp. 83-93, 2003.
[17] H. G. Kim, N. Moreau, and T. Sikora, "Audio classification based on MPEG-7 spectral basis representation," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 5, pp. 716-725, 2004.
[18] M. E. P. Davies and M. D. Plumbley, "Beat tracking with a two state model," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[19] W. A. Sethares, R. D. Morris, and J. C. Sethares, "Beat tracking of musical performances using low-level audio features," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 2, pp. 275-285, 2005.
[20] G. Tzanetakis, A. Ermolinskyi, and P. Cook, "Pitch histograms in audio and symbolic music information retrieval," in Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2002.
[21] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 6, pp. 708-716, November 2000.
[22] R. Meddis and L. O'Mard, "A unitary model of pitch perception," Journal of the Acoustical Society of America, Vol. 102, No. 3, pp. 1811-1820, September 1997.
[23] N. Scaringella, G. Zoia, and D. Mlynek, "Automatic genre classification of music content: a survey," IEEE Signal Processing Magazine, Vol. 23, No. 2, pp. 133-141, March 2006.
[24] B. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, Vol. 25, No. 1, pp. 117-132, 1998.
[25] S. Sukittanon, L. E. Atlas, and J. W. Pitton, "Modulation-scale analysis for content identification," IEEE Transactions on Signal Processing, Vol. 52, No. 10, pp. 3023-3035, October 2004.
[26] Y. Y. Shi, X. Zhu, H. G. Kim, and K. W. Eom, "A tempo feature via modulation spectrum analysis and its application to music emotion classification," in Proceedings of the 2006 IEEE International Conference on Multimedia and Expo (ICME), pp. 1085-1088, July 2006.
[27] H. G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval, Wiley, 2005.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, New York: Wiley, 2000.
[29] C. Xu, N. C. Maddage, and X. Shao, "Automatic music classification and summarization," IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp. 441-450, May 2005.
[30] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-665-668, May 2004.
[31] K. Umapathy, S. Krishnan, and R. K. Rao, "Audio signal feature extraction and classification using local discriminant bases," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, pp. 1236-1246, May 2007.
[32] M. Grimaldi, P. Cunningham, and A. Kokaram, "An evaluation of alternative feature selection strategies and ensemble techniques for classifying music," in Proceedings of the Workshop on Multimedia Discovery and Mining, 2003.
[33] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, Vol. 65, No. 2-3, pp. 473-484, 2006.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 119-139, 1997.