Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University
Master Thesis

Learning Emotion Transitions for a Personalized Playlist Recommender
(基於學習音樂情緒轉變之個人化播放清單推薦系統)

Chung-Yi Chi (紀忠毅)

Advisors: Jane Yung-jen Hsu, Ph.D.
          Chih-Wen Hsueh, Ph.D.

June 2009
Acknowledgments
A hundred years span thirty-six thousand days; night after night, one should hold a candle. Two fulfilling years of research life have yielded a rich body of results. Doubt and setbacks never ceased, and I am grateful to the many close friends whose company and support along the way helped me overcome them one by one.

Thanks to my friends 雅晴, 曜安, 炳傑, and 鴻銘, with whom I could share the frustrations of research and sharpen my thinking, and who offered constant companionship, concern, and advice on my experiments.

Thanks to everyone in the lab for keeping me surrounded by an atmosphere of research and good cheer. Thanks to 婉容 for her long-standing assistance in smoothly resolving every difficulty. Thanks to 翰文 for his help and patience since our undergraduate days, never tiring of helping me solve any problem. Thanks to the iPlayr research group, 映嫻, 薇蓉, and 居正, for their tireless discussion and suggestions every Friday evening. Thanks to 冠鋆 and Todd for helping revise the English wording and making this thesis more polished. Thanks to all the classmates who helped collect the experimental data: 嘉涓, 啟嘉, 于晉, 皓遠, 琮傑, 中川, 庭嫣, 守壹, and 彥伶.

Thanks to my advisors, Professor Jane Yung-jen Hsu and Professor Chih-Wen Hsueh, who provided tremendous help in my research, clarifying my goals when I lost direction and pointing the way when I hit bottlenecks, and who offered sound advice in life, showing me how to face an unpredictable future and inner uncertainty with an optimistic, dedicated attitude. Thanks to Professor Chun-Nan Hsu of Academia Sinica, Professor Ying-Ping Chen of National Chiao Tung University, Professor Tzong-Han Tsai of Yuan Ze University, and Professor Ja-Ling Wu of National Taiwan University for serving on my oral examination committee and offering many valuable suggestions during the defense.

Without learning, one cannot broaden one's talent; without resolve, one cannot accomplish learning. May I bring passion, effort, and persistence to everything that lies ahead. I dedicate this work to my family and to everyone who cares about me.
Abstract
The digitization and online distribution of music in the Internet era have made an enormous volume of digital music accessible and have diversified the ways in which consumers explore it. In particular, an emerging trend in music exploration is to organize and search for songs according to their emotions. However, research on Automatic Playlist Generation (APG) primarily focuses on leveraging traditional metadata and audio similarity for recommendation. Moreover, mainstream solutions view APG as a static problem.
This thesis argues that the APG problem is better modeled as a continuous opti-
mization problem, and proposes an adaptive preference model for personalized APG
based on emotions. The main idea is to collect a user's behavior in music playing (e.g.,
rating, skipping and replaying) as immediate feedback in learning the user's preferences
for music emotion within a playlist.
Reinforcement learning is adopted to learn the user's current preferences, which
are used to generate personalized playlists. Learning parameters are tuned via simula-
tion of two hypothetical users. Several evaluation metrics are defined to measure the
performance of our approach. A two-month user study is conducted to evaluate the
APG solutions. The results show that in most of the evaluation metrics the proposed approach presents superior performance in comparison with the baseline approach.
Keywords: Automatic Playlist Generation, Music Recommender System, Music
Emotion Estimation, Reinforcement Learning, Machine Learning
Contents

Acknowledgments
Abstract
List of Figures
List of Tables

Chapter 1  Introduction
    1.1  Motivation
    1.2  Objectives
    1.3  Thesis Structure

Chapter 2  Related Work
    2.1  Music Emotion Models
    2.2  Music Emotion Estimation
    2.3  Automatic Playlist Generation
         2.3.1  Related Products
    2.4  Reinforcement Learning

Chapter 3  Emotion-Based Personalized APG
    3.1  Problem Definition
    3.2  Proposed Solution
         3.2.1  MEonPlay Automatic Playlist Recommender
         3.2.2  APG as a Reinforcement Learning Problem

Chapter 4  Emotion-Based Adaptive Preference Model
    4.1  Annotated Music Dataset
         4.1.1  Songs
         4.1.2  Participants
         4.1.3  Music Emotion Model
         4.1.4  Annotation Process
         4.1.5  Usage of the POP500 Dataset
    4.2  Preference Modeling with Reinforcement Learning
         4.2.1  Modeling States and Actions
         4.2.2  Designing the Reward Function
         4.2.3  Solving APG with Temporal-Difference Learning
         4.2.4  Parameter Selection with Simulation

Chapter 5  Experimental Evaluation
    5.1  The Participants
    5.2  Experiment Design
    5.3  Evaluation Metrics
         5.3.1  Miss Ratio
         5.3.2  Miss-to-Hit(k)
         5.3.3  Listening-Time Ratio
         5.3.4  User Rating
    5.4  Evaluation Results

Chapter 6  Conclusion
    6.1  Summary of Contributions
    6.2  Future Work

Bibliography
List of Figures

2.1  James Russell's two-dimensional (valence-arousal) model of emotions
2.2  Three steps to create playlists with iTunes Genius
2.3  A snapshot of iTunes Genius
2.4  A snapshot of Mirage
2.5  A snapshot of Tangerine!
2.6  Selecting BPM, beat intensity range, and playlist duration in Tangerine!
2.7  Selecting a workout pattern in Tangerine!
3.1  A sample user scenario
3.2  The system architecture of the MEonPlay Automatic Playlist Recommender
3.3  The agent-environment interaction in reinforcement learning
4.1  The user interface for annotation in the three sessions
4.2  Four emotion classes categorized by the corresponding quadrants on the V-A plane
4.3  State and action definition
4.4  Learning curve of HU-1 (simple case)
4.5  Learning curve of HU-2 (complicated case)
5.1  Histograms of the episode count of each user in the training phase and testing phase, respectively
5.2  A snapshot of the experiment interface
5.3  The Miss Ratio of the Shuffle, SARSA, and Q-Learning methods
5.4  The Miss-to-Hit(20) of the Shuffle, SARSA, and Q-Learning methods
5.5  The Continuous Play of the Shuffle, SARSA, and Q-Learning methods
5.6  The Miss-to-Hit Ratio of User 1 and User 2 under the Working and Leisure contexts, respectively
List of Tables

3.1  Summary of the notations
4.1  Number of songs and percentage across the four emotion classes (refer to Figure 4.2)
4.2  Average reward per episode (normalized to the range [0, 100]) with different λ for both HUs and both methods
5.1  The mean and standard deviation of the Listening-Time Ratio of the Shuffle, SARSA, and Q-Learning methods
5.2  The mean and standard deviation of the User Rating of the Shuffle, SARSA, and Q-Learning methods; scoring range: [1, 5]
Chapter 1
Introduction
The sweet and passionate melody captivated his heart from the first note;
it was full of radiance, full of the tender throbbing of inspiration and hap-
piness and beauty, continually growing and melting away; it rumoured of
everything on earth that is dear, secret and sacred to mankind; it breathed
of immortal sadness and it departed from the earth to die in the heavens.
Ivan Turgenev, Home of the Gentry
1.1 Motivation
Thanks to the advance of the Internet and storage technology, the availability of low-
cost music has massively increased in recent years. The ways in which people listen to and create music have been altered by several trends, including the shift from physical media to digital media and the dramatic drop in the cost of producing and sharing new music in the Web 2.0 era. Today there are over 10 million tracks available
on Apple's iTunes Store, with more than 6 billion songs sold since the service began on April 28, 2003.¹ Also, on peer-to-peer (P2P) networks there are more than 15 billion tracks available for download. In the future these figures will undoubtedly seem minuscule. All music will be online, billions of new tracks will be available, and millions of new arrivals will pour in every day. The resulting problem is that finding the right music is difficult. Moreover, an analysis of 5,600 iPod users² has shown that 80% of plays came from 23% of songs, and that 64% of the songs on their iPods were never played. This phenomenon corresponds to the Long Tail, a term introduced by Chris Anderson in 2004³ to describe the niche strategy of businesses that sell a large number of unique items, each in relatively small quantities. The issues of the music long tail are further explained in [9]. Thus, there is an obvious need to assist people, not only consumers but also producers of music, in filtering, discovering, personalizing, and recommending music from the ever-growing sea of song tracks.
To spare users the tedious and time-consuming manual selection of music, automatic playlist generation (APG) has become a mainstream research topic. Current music players provide various methods for exploring and searching a personal music collection: users can search their song collections by specifying certain song properties, including genres, artists, and albums. A novel and natural trend for music exploration is organizing or searching by song emotion. The beautiful quotation at the beginning of this chapter is from the novel Home of the Gentry by Ivan Turgenev. The protagonist of the novel is touched to the very depths of his soul by a piece
¹ http://www.techcrunch.com/2009/01/06/itunes-sells-6-billion-songs-and-other-fun-stats-from-the-philnote/
² http://blogs.sun.com/plamere/entry/what_s_on_your_ipod
³ http://www.wired.com/wired/archive/12.10/tail.html
of music being played on the piano. The elegant passage eloquently reveals the mystical relationship between music and the human mind. Treatment of music emotion is not limited to 19th-century literature, however. Our requirements for daily music listening actually indicate the need for emotion-based music exploration. Imagine that after a tiring workday, Johnny, an exhausted engineer, first looks for some relaxing music, and later would like to enjoy some upbeat music while planning a weekend trip. In addition, music emotion preferences may also change over time. For example, due to an approaching project deadline, Johnny might prefer calmer music to help him concentrate on work, even though he would normally prefer exhilarating music when not faced with a deadline.
With the advance of Emotion-based Music Information Retrieval (EMIR), emotion-based personal playlist generation has become an emerging research field. However, little APG research takes music emotion into consideration. Most implementations create playlists satisfying certain constraints given by users and systems, e.g., the temporal order of metadata or audio similarity [7, 6, 25] and temporal structure matching based on patterns of previous cases [8]. Moreover, typical APG systems view the problem as a static recommendation process and do not take the temporal effect of user preferences into consideration. Thus, current approaches to APG and music recommendation may not provide an ideal solution.
1.2 Objectives
This thesis aims to provide a novel emotion-based personalized playlist recommender,
with two major objectives. One objective is to leverage music emotion as a major di-
mension for music exploration and recommendation. An investigation is required to
explore the application of music emotion. The second objective is to provide person-
alized automatic playlist generation. Here personalized APG consists of two require-
ments: (1) The recommendation should be tailored according to an individual user's
preferences. (2) The recommendation should be adaptive to the individual user's pref-
erences as they change over time. By incorporating these aspects into our emotion-
based personalized playlist recommender, we expect to attain a degree of improvement
over current APG services.
1.3 Thesis Structure
The rest of the thesis is organized as follows: The next chapter provides an overview
of related work, including music emotion models, music emotion estimation, automatic playlist generation, and reinforcement learning. Chapter 3 explicitly defines the problem and proposes our solution. Chapter 4 presents how we model APG as a reinforcement learning problem and solve it with temporal-difference learning methods; the use of simulation to tune learning parameters is also described there. Chapter 5 presents the results of an experimental evaluation conducted as a two-month user study. The last chapter summarizes our work with a
conclusion and provides suggestions for future work.
Chapter 2
Related Work
In this chapter, we give an overview of related work on several topics: music emotion models, music emotion estimation, automatic playlist generation, and reinforcement learning.
2.1 Music Emotion Models
Previous research has established a strong relationship between music and emotion
[14]. There are two aspects of how humans perceive emotion from music. (1) Ob-
jective: music can convey universal emotions [12]. (2) Subjective: music can evoke
different emotions for different people [15]. Emotion can be modeled either categorically or as a vector. A categorical representation maps emotion into a number of discrete classes; for example, Paul Ekman's model of basic emotions includes anger, fear, sadness, happiness, and disgust [10]. The categorical model is the most commonly used because of its clear definition and goal-oriented nature. However, the inflexibility
of scaling the emotional coverage is the major limitation of the categorical model. In
contrast, the vector model represents emotions as points on a vector plane. Figure 2.1
illustrates the concept of James Russell's valence-arousal model of emotion [27]. The
plane is divided into quadrants: (1) positive valence with high arousal (Pos-High), (2)
negative valence with high arousal (Neg-High), (3) negative valence with low arousal
(Neg-Low), and (4) positive valence with low arousal (Pos-Low). A vector model has
the advantage of eliminating ambiguity between categories. For example, the ambigu-
ous meaning of cheerful and carefree may lower annotation consistency. In addition,
annotators may freely rate a song on a mood plane without being limited by pre-defined
tags [34].
Figure 2.1: James Russell's two-dimensional (valence-arousal) model of emotions. The horizontal axis (valence) represents pleasure/displeasure; the vertical axis (arousal) represents low/high arousal.
2.2 Music Emotion Estimation
Music Emotion Estimation (MEE) is essential for music recommendation/retrieval
based on matching the emotion of music with that of the user. Most current MEE
systems adopt audio-based approaches [17, 16, 33, 34]. In general, there is a lack of publicly available datasets for comparing system performance. The Computer Audition Lab 500-song (CAL500) dataset was released by Turnbull in 2007 [32]; thirty-six emotion tags were defined and used to annotate the songs. The CAL500 dataset was used to train a Supervised Multi-class Labeling (SML) model, which assigns multiple pre-defined tags to an input song with a Mean Average Precision (MAP) of 50.6%.
CAL500 has been widely used in training and evaluating MEE systems.
2.3 Automatic Playlist Generation
The problem of playlist generation was first introduced by Pachet et al. [19]. According
to the type of interaction with the user, current approaches of Automatic Playlist Gen-
eration (APG) can be categorized into two major forms: hint-based playlist generation
and constraint-based playlist generation. In the hint-based playlist generation approach,
the user is required to specify one or more songs as seed songs. The playlist is then
generated to consist of songs similar to the seed songs. Different kinds of similarity
measures can be used in this approach, including metadata-based similarity [21, 24, 26]
and audio-based similarity [18, 20]. The major limitation of hint-based playlist generation is that it tends to produce a highly uniform list of songs that may not be favored by every user, especially when the playlist is long. In the constraint-
based playlist generation approach, the user is allowed to explicitly specify the con-
straints that the generated playlist has to satisfy, mainly in three aspects: songs, order,
and length. Rich metadata are used in [7, 4, 3, 5] to describe user-specified constraints,
e.g., fast tempo, or at least 40% country pop music. Audio similarity is also included as
another constraint in [6, 11], e.g., timbre continuity throughout the playlist and smooth
transition from the first song to the last song. Several formulations of this problem have been proposed, including linear programming [4, 3], simulated annealing [22, 23], and the traveling salesman problem [25].
2.3.1 Related Products
iTunes Genius
Apple iTunes Genius¹ was a new feature first introduced in iTunes 8 to recommend
playlists. Two major functionalities are provided in iTunes Genius: Genius Playlist
and Genius Sidebar.
Genius Playlist, as implied in the name, is a method to create playlists. Select a
song, click the Genius button, and iTunes generates a playlist of songs from your library
that complement that song. In this way, Genius playlists help users to discover songs
in their library they never knew they had, and rediscover forgotten favorites. Figure
2.2 illustrates the three steps involved in creating Genius playlists.
Genius Sidebar is a method to introduce users to new music, movies, and TV shows.
Select a song, movie, or show in your library, and the Genius sidebar will display rec-
¹ http://www.apple.com/itunes/features/#genius
Figure 2.2: Three steps to create playlists with iTunes Genius: (1) Select a song. (2) Click the Genius button. (3) Genius creates a playlist.
ommendations from the iTunes Store which complement the selection. The Genius
sidebar will not recommend anything already in your library, and you can preview and
buy recommended items directly from the sidebar. Figure 2.3 is a snapshot of iTunes
Genius. In the middle of the iTunes window is the generated playlist based on the song
The Rose by Bianca Ryan. At the top-right region, users can either change the length of
the playlist (limited to 25, 50, 75, or 100 songs), refresh, or save the playlist. The Ge-
nius Sidebar is located at the right side of the window, and displays recommendations
from the iTunes Store, including music, movies, and TV shows.
MusicIP MyDJ
The MusicIP² MyDJ Plug-in, which works with several existing music players (in-
cluding iTunes, Windows Media Player, and WINAMP), instantly generates playlists
based on the selected song. The MyDJ Plug-in anonymously analyzes users' songs for
² http://www.musicip.com/
Figure 2.3: A snapshot of iTunes Genius. In the middle of the window is the Genius Playlist generated from the selected song. The Genius Sidebar, located at the right side of the window, displays recommendations from the iTunes Store that complement the selected song.
acoustic attributes and defining characteristics, allowing it to make intelligent mixes
and recommendations that go beyond genre classifications and editorial review. Once
analysis is complete, the music is ready for playlisting. A user may select a song to generate a playlist of similar music from his or her library. While users are enjoying com-
pelling music combinations and rediscovering all the great music they already have,
MyDJ will introduce users to new music based on the songs they love most. MyDJ
also provides Software Development Kits (SDKs) for digital devices, music software, and websites to automatically generate playlists and other features.
MoodLogic
MoodLogic³ was one of the first online music recommendation systems. The com-
pany obtained ratings on over 1 million songs by over 50,000 distinct listeners as part
of its proprietary method for modeling user preference space. In addition to their web
presence, the company created a software application that uses a central database to
allow users to collaboratively profile music by mood. Each user has a certain num-
ber of "credits" they can use to identify song profiles. Credits could be obtained by
either paying for them or profiling songs. This software allowed the user to generate mood-based playlists according to the user's current mood. The program could also mix
a playlist based on a selected song. This would return a playlist with songs of sim-
ilar tempo, mood, genre, etc. MoodLogic was acquired by All Media Guide (AMG),
the company that runs allmusic.com, in May 2006.⁴ It is not yet clear whether AMG intends to resume development and reactivate the community. As of March 2008, the MoodLogic site resolves to Macrovision's site. Effective March 3, 2008, Macrovision announced the End of Life (EOL) of the MoodLogic music management and recommendation software; the service was discontinued due to the intensive operational and infrastructure resources required to sustain the application. Macrovision's efforts in music recommendation continue through the AMG Data Services Tapestry business-to-business product.⁵
³ http://www.moodlogic.com/
⁴ http://www.cbronline.com/article_news.asp?guid=A47A2ACA-5830-4C10-B8A1-872B5FD4D6CF
⁵ http://en.wikipedia.org/wiki/MoodLogic
Mirage
Figure 2.4: A snapshot of Mirage.
Mirage⁶, an automatic playlist generation extension for Banshee, is an implementation of research in APG and music similarity. Mirage analyzes a user's music collection
and computes acoustic similarity models for each song. After the user's music collec-
tion has been analyzed, Mirage is able to automatically generate playlists of similar
music. In this project, Mirage was integrated into the popular GNOME audio player Banshee.⁷
Music information retrieval techniques are used in Mirage to compute a similarity
model for each song. This process includes the computation of the Fast Fourier Transform (FFT), Mel-Frequency Cepstral Coefficients (MFCCs) for psycho-acoustic modeling, and a multidimensional Gaussian Mixture Model (GMM) to finally represent a song with a timbre/similarity model. After the whole music collection is analyzed,
⁶ http://hop.at/mirage/
⁷ http://www.banshee-project.org/
each song has a similarity/timbre model attached to it. Users can then generate playlists by selecting a song (the seed song) they want the playlist to begin with, and Mirage searches all its models for similar songs. To do so, the Gaussian models computed in the previous step are compared using an optimized Kullback-Leibler divergence.
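To make this pipeline concrete, the sketch below computes a single-Gaussian timbre model from MFCCs and compares two songs with a symmetrized Kullback-Leibler divergence. It is our illustration of the general technique, not Mirage's actual implementation: the librosa library is assumed for audio loading and MFCC extraction, and the full GMM is simplified to a single Gaussian.

import numpy as np
import librosa  # assumed available for audio loading and MFCC extraction

def timbre_model(path, n_mfcc=20):
    """Represent a song as a Gaussian (mean, covariance) over its MFCC frames."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    mu = mfcc.mean(axis=1)
    cov = np.cov(mfcc) + 1e-6 * np.eye(n_mfcc)  # regularize for invertibility
    return mu, cov

def kl_gaussian(m0, c0, m1, c1):
    """Closed-form KL divergence between two multivariate Gaussians."""
    d = len(m0)
    c1_inv = np.linalg.inv(c1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(c0)
    _, logdet1 = np.linalg.slogdet(c1)
    return 0.5 * (np.trace(c1_inv @ c0) + diff @ c1_inv @ diff - d + logdet1 - logdet0)

def timbre_distance(song_a, song_b):
    """Symmetrized KL divergence: smaller values mean more similar timbre."""
    ma, ca = timbre_model(song_a)
    mb, cb = timbre_model(song_b)
    return kl_gaussian(ma, ca, mb, cb) + kl_gaussian(mb, cb, ma, ca)

The symmetrization is needed because KL divergence is not symmetric, while a similarity search expects a distance-like quantity.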
Tangerine!
Figure 2.5: A snapshot of Tangerine!.
Tangerine!⁸ is a playlist generation tool for workouts by Potion Factory⁹ (Figure 2.5). It uses several song qualities: beats per minute (BPM), beat intensity, and personal song ratings. The user specifies the desired range of BPM and intensity and how long the playlist should last, as shown in Figure 2.6.

⁸ http://www.potionfactory.com/tangerine/
⁹ http://www.potionfactory.com/
Figure 2.6: Selecting BPM, beat intensity range, and playlist duration in Tangerine!.
Tangerine then automatically generates a random assortment of songs that meet the criteria. The user is also allowed to select workout patterns. For example, one can ramp up the workout and then cool down, use a series of high-intensity songs mixed with less-intense resting intervals, or just pick a random selection of songs with roughly the same characteristics, as shown in Figure 2.7.
Figure 2.7: Selecting a workout pattern in Tangerine!.
After setting the criteria, Tangerine generates a playlist and displays it for user in-
spection. It also provides close iTunes integration, including the ability to load and save
playlists to iTunes, get album art from iTunes, and export BPMs to iTunes.
2.4 Reinforcement Learning
Reinforcement learning in computer science, inspired by psychological theory, is a
sub-area of machine learning concerned with how an agent ought to take actions in
an environment so as to maximize some notion of long-term reward. Reinforcement
learning algorithms attempt to find a policy that maps states of the world to the actions
the agent ought to take in those states. Thus, reinforcement learning is particularly
well suited to problems which include a long-term versus short-term reward trade-off.
It has been applied successfully to various problems, including robot control, elevator
scheduling, telecommunications, backgammon and chess [30].
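The interaction loop described above can be sketched in a few lines. The following minimal illustration is ours, not taken from any of the cited systems; env and agent are assumed interfaces for the environment and the learning agent.

def run_episode(env, agent):
    """One episode of the agent-environment interaction loop.

    env.reset() returns an initial state; env.step(action) returns
    (next_state, reward, done). agent.act and agent.learn are assumed
    interfaces for the policy and the update rule, respectively.
    """
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                        # policy: state -> action
        next_state, reward, done = env.step(action)      # environment transition
        agent.learn(state, action, reward, next_state)   # update value estimates
        state = next_state
        total_reward += reward
    return total_reward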
In addition to typical approaches (e.g., content-based filtering and collaborative fil-
tering) for recommender systems [2], reinforcement learning has been used for recom-
mendation in several applications. WebWatcher [13] is a Web tour guide that exploits
Q-Learning to guide users to their desired pages: pages correspond to states and hyperlinks to actions, and rewards are computed based on the similarity between the page content and user profile keywords. Reinforcement learning is also used for online information filtering in [35], which maintains a profile for each user containing keywords of interest and updates each word's weight according to the implicit and explicit feedback received from the user. The proposed
learning method showed superior performance in information quality and adaptation
speed to user preference in online filtering. A general framework is presented in [1],
which consists of a database of recommendations generated by various models and a
learning module that updates the weight of each recommendation by user feedback. In
[29], a travel recommendation agent is introduced that considers various attributes of trips and customers, computes each trip's value with a linear function, and updates the function coefficients after receiving each user's feedback. The recommendation problem
is modeled as a Markov Decision Process (MDP) in [28]: the system's states correspond to the user's previous purchases, rewards are based on the profit achieved by selling the items, and recommendations are made using MDP theory and a novel
state-transition function. A reinforcement learning approach for usage-based web rec-
ommendation is presented in [31]. Instead of using the static patterns discovered from
web usage data, it learns to make recommendations based on the actions it performs
in each situation. The problem is solved with Q-Learning while employing concepts
commonly applied in the web usage mining domain.
Chapter 3
Emotion-based Personalized
Automatic Playlist Generation
The goal of our emotion-based personalized automatic playlist generation is to learn a
user's preference about music emotion from his/her listening behavior, and to recom-
mend appropriate songs in a playlist. In the following sections, we explicitly define
our problem and present the proposed solution.
3.1 Problem Definition
We start the chapter by defining the terminology. S denotes the system assigned to solve the emotion-based personalized APG problem. U denotes the user who interacts with system S. The music collection M = {m_1, m_2, ..., m_n} is a finite set of songs. An episode is defined as a one-time listening period which starts with opening the music player, continues with subsequently played songs, and ends with closing the music
player. The user operation set O = {replay, skip, rate} is a finite set of operations user U can perform during an episode. Table 3.1 gives a complete summary of the notations.
Notation     Definition                      Description
S                                            The system
U                                            The user
M            M = {m_1, m_2, ..., m_n}        The music collection
m            m ∈ M                           A song in the music collection M
O            O = {replay, skip, rate}        The user operation set
o            o ∈ O                           A user operation in the set O
e            e = (o, t)                      An operation entry
l            l = (m, t_s^m, t_e^m, Q)        A listening log
Q            Q = {e_1, e_2, ..., e_n}        A finite set of operation entries
H_t          H_t = (l_1, l_2, ..., l_k)      The listening history at time t
util(m_t)                                    The utility function

Table 3.1: Summary of the notations.
Definition 1. Operation Entry
An operation entry is a pair e = (o, t), where o ∈ O is the user operation, and t is the time of performing the user operation o.
Definition 2. Listening Log
A listening log is a quadruple l = (m, t_s^m, t_e^m, Q), where m ∈ M denotes the played song, t_s^m denotes the starting time of playing song m, t_e^m denotes the ending time of playing song m, and Q = {e_1, e_2, ..., e_n} denotes the finite set of operation entries.
For each individual playing of a song in an episode, a corresponding listening log
is recorded. The set of operation entries, Q, can contain zero or more entries.
Figure 3.1: A sample user scenario.
Definition 3. Listening History
A listening history H_t = (l_1, l_2, ..., l_k) at time t is a temporally ordered sequence of past listening logs, where l_i.t_s^m ≤ l_j.t_s^m ≤ t for all i ≤ j ≤ k.
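Definitions 1-3 translate directly into simple record types. The following sketch is our illustration; the class and field names are hypothetical, not part of the thesis.

from dataclasses import dataclass, field
from typing import List

@dataclass
class OperationEntry:
    """e = (o, t): a user operation and the time at which it was performed."""
    operation: str  # one of "replay", "skip", "rate"
    time: float

@dataclass
class ListeningLog:
    """l = (m, t_s, t_e, Q): one playing of song m with its operation entries."""
    song_id: str
    start_time: float
    end_time: float
    operations: List[OperationEntry] = field(default_factory=list)

@dataclass
class ListeningHistory:
    """H_t: listening logs kept in temporal order of their start times."""
    logs: List[ListeningLog] = field(default_factory=list)

    def add(self, log: ListeningLog) -> None:
        # Enforce the temporal ordering required by Definition 3.
        assert not self.logs or self.logs[-1].start_time <= log.start_time
        self.logs.append(log)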
In this thesis, we assume the interaction between the user U and our APG system S in a listening episode follows this scenario:

1. An episode starts with an initial song, which can be either user-specified or system-generated.

2. S dynamically plays the next recommended song m_i ∈ M in the playlist.

3. During the episode, U can perform any of the three operations o_j ∈ O to reveal his/her preference about the currently played song.

4. Each listening log l is recorded in the listening history H immediately.
Figure 3.1 illustrates a sample scenario.
In order to measure the effectiveness of a recommended song m_t by system S, the notion of a utility function should be defined. Different kinds of utility functions can be
Figure 3.2: The system architecture of the MEonPlay Automatic Playlist Recommender.
defined for effectiveness measurement (to be elaborated in Section 5.3).
Definition 4. Utility Function
Given a recommended song m_t at time t, the utility function util(m_t) is used to measure the effectiveness of the recommendation.
The specific problem our system is assigned to solve can be defined as follows:
Definition 5. The Problem
Given the music collection M, at time t, according to user U's current listening history H_t, system S has to dynamically generate the next song m_t ∈ M such that the utility function util(m_t) is optimized.
3.2 Proposed Solution
3.2.1 MEonPlay Automatic Playlist Recommender
In this thesis, we propose a system named the MEonPlay¹ Automatic Playlist Recommender to provide personalized automatic playlist generation on the basis of users' preferences for music emotion. Figure 3.2 illustrates the system architecture.
MEonPlay contains four major components: the Music Emotion Model (MEM), the Listening Log Repository (LLR), the User Preference Model (UPM), and the Emotion-based User Interface (EUI).
The Music Emotion Model (MEM) maintains the function f_me : M → E, which maps songs to their corresponding emotions. Given a song m ∈ M, the MEM is used to estimate the emotion of the song. Russell's two-dimensional (valence-arousal) model of emotions is adopted here to represent the emotion space E. The Listening Log Repository (LLR) maintains the user's current listening history H; the user's subsequent listening logs are added to H immediately.
Given the MEM and LLR, the User Preference Model (UPM) maintains the function f_hm : H_t → M, which maps the listening history at time t, H_t, to a recommended song m. It aims to approximate user U's real mapping f*_hm : H_t → M, which is unknown to our system S.
In addition to the existing graphical user interface of music players, MEonPlay
¹ The name MEonPlay is an abbreviation of "Music Emotion on Play", in which "ME" also involves the notion of "personalization".
provides an alternative navigation interface, the Emotion-based User Interface (EUI), to allow users to explore their music collection along the emotion dimension. The EUI provides a two-dimensional navigation plane which corresponds to Russell's valence-arousal model. The interaction between the user and the system is accomplished through the EUI, including user queries, system recommendations, and so on.
3.2.2 APG as a Reinforcement Learning Problem
As mentioned in Section 1.2, the objective of our system is to provide personalized automatic playlist generation. In our formulation, this means learning the mapping f_hm : H_t → M in the UPM to approximate the user's real preference f*_hm : H_t → M. In other words, our system aims to learn how to generate the right song m_t at time t, given the user's current listening history H_t.
Unlike the typical classification problem, supervised learning cannot be applied in
this case, since we do not have the exact answer in every situation or even a static
one. The listening history Ht will evolve over time. Therefore, more importantly, our
system has to learn how to recommend a song under a certain situation (that is, the
listening history H_t at time t) so that the long-term results of recommendation (e.g., an
entire episode) can be optimized. Moreover, in the scenario of our APG problem, the
user's feedback (replay, skip, or rate) serves as a natural indicator to guide and reinforce
learning. Thus, we adopt reinforcement learning for learning the user's preference.
As stated in Section 2.4, reinforcement learning [30] is a research field of machine learn-
ing which aims to learn which actions an agent should take in an environment so as to
maximize long-term reward. In each step, the agent takes an action a and transitions
Figure 3.3: The agent-environment interaction in reinforcement learning.
from a state s to another state s′. After each transition, the agent receives a reward.
Figure 3.3 diagrams the concept of the agent-environment interaction in reinforcement
learning. The agent aims to learn a policy that defines which action should be taken in
each state in order to receive the greatest cumulative reward along the path to the
goal state. Instead of static recommendation, we argue that it is more appropriate to
consider the APG problem as a sequential optimization problem, as the sequential na-
ture of the recommendation process takes into account not only the utility of a particular
recommendation but also the long-term reward [28]. In particular, because our APG system is concerned with learning how to make recommendations in each situation according to the user's feedback, reinforcement learning provides appropriate learning
methods. In the following chapter, we formulate the APG task as a reinforcement learn-
ing problem and illustrate the states, actions, reward function, and the training process
in our formulation.
Chapter 4
Emotion-Based Adaptive Preference
Model
To provide a personalized solution under the scenario described in the previous chap-
ter, we argue that it is more appropriate to consider APG as a continuous optimization
problem. In this chapter, we first describe the dataset used in the Music Emotion Model (MEM). We then propose an emotion-based adaptive preference model that adopts reinforcement learning methods to learn the User Preference Model (UPM) of music emotion within a playlist. Hypothetical users are created for fine-tuning the learning parame-
ters in our model.
4.1 Annotated Music Dataset
In this section, we describe the compilation of the annotated music dataset named
POP500, and its specific usage in our model.
4.1.1 Songs
The POP500 dataset contains 526 Chinese pop songs by 76 artists released between
2002 and 2008. We separately collect each song's audio track and lyrical text, which
are used in the annotation procedure (to be elaborated later).
4.1.2 Participants
More than 400 participants contribute to the annotation of the POP500 dataset. Most of
them are 20 to 30-year-old college students who major in diverse disciplines. All the
participants are pre-trained to fully understand the experiment for annotation, including
the emotion model used and the annotation process (to be elaborated later).
4.1.3 Music Emotion Model
Russell's Valence-Arousal (V-A) emotion plane model [27] is adopted in the dataset. Instead of using continuous two-dimensional vectors, each axis (valence/arousal) is partitioned into five discrete values (−2, −1, 0, 1, 2), with 2 being the highest (pleasure in valence, energetic in arousal), −2 being the lowest (displeasure in valence, tired in arousal), and 0 representing a neutral response. During the annotation process, the participants are required to give V-A ratings to songs in the domain of these five discrete values.
4.1.4 Annotation Process
To further examine the individual contribution of lyrical text and music track to music
emotion, we separately collect the emotion ratings of the two parts. In the annotation
process, we define three different sessions for collecting emotion ratings of lyrical text
only (L), music track only (M), and both (ML). The participants can only see the lyrical text in the L session and hear only the music track in the M session, while they can perceive both sets of information in the ML session. Note that in the M session, we do not remove the vocals from the music track, since we consider the vocals an important component of a music track for conveying emotion. In each session, the participants are asked to give a V-A rating to the song. Besides the ratings, additional information is also collected in the process, including the familiarity and preference level of a song. Figure 4.1 shows a snapshot of the user interface for annotation in the three sessions.
(a) The L session (Chinese) (b) The M session (Chinese) (c) The ML session (Chinese)
Figure 4.1: The user interface for annotation in the three sessions. A two-dimensional V-A plane, partitioned into five discrete regions on each axis, is provided for rating the emotion. The participant can directly select a region to specify the V-A values.
Quadrant      (+, +)   (−, +)   (−, −)   (+, −)
Class         I        II       III      IV
Number        213      212      52       49
Percentage    40%      40%      10%      10%

Table 4.1: Number of songs and percentage across the four emotion classes (refer to Figure 4.2).

Figure 4.2: Four emotion classes categorized by the corresponding quadrants on the V-A plane.
4.1.5 Usage of the POP500 Dataset
Each song in the dataset is associated with an emotion rating (V-A value) that was manually annotated by pre-trained users. In our system, we categorize all the songs in the dataset into four classes according to the quadrant they belong to. Table 4.1 shows the number of songs and percentages across the four emotion classes in the dataset.
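As a concrete illustration of this categorization, the sketch below (ours, not from the thesis) maps a V-A rating to one of the four quadrant classes; ratings of exactly 0 lie on an axis, and folding them into the non-negative side is an arbitrary tie-breaking assumption.

def emotion_class(valence: float, arousal: float) -> int:
    """Map a V-A rating (each value in [-2, 2]) to quadrant classes I-IV.

    Class 1: (+, +), 2: (-, +), 3: (-, -), 4: (+, -); zeros are treated
    as non-negative, an arbitrary tie-breaking choice.
    """
    if valence >= 0:
        return 1 if arousal >= 0 else 4
    return 2 if arousal >= 0 else 3

# A pleasant, energetic song falls in class I; a gloomy, tired one in class III.
assert emotion_class(1, 2) == 1
assert emotion_class(-2, -1) == 3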
4.2 Preference Modeling with Reinforcement Learning
Reinforcement learning is adopted here to learn a personalized preference model in
the UPM for playlist generation. In the following sections, we explicitly describe the prob-
lem formulation, learning methods, and parameter tuning.
4.2.1 Modeling States and Actions
To learn the user's preference for transition patterns of music emotion, the states should represent the user's accumulated and current listening experience, i.e., the past listening sequence. One possible approach is to allow the states to contain any sequence of music emotions the user has listened to so far. Of course, with this approach we would face an infinite number of states, which would make it difficult for the learning process to converge.
In our approach, we restrict the state space in the learning process to a manageable size. For this purpose, we adopt the N-gram model, which is commonly used in natural language processing to predict the next item in a sequence. In our model, a sliding window of size m over the user's listening history is used. That is, a state contains only the last m songs' emotion classes. The assumption behind the N-gram model is that knowing only the last m songs' emotion classes gives us enough information to predict the user's next preferred music emotion. With the states defined, we define the action as choosing an emotion class. Taking an action moves the sliding window ahead one step and updates the state. Figure 4.3 illustrates an example. One tricky
issue worth considering is that if an action turns out to be a bad choice (the user does
not like the recommended song), should we move the sliding window anyway or leave
the state unchanged? In our experiment, we simply leave the state unchanged.
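A minimal sketch of this state-action encoding (our illustration): the state is the tuple of the last m emotion classes, a taken action appends a class, and a rejected recommendation leaves the state untouched.

from collections import deque
from typing import Tuple

class EmotionState:
    """Sliding-window (N-gram) state over the last m emotion classes."""

    def __init__(self, m: int, seed_class: int):
        self.window = deque([seed_class], maxlen=m)

    def as_state(self) -> Tuple[int, ...]:
        return tuple(self.window)

    def apply_action(self, emotion_class: int, accepted: bool) -> Tuple[int, ...]:
        # Only advance the window when the user accepts the recommendation;
        # a rejected song leaves the state unchanged, as in our experiment.
        if accepted:
            self.window.append(emotion_class)
        return self.as_state()

# Example with m = 2: a seed song in class 1, then an accepted class-3 song.
s = EmotionState(m=2, seed_class=1)
assert s.apply_action(3, accepted=True) == (1, 3)
assert s.apply_action(2, accepted=False) == (1, 3)  # skipped: state unchanged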
4.2.2 Designing the Reward Function
Using a reward function to measure the progress towards the goal is one characteristic
of reinforcement learning. In our APG problem, we use two reward function types:
Figure 4.3: State and action definition with window size m = 3. The initial state starts with a song with emotion a. Taking an action moves the sliding window ahead one step and changes the state.
implicit feedback (F_I) and explicit feedback (F_E). Implicit feedback includes the listening time (LT) and the number of replays (NR), both of which may reflect the user's appreciation of the recommended song. Also, users are allowed to rate each recom-
mended song and we use this user rating (UR) as the explicit feedback. The overall
reward function can be summarized as follows:
$$r_t = \delta R^E_t + (1 - \delta) R^I_t, \qquad 0 \le \delta \le 1 \qquad (4.1)$$
where R^E_t is the explicit feedback and R^I_t is the implicit feedback at time t. The parameter δ controls the relative contribution of each type of feedback. We define two components in the reward function based on the assumption that their relative significance differs: explicit feedback might provide more
accurate information, while implicit feedback might contain more noise. The individ-
ual feedback is computed as follows:
$$R^E_t = \sum_{i \in F_E} c_i f_{it}, \qquad R^I_t = \sum_{j \in F_I} c_j f_{jt}$$

where F_E = {UR} is the set of explicit feedback, F_I = {LT, NR} is the set of implicit feedback, f_it is the score of feedback i at time t, and c_i and c_j stand for the weights of explicit and implicit feedback, respectively.
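Equation 4.1 transcribes directly into code. The sketch below is ours; the feedback scores are assumed to be pre-normalized to [0, 1], and the weight values are illustrative rather than taken from the thesis.

def reward(explicit: dict, implicit: dict, weights: dict, delta: float = 0.5) -> float:
    """r_t = delta * R^E_t + (1 - delta) * R^I_t (Equation 4.1).

    explicit maps explicit-feedback names to scores, e.g. {"UR": 0.8};
    implicit likewise, e.g. {"LT": 0.9, "NR": 0.1}. weights holds the c_i.
    """
    r_explicit = sum(weights[k] * v for k, v in explicit.items())
    r_implicit = sum(weights[k] * v for k, v in implicit.items())
    return delta * r_explicit + (1 - delta) * r_implicit

# Example: a highly rated song, mostly listened through, replayed once.
r = reward({"UR": 0.8}, {"LT": 0.95, "NR": 1.0},
           weights={"UR": 1.0, "LT": 0.7, "NR": 0.3}, delta=0.5)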
4.2.3 Solving APG with Temporal-Difference Learning
Several solution methods can be used for solving the reinforcement learning problem.
Temporal-difference (TD) learning is one of the fundamental methods concerning the
notion of delayed reward. The idea of delayed reward is that taking a specific action
may affect not only the immediate reward but also the next situation and so on. More-
over, the reward might not occur in every step, and might even be unpredictable. An
analogy of the delayed reward is to prepare for a midterm exam. A student might study
hard every day for a month before the midterm. During the one-month preparation, the
student might not receive any kind of reward. The actual reward occurs one month later
(after passing the midterm). However, we cannot totally ignore the effort of studying
hard during the month. That is, all of the actions taken beforehand should somehow
share a portion of the delayed reward. Therefore, TD learning takes the mechanism of
propagating the delayed reward back to the previous states into consideration.
In addition, temporal-difference learning combines the ideas of Monte Carlo methods and
Dynamic Programming (DP). That is, it not only can directly learn from experience
without knowing the model of the environment's dynamics, but also updates estimates
based in part on other learned estimates, without waiting for a final outcome (a process
known as bootstrapping).
In the APG problem, TD learning serves as a well-suited method to solve the prob-
lem, since we do not have the exact model of the environment dynamics beforehand,
e.g., the exact reward of performing an action in a state, and we want the learning to be
as fast as possible. Algorithm 1 specifies TD learning in procedural form. Basically, it is composed of a nested loop. For each step (inner loop) in each new episode (outer loop), four major actions are performed: (1) choose an action for the state given the policy; (2) take the action; (3) observe the resulting reward and the next state; and (4) update the action-value function (Q-value).
Algorithm 1 Temporal-difference learning
 1: Initialize Q(s, a) arbitrarily, π to the policy to be evaluated
 2: repeat {for each new episode}
 3:   Initialize s
 4:   repeat {for each step in the episode}
 5:     Choose action a given by π for s
 6:     Take action a
 7:     Observe reward r and next state s′
 8:     Update Q(s, a)
 9:     s ← s′
10:   until s is terminal
11: until the end of learning
In our approach, two TD learning methods are used: Q-Learning and SARSA. Both
methods are primarily concerned with estimating the value of performing any action
in each state, known as the action-value function (Q-value). Their main difference is
that Q-Learning is an off-policy control method and SARSA is an on-policy control
method. We use the following update rules:
$$\text{SARSA:} \quad Q(s_t, a_t) \leftarrow (1 - \alpha_n) Q(s_t, a_t) + \alpha_n [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})] \qquad (4.2)$$

$$\text{Q-Learning:} \quad Q(s_t, a_t) \leftarrow (1 - \alpha_n) Q(s_t, a_t) + \alpha_n [r_{t+1} + \gamma \max_a Q(s_{t+1}, a)] \qquad (4.3)$$
Algorithm 2 APG with the SARSA method
 1: Initialize Q(s, a) arbitrarily
 2: repeat {for each episode}
 3:   Initialize s
 4:   Choose a from s using the ε-greedy policy
 5:   repeat {for each step in the episode}
 6:     Take action a, generate next song with g(a)
 7:     Observe r and next state s′
 8:     Choose a′ from s′ using ε-greedy
 9:     Update Q with (4.2)
10:     s ← s′; a ← a′
11:   until s is terminal
12: until the end of learning
where $\alpha_n = \dfrac{1}{1 + \mathrm{VisitCount}(s, a)}$
Here s_t, a_t, and r_t denote the state, action, and reward at time t, respectively; α_n is the learning rate and γ is the discount factor. VisitCount(s, a) represents the number of visits to Q(s, a). The decreasing value of α_n helps the Q-value gradually converge.
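The two update rules differ only in their bootstrap target: SARSA uses the action actually chosen next, while Q-Learning uses the greedy maximum. The tabular sketch below is ours; the discount factor value is illustrative, and VisitCount counts prior visits as in the formula above.

from collections import defaultdict

class TDLearner:
    """Tabular SARSA / Q-Learning with a visit-count-based learning rate."""

    def __init__(self, actions, gamma=0.9):
        self.Q = defaultdict(float)      # (state, action) -> action value
        self.visits = defaultdict(int)   # (state, action) -> prior visit count
        self.actions = actions
        self.gamma = gamma

    def update(self, s, a, r, s_next, a_next=None):
        """Apply Eq. 4.2 when a_next is given (SARSA), else Eq. 4.3 (Q-Learning)."""
        if a_next is not None:
            target = r + self.gamma * self.Q[(s_next, a_next)]
        else:
            target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
        alpha = 1.0 / (1 + self.visits[(s, a)])  # alpha_n = 1 / (1 + VisitCount)
        self.visits[(s, a)] += 1
        self.Q[(s, a)] = (1 - alpha) * self.Q[(s, a)] + alpha * target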
The online and incremental procedure of learning and playlist generation can be
summarized in Algorithms 2 and 3:
Algorithm 2 provides the pseudocode of our APG solution utilizing the SARSA method, where s and s′ in lines 3 and 7 denote the current state and next state, a and a′ in lines 4 and 8 denote the actions, and r in line 7 denotes the received reward. Q(s, a) represents the action-value function of performing action a in state s. g(a) is the function used to select a song given an action a, which is the chosen emotion in our definition (to be elaborated later). As in all on-policy methods, we continually estimate Q^π for the
Algorithm 3 APG with the Q-Learning method
 1: Initialize Q(s, a) arbitrarily
 2: repeat {for each new episode}
 3:   Initialize s
 4:   repeat {for each step in the episode}
 5:     Choose a from s using the ε-greedy policy
 6:     Take action a, generate next song with g(a)
 7:     Observe r and next state s′
 8:     Update Q with (4.3)
 9:     s ← s′
10:   until s is terminal
11: until the end of learning
behavior policy π, and at the same time change π toward greediness with respect to Q^π.
Algorithm 3 provides the pseudocode of our APG solution utilizing the Q-Learning method, where s and s′ in lines 3 and 7 denote the current state and next state, a in line 5 denotes the action, and r in line 7 denotes the received reward. Q(s, a) represents the action-value function of performing action a in state s. g(a) is the function used to select a song given an action a, which is the chosen emotion in our definition (to be elaborated later). In this case, the learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed.
Since the action we defined is not which song should be recommended but which
song emotion should be recommended, a rule, g(a), deciding how to select a song from
an emotion category is required. In our method, we just randomly select one. Thus, the
APG system dynamically recommends the next song according to the current Q-values and the policy followed. Here we use the ε-greedy policy. That is, most of the time the action with the optimal Q-value in a state is taken, but with a small probability ε a random action is taken to help explore other states and converge. We use an exponentially decayed ε value:
$$\epsilon_t = \epsilon_0 e^{-\lambda t}, \qquad \lambda > 0 \qquad (4.4)$$
Here ε_t is the ε value at time t, ε_0 is the initial value, and λ is the decay constant. Selecting a higher ε at the beginning and reducing it over time helps find the optimal action earlier while still obtaining good rewards in the long term.
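For concreteness, an ε-greedy action selector with the decay schedule of Equation 4.4 might look as follows (our sketch; ε_0 = 0.5 is an illustrative initial value, while λ = 0.1 matches the value chosen in Section 4.2.4).

import math
import random

def epsilon(t: int, eps0: float = 0.5, lam: float = 0.1) -> float:
    """Exponentially decayed exploration rate (Equation 4.4)."""
    return eps0 * math.exp(-lam * t)

def choose_action(Q, state, actions, t):
    """Epsilon-greedy: explore with probability eps_t, otherwise act greedily.

    Q is a mapping from (state, action) to value, e.g. a defaultdict.
    """
    if random.random() < epsilon(t):
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])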
4.2.4 Parameter Selection with Simulation
Before being evaluated by real users, we first use hypothetical users (HU) to evaluate
our approach under different settings, including various window sizes m in the state
definition and varying parameters in learning methods. Two different HUs, one with
simple behavior and the other with more complicated preferences, are defined as fol-
lows:
HU-1 The user always prefers songs with the same emotion class as the seed song.
Any song with a different emotion class is skipped.
HU-2 The user usually prefers a smooth emotion transition between songs in the playlist, but with probability 10% chooses randomly. Here, a smooth emotion transition means that the emotion classes of any two contiguous songs in the playlist differ in sign along at most one dimension of the V-A plane, e.g., a transition from class I to class II or to class IV is smooth, but a transition to class III is not.
In the hypothetical cases, only the implicit feedback is considered.
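The HU-2 acceptance rule can be made precise: two quadrant classes are smooth neighbors when their V-A signs differ along at most one axis. The sketch below is ours; interpreting "choose randomly" as a fair random accept/reject is an assumption.

import random

# Quadrant class -> (valence sign, arousal sign)
SIGNS = {1: (1, 1), 2: (-1, 1), 3: (-1, -1), 4: (1, -1)}

def smooth(c1: int, c2: int) -> bool:
    """Contiguous songs may differ in sign along at most one V-A axis."""
    return sum(a != b for a, b in zip(SIGNS[c1], SIGNS[c2])) <= 1

def hu2_accepts(prev_class: int, next_class: int) -> bool:
    """HU-2 usually demands a smooth transition, but acts randomly 10% of the time."""
    if random.random() < 0.1:
        return random.choice([True, False])
    return smooth(prev_class, next_class)

# Class I -> II flips only the valence sign (smooth); I -> III flips both (not smooth).
assert smooth(1, 2) and not smooth(1, 3)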
Figure 4.4: Learning curve of HU-1 (simple case). The horizontal axis represents the number of episodes; the vertical axis represents the Normalized Root Mean Square Error (NRMSE). Q-Learning converges much faster than SARSA, and the learning curves of SARSA show a more obvious fluctuation. In addition, choosing a different window size m can affect the time to convergence.
Different Window Sizes
In the state definition, we adopt the N-gram model with a sliding window of size m over the user's listening history. The window size can be critical to our system's performance: increasing m might provide more information, but it also enlarges the state space and greatly increases the convergence time. For each HU, we automatically generate 200 listening episodes, each containing 20 songs, with the two learning methods under different window sizes m = 1, 2, 3. Each learning curve is averaged over 100 runs. Figures 4.4 and 4.5 illustrate the learning curves of HU-1 and HU-2, respectively.
The results first show that Q-Learning converges much faster than SARSA in both
HUs, and since the convergence properties of SARSA are related to the policy's depen-
dence on Q, the learning curves show a more obvious fluctuation. Second, in HU-2,
Figure 4.5: Learning curve of HU-2 (complicated case). The horizontal axis represents the number of episodes; the vertical axis represents the Normalized Root Mean Square Error (NRMSE). Q-Learning converges much faster than SARSA, and the 10% random behavior makes the NRMSE higher and the fluctuation more obvious for both learning methods. In addition, choosing a different window size m can affect the time to convergence.
a more complicated case than HU-1, the 10% probability randomness causes the Nor-
malized Root Mean Square Error (NRMSE) to become higher and the fluctuation more
obvious in both learning methods. Third, choosing different window sizes m can affect the convergence time. For example, with a tri-gram (m = 3), even for the simpler HU-1, more than 100 episodes are required for Q-Learning to converge to an NRMSE below 0.1. This might make it difficult to apply the learning algorithm in practice. Thus, choosing an appropriate window size is critical in our approach. In our experiment, we choose a window size of 2 to balance convergence time and modeling ability.
Different decay constants in the ε-greedy policy

The decay constant in Equation (4.4) controls the decreasing rate of the ε value in the ε-greedy policy. A higher value of the decay constant causes the policy to follow the
currently optimal actions sooner, but might not explore all the states enough times. On the other hand, a lower value gives the policy a higher probability of exploring all the states, but might not follow the optimal actions immediately. To determine an appropriate value of the decay constant, we gradually change the value for both HUs and learning methods, and then observe the average reward per episode.
                  λ = 0.01   0.05   0.10   0.15   0.20
HU-1  SARSA           68.9   93.4   96.1   93.9   91.5
      Q-Learning      69.0   93.4   96.2   94.9   91.5
HU-2  SARSA           76.1   79.2   81.4   82.3   81.2
      Q-Learning      76.5   80.6   81.5   82.6   82.0

Table 4.2: Average reward per episode (normalized to the range [0, 100]) with different λ for both HUs and both methods.
Table 4.2 shows that for both HUs and both methods, increasing λ initially leads to a higher average reward. However, after a peak, the average reward starts to drop, as expected. In our experiment, we choose λ = 0.1.
Chapter 5
Experimental Evaluation
To best evaluate our approach to APG, we deployed a real-user experiment for two months and compared different methods with several evaluation metrics. The
experiment details and results are elaborated in the following sections.
5.1 The Participants
There were a total of five participants in the experiment. Most of them were college students aged 20 to 30.
5.2 Experiment Design
The experiment was divided into training and testing phases. In the training phase,
participants were required to listen to at least 20 songs in each single episode, and
follow the scenario (described in Section 3.1) to give the system feedback during the
listening period. In addition, participants were asked to describe the listening context, i.e., the activity while listening to the music, of each listening episode. Both methods, Q-Learning and SARSA, were applied to learn participants' preferences for music emotion transitions.
In the testing phase, each episode started with a seed song chosen by the participant. Then one of the three methods, Shuffle, SARSA, and Q-Learning, was randomly selected to generate the next song in the playlist. Participants followed the same scenario (described in Section 3.1) as in the training phase. An episode ended when the length of the playlist reached 20. Shuffle (random generation) was chosen as the baseline approach for comparison, since it is one of the most common APG methods provided in modern music players (e.g., the iPod). Participants were not aware of the exact underlying method and listened to the music as usual. Besides providing all the usual feedback during an episode, participants were asked to give a rating (from 1 to 5) to each generated playlist for later evaluation.
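The testing-phase protocol can be summarized in the following Python sketch. The stub recommender and simulated rating are hypothetical placeholders; only the protocol itself (participant-chosen seed song, a method drawn at random per episode, 20-song playlists, an explicit 1-to-5 rating) follows the description above.

```python
import random

PLAYLIST_LENGTH = 20
METHODS = ["Shuffle", "SARSA", "Q-Learning"]

def generate_next_song(method, playlist):
    """Hypothetical stand-in for the real MEonPlay recommenders."""
    return f"song-{len(playlist)}"

def run_test_episode(seed_song):
    """One testing-phase episode; the participant never sees `method`."""
    method = random.choice(METHODS)
    playlist = [seed_song]
    while len(playlist) < PLAYLIST_LENGTH:
        playlist.append(generate_next_song(method, playlist))
        # In the real study, skip/replay feedback is logged per song here.
    rating = random.randint(1, 5)  # stub for the participant's rating
    return method, playlist, rating

method, playlist, rating = run_test_episode("seed song")
print(method, len(playlist), rating)
```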
Starting from the training phase, each participant waited until the number of training samples exceeded a certain threshold (20 in our experiment) before entering the testing phase. On average, an episode with 20 songs lasted more than one hour, and participants performed fewer than two episodes per day. Therefore, the whole user experiment took around two months to reach a preliminary result. Figure 5.1 shows the histogram of the episode count of each user in the training phase and the testing phase, respectively.
To spare participants the complexity of learning a new user interface, we implemented our experiment as a plug-in for Apple iTunes. Thus, the participants could provide more accurate information while using the familiar iTunes music player. Figure 5.2 shows a snapshot
Figure 5.1: The histogram of the episode count of each user in the training phase and the testing phase, respectively.
of the experiment interface. Two playlists, training and testing, are automatically created in the training phase and the testing phase, respectively.
5.3 Evaluation Metrics
We use the following four metrics to evaluate each approach:
5.3.1 Miss Ratio
Miss Ratio measures the percentage of unsuccessfully recommended songs. It is calculated as follows:

\[
\text{Miss Ratio} = \frac{1}{N}\sum_{i=1}^{N}\frac{skip(i)}{n} \qquad (5.1)
\]
Figure 5.2: A snapshot of the experiment interface.
where skip(i) denotes the number of skips during episode i, n is the length of the playlist (n = 20 in our experiment), and N is the total number of episodes.
5.3.2 Miss-to-Hit(k)
Miss-to-Hit(k) measures, on average, how many skips are required to obtain k successful recommendations. It is defined as follows:
\[
\text{Miss-to-Hit}(k) = \left(\frac{1}{N}\sum_{i=1}^{N}\frac{skip(i)}{n - skip(i)}\right)\cdot k \qquad (5.2)
\]
5.3.3 Listening-Time Ratio
Listening-Time Ratio is the average ratio of a user's actual listening time to the total length of a song, and is defined as follows:
\[
\text{Listening-Time Ratio} = \frac{1}{M}\sum_{m=1}^{M}\frac{LT(m)}{TT(m)} \qquad (5.3)
\]
where LT(m) denotes the listening time of song m, TT(m) denotes the total time of song m, and M is the number of played songs.
5.3.4 User Rating
User Rating is the average of the playlist ratings explicitly given by users:
\[
\text{User Rating} = \frac{1}{N}\sum_{i=1}^{N} rating(i) \qquad (5.4)
\]
where rating(i) denotes the rating given to the playlist of episode i.
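To make the four metrics concrete, the following Python sketch computes them from per-episode logs. The log layout, plain lists of skip counts, listening and total times, and ratings, is a hypothetical representation of the data the study collects.

```python
def miss_ratio(skips, n=20):
    """Eq. (5.1): average fraction of skipped songs per episode."""
    return sum(s / n for s in skips) / len(skips)

def miss_to_hit(skips, k=20, n=20):
    """Eq. (5.2): average skips needed for k successful recommendations."""
    return sum(s / (n - s) for s in skips) / len(skips) * k

def listening_time_ratio(listen_times, total_times):
    """Eq. (5.3): average ratio of listening time to song length."""
    ratios = [lt / tt for lt, tt in zip(listen_times, total_times)]
    return sum(ratios) / len(ratios)

def user_rating(ratings):
    """Eq. (5.4): average explicit playlist rating."""
    return sum(ratings) / len(ratings)

# Hypothetical logs from three episodes / a handful of songs.
print(miss_ratio([3, 5, 2]))                        # ~0.167
print(miss_to_hit([3, 5, 2], k=20))                 # ~4.14
print(listening_time_ratio([120, 30], [240, 200]))  # 0.325
print(user_rating([3, 4, 5]))                       # 4.0
```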
5.4 Evaluation Results
Table 5.1 lists the mean and standard deviation of the Listening-Time Ratio for the Shuffle, SARSA, and Q-Learning methods. The Q-Learning method outperforms the Shuffle and SARSA methods, showing a higher mean and a much lower standard deviation, while SARSA and Shuffle perform similarly. Table 5.2 lists the mean and standard deviation of the User Rating for the three methods. The result mirrors the previous one, except that SARSA achieves a slightly better mean user rating than the Shuffle
method. The reason why SARSA did not perform as well as the Q-Learning method may be its longer convergence time and its fluctuation, as described in Section 4.2.4.
       Shuffle   SARSA    Q-Learning
Mean   80.56%    79.83%   82.91%
STD    14.01%    14.31%    7.68%
Table 5.1: The mean and standard deviation of the Listening-Time Ratio of the Shuffle, SARSA, and Q-Learning methods.
       Shuffle   SARSA   Q-Learning
Mean   2.67      2.83    3.32
STD    0.94      1.09    0.65
Table 5.2: The mean and standard deviation of the User Rating of the Shuffle, SARSA, and Q-Learning methods; scoring range: [1, 5].
Figure 5.3 shows the results for the Miss Ratio. The Miss Ratio of both SARSA and Q-Learning is better than that of Shuffle; in particular, Q-Learning outperforms Shuffle by 10%. Figure 5.4 shows the result for Miss-to-Hit(20), where the advantage becomes even larger. To generate a satisfactory playlist of length 20, users have to skip 9 songs on average with Q-Learning, but nearly 15 songs with Shuffle. That is, to obtain one successfully recommended song, Shuffle users have to expend more than 1.5 times the effort of Q-Learning users (15/9 ≈ 1.67).
A major difference between a single recommendation and automatic playlist generation is that APG is concerned with listening continuity within a playlist. To see the
Figure 5.3: The Miss Ratio of Shuffle, SARSA, and Q-Learning methods.
Figure 5.4: The Miss-to-Hit(20) of Shuffle, SARSA, and Q-Learning methods.
improvement in this aspect, we count the number of non-interrupted songs in a playlist (the number of songs played continuously without an interposed skip operation) for each method. Figure 5.5 shows the distribution of run lengths of continuous songs for each method. In all three methods, runs of two continuous songs dominate. However, in the SARSA and Q-Learning methods, the percentage of longer runs is clearly higher than in the Shuffle method. Moreover, long runs (> 10 songs) occur more often with SARSA and Q-Learning than with the Shuffle
Figure 5.5: The Continuous Play of Shuffle, SARSA, and Q-Learning methods.
method. For example, with Q-Learning the maximum number of continuous songs is 13, whereas with Shuffle a run longer than 9 songs never occurs.
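Counting these runs is straightforward; this sketch derives run lengths from a hypothetical per-song event log in which True marks a song played through and False a skip.

```python
def continuous_runs(events):
    """Return lengths of runs of songs played without an interposed skip."""
    runs, current = [], 0
    for played in events:
        if played:
            current += 1
        else:
            if current > 0:
                runs.append(current)
            current = 0
    if current > 0:
        runs.append(current)
    return runs

# Example: a 2-song run, a skip, then a 3-song run.
print(continuous_runs([True, True, False, True, True, True]))  # [2, 3]
```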
The above results show the average performance of the three methods. To further examine how our approach applies to different users and contexts, we categorized the data into two groups according to the listening context given by the users. Figure 5.6 shows the Miss-to-Hit Ratio of two users under the two contexts, respectively.
The results show that, under the same context, our approach may perform differently for different users. For example, under the Working context, SARSA and Q-Learning save User 1 nearly half the effort of the Random (Shuffle) method, whereas the context makes no obvious difference for User 2. In addition, the results show that, for a specific user, our approach may perform differently under different contexts. For example, for User 2, our approach provides little improvement under the Working context, but significantly outperforms the Random method under the Leisure context, especially
Figure 5.6: The Miss-to-Hit Ratio of User 1 and User 2 under the Working and Leisure contexts, respectively.
with the Q-Learning method. Based on these observations, the results suggest that listening context is a critical, and user-dependent, factor for APG systems.
To sum up, on average our approach to personalized APG shows a clear improvement over the baseline approach. In particular, the Q-Learning method outperforms the other methods in most of the evaluation metrics, which also corresponds to the simulation results with the hypothetical users. Moreover, the observed influence of listening context may serve as a preliminary investigation for future emotion-based APG systems.
Chapter 6
Conclusion
Organizing and searching songs by emotion has become a novel and natural trend in music exploration. Inspired by this trend and the corresponding requirements of music recommendation, this thesis proposes an emotion-based adaptive preference model for automatic playlist generation (APG). We argue that it is more appropriate to consider the APG problem as a continuous optimization problem. First, we explicitly describe the user scenario and formally define the problem to be solved. MEonPlay, an emotion-based personalized APG system, is proposed as a solution. Reinforcement learning is adopted to model the emotion-based personalized APG problem, and we describe how we formulate it as a reinforcement learning problem, including the state and action definitions and the reward function. Because temporal-difference (TD) learning conforms well to the characteristics of the problem, two TD learning methods, SARSA and Q-Learning, are applied to solve it. Furthermore, we create hypothetical users to help fine-tune the learning models' parameters, including the window size in the state definition and the decay constant in the ε-greedy policy. At the
end, we conduct a two-month user study to evaluate our approach. Several evaluation metrics are defined to measure the success of our playlist recommendation. The results show that the Q-Learning approach outperforms the SARSA and Shuffle (baseline) methods in both the listening-time ratio and the user rating measurements. In the miss-ratio and miss-to-hit(k) measurements, both Q-Learning and SARSA show superior performance over Shuffle. The continuous-play measurement provides preliminary evidence that our approach effectively extends users' non-interrupted play of songs. A final observation also indicates a relationship between music emotion and listening context.
6.1 Summary of Contributions
In contrast to previous research on automatic playlist generation, our contributions can be summarized as follows. First, instead of using metadata or audio similarity, we approach the APG problem from a different perspective: following the emerging trend and its requirements, we generate playlists based on song emotions. Second, we consider the APG problem as a continuous optimization problem and propose the MEonPlay system for emotion-based personalized APG; our novel emotion-based adaptive preference model can be utilized in future personalized APG systems. Third, several evaluation metrics are defined for measuring the success of an APG system, and a real user study is conducted to validate our approach.
6.2 Future Work
This thesis presents preliminary results for a purely emotion-based APG system. The factors that determine a user's listening preferences are complicated and user-dependent, and how best to leverage these multi-dimensional factors in our approach can be further examined in the future. In addition, reducing the convergence time of learning is an important issue for applying our approach in real situations. A future study may investigate methods to speed up both the learning and the adaptation process.