Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University
Master Thesis

Learning Emotion Transitions for a Personalized Playlist Recommender
(基於學習音樂情緒轉變之個人化播放清單推薦系統)

Chung-Yi Chi (紀忠毅)

Advisors: Jane Yung-jen Hsu, Ph.D.
          Chih-Wen Hsueh, Ph.D.

June 2009
Acknowledgments
A hundred years span thirty-six thousand days; night after night, one should hold a candle. Two fulfilling years of research life have yielded a rich body of results. Doubt and setbacks never ceased, and I am grateful to the many close friends whose company and support along the way helped me overcome them one by one.

Thanks to my friends 雅晴, 曜安, 炳傑, and 鴻銘, with whom I could share the frustrations of research and sharpen my thinking, and who offered constant companionship, concern, and advice on my experiments.

Thanks to everyone in the lab for keeping me surrounded by an atmosphere of research and good cheer. Thanks to 婉容 for her long-standing assistance in smoothly resolving every difficulty. Thanks to 翰文 for his help and patience since our undergraduate days, never tiring of helping me solve any problem. Thanks to the iPlayr research group, 映嫻, 薇蓉, and 居正, for their tireless discussion and suggestions every Friday evening. Thanks to 冠鋆 and Todd for helping revise the English wording and making this thesis more polished. Thanks to all the classmates who helped collect the experimental data: 嘉涓, 啟嘉, 于晉, 皓遠, 琮傑, 中川, 庭嫣, 守壹, and 彥伶.

Thanks to my advisors, Professor Jane Yung-jen Hsu and Professor Chih-Wen Hsueh, who provided tremendous help in my research, clarifying my goals when I lost direction and pointing the way when I hit bottlenecks, and who offered sound advice in life, showing me how to face an unpredictable future and inner uncertainty with an optimistic, dedicated attitude. Thanks to Professor Chun-Nan Hsu of Academia Sinica, Professor Ying-Ping Chen of National Chiao Tung University, Professor Tzong-Han Tsai of Yuan Ze University, and Professor Ja-Ling Wu of National Taiwan University for serving on my oral examination committee and offering many valuable suggestions during the defense.

Without learning, one cannot broaden one's talent; without resolve, one cannot accomplish learning. May I bring passion, effort, and persistence to everything that lies ahead. I dedicate this work to my family and to everyone who cares about me.
Abstract
The digitization and online distribution of music in the Internet era have made an enormous volume of digital music accessible and have diversified the ways in which consumers explore it. In particular, an emerging trend in music exploration is to organize and search for songs according to their emotions. However, research on Automatic Playlist Generation (APG) primarily focuses on leveraging traditional metadata and audio similarity for recommendation. Moreover, mainstream solutions view APG as a static problem.
This thesis argues that the APG problem is better modeled as a continuous opti-
mization problem, and proposes an adaptive preference model for personalized APG
based on emotions. The main idea is to collect a user's behavior in music playing (e.g.,
rating, skipping and replaying) as immediate feedback in learning the user's preferences
for music emotion within a playlist.
Reinforcement learning is adopted to learn the user's current preferences, which
are used to generate personalized playlists. Learning parameters are tuned via simula-
tion of two hypothetical users. Several evaluation metrics are defined to measure the
performance of our approach. A two-month user study is conducted to evaluate the
APG solutions. The results show that in most of the evaluation metrics the proposed approach presents superior performance in comparison with the baseline approach.
Keywords: Automatic Playlist Generation, Music Recommender System, Music
Emotion Estimation, Reinforcement Learning, Machine Learning
Contents

Acknowledgments
Abstract
List of Figures
List of Tables

Chapter 1  Introduction
    1.1  Motivation
    1.2  Objectives
    1.3  Thesis Structure

Chapter 2  Related Work
    2.1  Music Emotion Models
    2.2  Music Emotion Estimation
    2.3  Automatic Playlist Generation
         2.3.1  Related Products
    2.4  Reinforcement Learning

Chapter 3  Emotion-Based Personalized APG
    3.1  Problem Definition
    3.2  Proposed Solution
         3.2.1  MEonPlay Automatic Playlist Recommender
         3.2.2  APG as a Reinforcement Learning Problem

Chapter 4  Emotion-Based Adaptive Preference Model
    4.1  Annotated Music Dataset
         4.1.1  Songs
         4.1.2  Participants
         4.1.3  Music Emotion Model
         4.1.4  Annotation Process
         4.1.5  Usage of the POP500 Dataset
    4.2  Preference Modeling with Reinforcement Learning
         4.2.1  Modeling States and Actions
         4.2.2  Designing the Reward Function
         4.2.3  Solving APG with Temporal-Difference Learning
         4.2.4  Parameter Selection with Simulation

Chapter 5  Experimental Evaluation
    5.1  The Participants
    5.2  Experiment Design
    5.3  Evaluation Metrics
         5.3.1  Miss Ratio
         5.3.2  Miss-to-Hit(k)
         5.3.3  Listening-Time Ratio
         5.3.4  User Rating
    5.4  Evaluation Results

Chapter 6  Conclusion
    6.1  Summary of Contributions
    6.2  Future Work

Bibliography
List of Figures

2.1  James Russell's two-dimensional (valence-arousal) model of emotions
2.2  Three steps to create playlists with iTunes Genius
2.3  A snapshot of iTunes Genius
2.4  A snapshot of Mirage
2.5  A snapshot of Tangerine!
2.6  Selecting BPM, beat intensity range, and playlist duration in Tangerine!
2.7  Selecting a workout pattern in Tangerine!
3.1  A sample user scenario
3.2  The system architecture of the MEonPlay Automatic Playlist Recommender
3.3  The agent-environment interaction in reinforcement learning
4.1  The user interface for annotation in the three sessions
4.2  Four emotion classes categorized by the corresponding quadrants on the V-A plane
4.3  State and action definition
4.4  Learning curve of HU-1 (simple case)
4.5  Learning curve of HU-2 (complicated case)
5.1  Histograms of the episode count of each user in the training phase and testing phase, respectively
5.2  A snapshot of the experiment interface
5.3  The Miss Ratio of the Shuffle, SARSA, and Q-Learning methods
5.4  The Miss-to-Hit(20) of the Shuffle, SARSA, and Q-Learning methods
5.5  The Continuous Play of the Shuffle, SARSA, and Q-Learning methods
5.6  The Miss-to-Hit Ratio of User 1 and User 2 under the Working and Leisure contexts, respectively
List of Tables

3.1  Summary of the notations
4.1  Number of songs and percentage across the four emotion classes (refer to Figure 4.2)
4.2  Average reward per episode (normalized to the range [0, 100]) with different λ for both HUs and both methods
5.1  The mean and standard deviation of the Listening-Time Ratio of the Shuffle, SARSA, and Q-Learning methods
5.2  The mean and standard deviation of the User Rating of the Shuffle, SARSA, and Q-Learning methods; scoring range: [1, 5]
Chapter 1
Introduction
The sweet and passionate melody captivated his heart from the first note;
it was full of radiance, full of the tender throbbing of inspiration and hap-
piness and beauty, continually growing and melting away; it rumoured of
everything on earth that is dear, secret and sacred to mankind; it breathed
of immortal sadness and it departed from the earth to die in the heavens.
Ivan Turgenev, Home of the Gentry
1.1 Motivation
Thanks to the advance of the Internet and storage technology, the availability of low-
cost music has massively increased in recent years. The ways in which people listen to and create music have been altered by several trends, including the shift from physical media to digital media and the dramatic drop in the cost of producing and sharing new music in the Web 2.0 era. Today there are over 10 million tracks available
on Apple's iTunes Store, with more than 6 billion songs sold since the service began on April 28, 2003.¹ Also, on peer-to-peer (P2P) networks there are more than 15 billion tracks available for download. In the future these figures will undoubtedly seem minuscule. All music will be online, billions of new tracks will be available, and millions of new arrivals will pour in every day. The resulting problem is that finding the right music is difficult. Moreover, an analysis of 5,600 iPod users² has shown that 80% of plays came from 23% of songs, and that 64% of the songs on their iPods were never played. This phenomenon corresponds to the Long Tail, a term introduced by Chris Anderson in 2004³ to describe the niche strategy of businesses that sell a large number of unique items, each in relatively small quantities. The issues of the music long tail are further explained in [9]. Thus, there is an obvious need to assist people, not only consumers but also producers of music, in filtering, discovering, personalizing, and recommending music from the ever-growing sea of song tracks.
To spare users the tedious and time-consuming manual selection of music, automatic playlist generation (APG) has become a mainstream research topic. Current music players provide various methods for exploring and searching a personal music collection: users can search their song collections by specifying certain song properties, including genres, artists, and albums. A novel and natural trend for music exploration is organizing or searching by song emotion. The beautiful quotation at the beginning of this chapter is from the novel Home of the Gentry by Ivan Turgenev. The protagonist of the novel is touched to the very depths of his soul by a piece
¹ http://www.techcrunch.com/2009/01/06/itunes-sells-6-billion-songs-and-other-fun-stats-from-the-philnote/
² http://blogs.sun.com/plamere/entry/what_s_on_your_ipod
³ http://www.wired.com/wired/archive/12.10/tail.html
of music being played on the piano. The elegant passage eloquently reveals the mystical relationship between music and the human mind. Treatment of music emotion is not limited to 19th-century literature, however. Our requirements for daily music listening actually indicate the need for emotion-based music exploration. Imagine that after a tiring workday, Johnny, an exhausted engineer, first looks for some relaxing music, and later would like to enjoy some upbeat music while planning a weekend trip. In addition, music emotion preferences may also change over time. For example, due to an approaching project deadline, Johnny might prefer calmer music to help him concentrate on work, even though he would normally prefer exhilarating music when not faced with a deadline.
With the advance of Emotion-based Music Information Retrieval (EMIR), emotion-based personal playlist generation has become an emerging research field. However, little APG research takes music emotion into consideration. Most implementations create playlists satisfying certain constraints given by users and systems, e.g., the temporal order of metadata or audio similarity [7, 6, 25] and temporal structure matching based on patterns of previous cases [8]. Moreover, typical APG systems view the problem as a static recommendation process and do not take the temporal effect of user preferences into consideration. Thus, current approaches to APG and music recommendation may not provide an ideal solution.
1.2 Objectives
This thesis aims to provide a novel emotion-based personalized playlist recommender,
with two major objectives. One objective is to leverage music emotion as a major di-
mension for music exploration and recommendation. An investigation is required to
explore the application of music emotion. The second objective is to provide person-
alized automatic playlist generation. Here personalized APG consists of two require-
ments: (1) The recommendation should be tailored according to an individual user's
preferences. (2) The recommendation should be adaptive to the individual user's pref-
erences as they change over time. By incorporating these aspects into our emotion-
based personalized playlist recommender, we expect to attain a degree of improvement
over current APG services.
1.3 Thesis Structure
The rest of the thesis is organized as follows: The next chapter provides an overview
of related work, including music emotion models, music emotion estimation, automatic playlist generation, and reinforcement learning. Chapter 3 explicitly defines the problem and proposes our solution. Chapter 4 presents how we model APG as a reinforcement learning problem and solve it with temporal-difference learning methods; the use of simulation to tune learning parameters is also described there. Chapter 5 presents the results of an experimental evaluation conducted as a two-month user study. The last chapter summarizes our work with a
conclusion and provides suggestions for future work.
Chapter 2
Related Work
In this chapter, we give an overview of related work on several topics: music emotion models, music emotion estimation, automatic playlist generation, and reinforcement learning.
2.1 Music Emotion Models
Previous research has established a strong relationship between music and emotion
[14]. There are two aspects of how humans perceive emotion from music. (1) Ob-
jective: music can convey universal emotions [12]. (2) Subjective: music can evoke
different emotions for different people [15]. Emotion can be modeled either categorically or as a vector. A categorical representation maps emotion into a number of discrete classes; for example, Paul Ekman's model of basic emotions includes anger, fear, sadness, happiness, and disgust [10]. The categorical model is the most commonly used because of its clear definition and goal-oriented nature. However, the inflexibility
of scaling the emotional coverage is the major limitation of the categorical model. In
contrast, the vector model represents emotions as points on a vector plane. Figure 2.1
illustrates the concept of James Russell's valence-arousal model of emotion [27]. The
plane is divided into quadrants: (1) positive valence with high arousal (Pos-High), (2)
negative valence with high arousal (Neg-High), (3) negative valence with low arousal
(Neg-Low), and (4) positive valence with low arousal (Pos-Low). A vector model has
the advantage of eliminating ambiguity between categories. For example, the ambigu-
ous meaning of cheerful and carefree may lower annotation consistency. In addition,
annotators may freely rate a song on a mood plane without being limited by pre-defined
tags [34].
Figure 2.1: James Russell's two-dimensional (valence-arousal) model of emotions. The horizontal axis (valence) represents pleasure/displeasure; the vertical axis (arousal) represents low/high arousal.
2.2 Music Emotion Estimation
Music Emotion Estimation (MEE) is essential for music recommendation/retrieval
based on matching the emotion of music with that of the user. Most current MEE
systems adopt audio-based approaches [17, 16, 33, 34]. In general, there is a lack of publicly available datasets for comparing system performance. The Computer Audition Lab 500-song (CAL500) dataset was released by Turnbull in 2007 [32]; thirty-six emotion tags were defined and used to annotate the songs. The CAL500 dataset was used to train a Supervised Multi-class Labeling (SML) model, which assigns multiple pre-defined tags to an input song with a Mean Average Precision (MAP) of 50.6%.
CAL500 has been widely used in training and evaluating MEE systems.
2.3 Automatic Playlist Generation
The problem of playlist generation was first introduced by Pachet et al. [19]. According
to the type of interaction with the user, current approaches of Automatic Playlist Gen-
eration (APG) can be categorized into two major forms: hint-based playlist generation
and constraint-based playlist generation. In the hint-based playlist generation approach,
the user is required to specify one or more songs as seed songs. The playlist is then
generated to consist of songs similar to the seed songs. Different kinds of similarity
measures can be used in this approach, including metadata-based similarity [21, 24, 26]
and audio-based similarity [18, 20]. The major limitation of hint-based playlist generation is that it tends to produce a highly uniform list of songs that may not be favored by every user, especially when the playlist is long. In the constraint-
based playlist generation approach, the user is allowed to explicitly specify the con-
straints that the generated playlist has to satisfy, mainly in three aspects: songs, order,
and length. Rich metadata are used in [7, 4, 3, 5] to describe user-specified constraints,
e.g., fast tempo, or at least 40% country pop music. Audio similarity is also included as
another constraint in [6, 11], e.g., timbre continuity throughout the playlist and smooth
transition from the first song to the last song. Several formulations of this problem have been proposed, including linear programming [4, 3], simulated annealing [22, 23], and the traveling salesman problem [25].
2.3.1 Related Products
iTunes Genius
Apple iTunes Genius¹ was a new feature first introduced in iTunes 8 to recommend
playlists. Two major functionalities are provided in iTunes Genius: Genius Playlist
and Genius Sidebar.
Genius Playlist, as implied in the name, is a method to create playlists. Select a
song, click the Genius button, and iTunes generates a playlist of songs from your library
that complement that song. In this way, Genius playlists help users to discover songs
in their library they never knew they had, and rediscover forgotten favorites. Figure
2.2 illustrates the three steps involved in creating Genius playlists.
Genius Sidebar is a method to introduce users to new music, movies, and TV shows.
Select a song, movie, or show in your library, and the Genius sidebar will display rec-
¹ http://www.apple.com/itunes/features/#genius
Figure 2.2: Three steps to create playlists with iTunes Genius: (1) Select a song. (2) Click the Genius button. (3) Genius creates a playlist.
ommendations from the iTunes Store which complement the selection. The Genius
sidebar will not recommend anything already in your library, and you can preview and
buy recommended items directly from the sidebar. Figure 2.3 is a snapshot of iTunes
Genius. In the middle of the iTunes window is the generated playlist based on the song
The Rose by Bianca Ryan. At the top-right region, users can either change the length of
the playlist (limited to 25, 50, 75, or 100 songs), refresh, or save the playlist. The Ge-
nius Sidebar is located at the right side of the window, and displays recommendations
from the iTunes Store, including music, movies, and TV shows.
MusicIP MyDJ
The MusicIP² MyDJ Plug-in, which works with several existing music players (in-
cluding iTunes, Windows Media Player, and WINAMP), instantly generates playlists
based on the selected song. The MyDJ Plug-in anonymously analyzes users' songs for
² http://www.musicip.com/
Figure 2.3: A snapshot of iTunes Genius. In the middle of the window is the Genius Playlist generated from the selected song. The Genius Sidebar, located at the right side of the window, displays recommendations from the iTunes Store that complement the selected song.
acoustic attributes and defining characteristics, allowing it to make intelligent mixes
and recommendations that go beyond genre classifications and editorial review. Once
analysis is complete, the music is ready for playlisting. A user may select a song to generate a playlist of similar music from his or her library. While users are enjoying com-
pelling music combinations and rediscovering all the great music they already have,
MyDJ will introduce users to new music based on the songs they love most. MyDJ
also provides Software Development Kits (SDKs) for digital devices, music software, and websites to automatically generate playlists and other features.
MoodLogic
MoodLogic³ was one of the first online music recommendation systems. The com-
pany obtained ratings on over 1 million songs by over 50,000 distinct listeners as part
of its proprietary method for modeling user preference space. In addition to their web
presence, the company created a software application that uses a central database to
allow users to collaboratively profile music by mood. Each user has a certain num-
ber of "credits" they can use to identify song profiles. Credits could be obtained by
either paying for them or profiling songs. This software allowed the user to generate mood-based playlists according to the user's current mood. The program could also mix
a playlist based on a selected song. This would return a playlist with songs of sim-
ilar tempo, mood, genre, etc. MoodLogic was acquired by All Media Guide (AMG),
the company that runs allmusic.com, in May 2006.⁴ It is not yet clear whether AMG intends to resume development and reactivate the community. As of March 2008, the MoodLogic site resolves to Macrovision's site. Effective March 3, 2008, Macrovision announced the End of Life (EOL) of the MoodLogic music management and recommendation software; the service was discontinued due to the intensive operational and infrastructure resources required to sustain the application. Macrovision's efforts in music recommendation continue through the AMG Data Services Tapestry business-to-business product.⁵
³ http://www.moodlogic.com/
⁴ http://www.cbronline.com/article_news.asp?guid=A47A2ACA-5830-4C10-B8A1-872B5FD4D6CF
⁵ http://en.wikipedia.org/wiki/MoodLogic
Mirage
Figure 2.4: A snapshot of Mirage.
Mirage⁶, an automatic playlist generation extension for Banshee, is an implementation of research in APG and music similarity. Mirage analyzes a user's music collection
and computes acoustic similarity models for each song. After the user's music collec-
tion has been analyzed, Mirage is able to automatically generate playlists of similar
music. In this project, Mirage was integrated into the popular GNOME audio player Banshee.⁷
Music information retrieval techniques are used in Mirage to compute a similarity
model for each song. This process includes the computation of the Fast Fourier Transform (FFT), Mel-Frequency Cepstral Coefficients (MFCCs) for psycho-acoustic modeling, and a multidimensional Gaussian Mixture Model (GMM) to finally represent a song with a timbre/similarity model. After the whole music collection is analyzed,
⁶ http://hop.at/mirage/
⁷ http://www.banshee-project.org/
each song has a similarity/timbre model attached to it. Users can then generate playlists by selecting a song (the seed song) they want the playlist to begin with, and Mirage searches all its models for similar songs. To do so, the Gaussian models computed in the previous step are compared using an optimized Kullback-Leibler divergence.
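To make this pipeline concrete, the sketch below computes a single-Gaussian timbre model from MFCCs and compares two songs with a symmetrized Kullback-Leibler divergence. It is our illustration of the general technique, not Mirage's actual implementation: the librosa library is assumed for audio loading and MFCC extraction, and the full GMM is simplified to a single Gaussian.

import numpy as np
import librosa  # assumed available for audio loading and MFCC extraction

def timbre_model(path, n_mfcc=20):
    """Represent a song as a Gaussian (mean, covariance) over its MFCC frames."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    mu = mfcc.mean(axis=1)
    cov = np.cov(mfcc) + 1e-6 * np.eye(n_mfcc)  # regularize for invertibility
    return mu, cov

def kl_gaussian(m0, c0, m1, c1):
    """Closed-form KL divergence between two multivariate Gaussians."""
    d = len(m0)
    c1_inv = np.linalg.inv(c1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(c0)
    _, logdet1 = np.linalg.slogdet(c1)
    return 0.5 * (np.trace(c1_inv @ c0) + diff @ c1_inv @ diff - d + logdet1 - logdet0)

def timbre_distance(song_a, song_b):
    """Symmetrized KL divergence: smaller values mean more similar timbre."""
    ma, ca = timbre_model(song_a)
    mb, cb = timbre_model(song_b)
    return kl_gaussian(ma, ca, mb, cb) + kl_gaussian(mb, cb, ma, ca)

The symmetrization is needed because KL divergence is not symmetric, while a similarity search expects a distance-like quantity.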
Tangerine!
Figure 2.5: A snapshot of Tangerine!.
Tangerine!⁸ is a playlist generation tool for workouts by Potion Factory⁹ (Figure 2.5). It uses several song qualities: beats per minute (BPM), beat intensity, and personal song ratings. The user specifies the desired range of BPM and intensity and how long the playlist should last, as shown in Figure 2.6.

⁸ http://www.potionfactory.com/tangerine/
⁹ http://www.potionfactory.com/
Figure 2.6: Selecting BPM, beat intensity range, and playlist duration in Tangerine!.
Tangerine then automatically generates a random assortment of songs that meet the criteria. The user is also allowed to select workout patterns. For example, one can ramp up the workout and then cool down, use a series of high-intensity songs mixed with less-intense resting intervals, or just pick a random selection of songs with roughly the same characteristics, as shown in Figure 2.7.
Figure 2.7: Selecting a workout pattern in Tangerine!.
After setting the criteria, Tangerine generates a playlist and displays it for user in-
spection. It also provides close iTunes integration, including the ability to load and save
playlists to iTunes, get album art from iTunes, and export BPMs to iTunes.
2.4 Reinforcement Learning
Reinforcement learning in computer science, inspired by psychological theory, is a
sub-area of machine learning concerned with how an agent ought to take actions in
an environment so as to maximize some notion of long-term reward. Reinforcement
learning algorithms attempt to find a policy that maps states of the world to the actions
the agent ought to take in those states. Thus, reinforcement learning is particularly
well suited to problems which include a long-term versus short-term reward trade-off.
It has been applied successfully to various problems, including robot control, elevator
scheduling, telecommunications, backgammon and chess [30].
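The interaction loop described above can be sketched in a few lines. The following minimal illustration is ours, not taken from any of the cited systems; env and agent are assumed interfaces for the environment and the learning agent.

def run_episode(env, agent):
    """One episode of the agent-environment interaction loop.

    env.reset() returns an initial state; env.step(action) returns
    (next_state, reward, done). agent.act and agent.learn are assumed
    interfaces for the policy and the update rule, respectively.
    """
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                        # policy: state -> action
        next_state, reward, done = env.step(action)      # environment transition
        agent.learn(state, action, reward, next_state)   # update value estimates
        state = next_state
        total_reward += reward
    return total_reward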
In addition to typical approaches (e.g., content-based filtering and collaborative fil-
tering) for recommender systems [2], reinforcement learning has been used for recom-
mendation in several applications. WebWatcher [13] is a Web tour guide that exploits
Q-Learning to guide users to their desired pages: pages correspond to states and hyperlinks to actions, and rewards are computed based on the similarity between the page content and user profile keywords. Reinforcement learning is also used for online information filtering in [35], which maintains a profile for each user containing keywords of interest and updates each word's weight according to the implicit and explicit feedback received from the user. The proposed
learning method showed superior performance in information quality and adaptation
speed to user preference in online filtering. A general framework is presented in [1],
which consists of a database of recommendations generated by various models and a
learning module that updates the weight of each recommendation by user feedback. In
[29], a travel recommendation agent is introduced that considers various attributes of trips and customers, computes each trip's value with a linear function, and updates the function coefficients after receiving each user's feedback. The recommendation problem
is modeled as a Markov Decision Process (MDP) in [28]: the system's states correspond to the user's previous purchases, rewards are based on the profit achieved by selling the items, and recommendations are made using MDP theory and a novel
state-transition function. A reinforcement learning approach for usage-based web rec-
ommendation is presented in [31]. Instead of using the static patterns discovered from
web usage data, it learns to make recommendations based on the actions it performs
in each situation. The problem is solved with Q-Learning while employing concepts
commonly applied in the web usage mining domain.
Chapter 3
Emotion-based Personalized
Automatic Playlist Generation
The goal of our emotion-based personalized automatic playlist generation is to learn a
user's preference about music emotion from his/her listening behavior, and to recom-
mend appropriate songs in a playlist. In the following sections, we explicitly define
our problem and present the proposed solution.
3.1 Problem Definition
We start the chapter by defining the terminology. S denotes the system assigned to solve the emotion-based personalized APG problem. U denotes the user who interacts with system S. The music collection M = {m_1, m_2, ..., m_n} is a finite set of songs. An episode is defined as a one-time listening period which starts with opening the music player, continues with subsequently played songs, and ends with closing the music
player. The user operation set O = {replay, skip, rate} is a finite set of operations user U can perform during an episode. Table 3.1 gives a complete summary of the notations.
Notation     Definition                      Description
S                                            The system
U                                            The user
M            M = {m_1, m_2, ..., m_n}        The music collection
m            m ∈ M                           A song in the music collection M
O            O = {replay, skip, rate}        The user operation set
o            o ∈ O                           A user operation in the set O
e            e = (o, t)                      An operation entry
l            l = (m, t_s^m, t_e^m, Q)        A listening log
Q            Q = {e_1, e_2, ..., e_n}        A finite set of operation entries
H_t          H_t = (l_1, l_2, ..., l_k)      The listening history at time t
util(m_t)                                    The utility function

Table 3.1: Summary of the notations.
Definition 1. Operation Entry
An operation entry is a pair e = (o, t), where o ∈ O is the user operation, and t is the time of performing the user operation o.
Definition 2. Listening Log
A listening log is a quadruple l = (m, t_s^m, t_e^m, Q), where m ∈ M denotes the played song, t_s^m denotes the starting time of playing song m, t_e^m denotes the ending time of playing song m, and Q = {e_1, e_2, ..., e_n} denotes the finite set of operation entries.
For each individual playing of a song in an episode, a corresponding listening log
is recorded. The set of operation entries, Q, can contain zero or more entries.
Figure 3.1: A sample user scenario.
Definition 3. Listening History
A listening history H_t = (l_1, l_2, ..., l_k) at time t is a temporally ordered sequence of past listening logs, where l_i.t_s^m ≤ l_j.t_s^m ≤ t for all i ≤ j ≤ k.
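Definitions 1-3 translate directly into simple record types. The following sketch is our illustration; the class and field names are hypothetical, not part of the thesis.

from dataclasses import dataclass, field
from typing import List

@dataclass
class OperationEntry:
    """e = (o, t): a user operation and the time at which it was performed."""
    operation: str  # one of "replay", "skip", "rate"
    time: float

@dataclass
class ListeningLog:
    """l = (m, t_s, t_e, Q): one playing of song m with its operation entries."""
    song_id: str
    start_time: float
    end_time: float
    operations: List[OperationEntry] = field(default_factory=list)

@dataclass
class ListeningHistory:
    """H_t: listening logs kept in temporal order of their start times."""
    logs: List[ListeningLog] = field(default_factory=list)

    def add(self, log: ListeningLog) -> None:
        # Enforce the temporal ordering required by Definition 3.
        assert not self.logs or self.logs[-1].start_time <= log.start_time
        self.logs.append(log)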
In this thesis, we assume the interaction between the user U and our APG system S in a listening episode follows this scenario:

1. An episode starts with an initial song, which can be either user-specified or system-generated.

2. S dynamically plays the next recommended song m_i ∈ M in the playlist.

3. During the episode, U can perform any of the three operations o_j ∈ O to reveal his/her preference about the currently played song.

4. Each listening log l is recorded in the listening history H immediately.
Figure 3.1 illustrates a sample scenario.
In order to measure the effectiveness of a recommended song m_t by system S, the notion of a utility function should be defined. Different kinds of utility functions can be
Figure 3.2: The system architecture of the MEonPlay Automatic Playlist Recommender.
defined for effectiveness measurement (to be elaborated in Section 5.3).
Definition 4. Utility Function
Given a recommended song m_t at time t, the utility function util(m_t) is used to measure the effectiveness of the recommendation.
The specific problem our system is assigned to solve can be defined as follows:
Definition 5. The Problem
Given the music collection M, at time t, according to user U's current listening history H_t, system S has to dynamically generate the next song m_t ∈ M such that the utility function util(m_t) is optimized.
3.2 Proposed Solution
3.2.1 MEonPlay Automatic Playlist Recommender
In this thesis, we propose a system named the MEonPlay¹ Automatic Playlist Recommender to provide personalized automatic playlist generation on the basis of users' preferences for music emotion. Figure 3.2 illustrates the system architecture.
MEonPlay contains four major components: the Music Emotion Model (MEM), the Listening Log Repository (LLR), the User Preference Model (UPM), and the Emotion-based User Interface (EUI).
The Music Emotion Model (MEM) maintains the function f_me : M → E, which maps songs to their corresponding emotions. Given a song m ∈ M, the MEM is used to estimate the emotion of the song. Russell's two-dimensional (valence-arousal) model of emotions is adopted here to represent the emotion space E. The Listening Log Repository (LLR) maintains the user's current listening history H; the user's subsequent listening logs are added to H immediately.
Given the MEM and LLR, the User Preference Model (UPM) maintains the function f_hm : H_t → M, which maps the listening history at time t, H_t, to a recommended song m. It aims to approximate user U's real mapping f*_hm : H_t → M, which is unknown to our system S.
In addition to the existing graphical user interface of music players, MEonPlay
¹ The name MEonPlay is an abbreviation of "Music Emotion on Play", in which "ME" also involves the notion of "personalization".
provides an alternative navigation interface, the Emotion-based User Interface (EUI), to allow users to explore their music collection along the emotion dimension. The EUI provides a two-dimensional navigation plane which corresponds to Russell's valence-arousal model. The interaction between the user and the system is accomplished through the EUI, including user queries, system recommendations, and so on.
3.2.2 APG as a Reinforcement Learning Problem
As mentioned in Section 1.2, the objective of our system is to provide personalized automatic playlist generation. In our formulation, this means learning the mapping f_hm : H_t → M in the UPM to approximate the user's real preference f*_hm : H_t → M. In other words, our system aims to learn how to generate the right song m_t at time t, given the user's current listening history H_t.
Unlike the typical classification problem, supervised learning cannot be applied in
this case, since we do not have the exact answer in every situation or even a static
one. The listening history Ht will evolve over time. Therefore, more importantly, our
system has to learn how to recommend a song under a certain situation (that is, the
listening history H_t at time t) so that the long-term results of recommendation (e.g., an
entire episode) can be optimized. Moreover, in the scenario of our APG problem, the
user's feedback (replay, skip, or rate) serves as a natural indicator to guide and reinforce
learning. Thus, we adopt reinforcement learning for learning the user's preference.
As stated in Section 2.4, reinforcement learning [30] is a research field of machine learn-
ing which aims to learn which actions an agent should take in an environment so as to
maximize long-term reward. In each step, the agent takes an action a and transitions
Figure 3.3: The agent-environment interaction in reinforcement learning.
from a state s to another state s′. After each transition, the agent receives a reward.
Figure 3.3 diagrams the concept of the agent-environment interaction in reinforcement
learning. The agent aims to learn a policy that defines which action should be taken in
each state in order to receive the greatest cumulative reward along the path to the
goal state. Instead of static recommendation, we argue that it is more appropriate to
consider the APG problem as a sequential optimization problem, as the sequential na-
ture of the recommendation process takes into account not only the utility of a particular
recommendation but also the long-term reward [28]. In particular, because our APG system is concerned with learning how to make recommendations in each situation according to the user's feedback, reinforcement learning provides appropriate learning
methods. In the following chapter, we formulate the APG task as a reinforcement learn-
ing problem and illustrate the states, actions, reward function, and the training process
in our formulation.
Chapter 4
Emotion-Based Adaptive Preference
Model
To provide a personalized solution under the scenario described in the previous chap-
ter, we argue that it is more appropriate to consider APG as a continuous optimization
problem. In this chapter, we first describe the dataset used in the Music Emotion Model (MEM). We then propose an emotion-based adaptive preference model that adopts reinforcement learning methods to learn the User Preference Model (UPM) of music emotion within a playlist. Hypothetical users are created for fine-tuning the learning parame-
ters in our model.
4.1 Annotated Music Dataset
In this section, we describe the compilation of the annotated music dataset named
POP500, and its specific usage in our model.
4.1.1 Songs
The POP500 dataset contains 526 Chinese pop songs by 76 artists released between
2002 and 2008. We separately collect each song's audio track and lyrical text, which
are used in the annotation procedure (to be elaborated later).
4.1.2 Participants
More than 400 participants contribute to the annotation of the POP500 dataset. Most of
them are 20 to 30-year-old college students who major in diverse disciplines. All the
participants are pre-trained to fully understand the experiment for annotation, including
the emotion model used and the annotation process (to be elaborated later).
4.1.3 Music Emotion Model
Russell's Valence-Arousal (V-A) emotion plane model [27] is adopted in the dataset. Instead of using continuous two-dimensional vectors, each axis (valence/arousal) is partitioned into five discrete values (−2, −1, 0, 1, 2), with 2 being the highest (pleasure in valence, energetic in arousal), −2 being the lowest (displeasure in valence, tired in arousal), and 0 representing a neutral response. During the annotation process, the participants are required to give V-A ratings to songs in the domain of these five discrete values.
4.1.4 Annotation Process
To further examine the individual contribution of lyrical text and music track to music
emotion, we separately collect the emotion ratings of the two parts. In the annotation
process, we define three different sessions for collecting emotion ratings of lyrical text
only (L), music track only (M), and both (ML). The participants can only see the lyrical text in the L session and hear only the music track in the M session, while they can perceive both sets of information in the ML session. Note that in the M session, we do not remove the vocals from the music track, since we consider the vocals an important component of a music track for conveying emotion. In each session, the participants are asked to give a V-A rating to the song. Besides the ratings, additional information is also collected in the process, including the familiarity and preference level of a song. Figure 4.1 shows a snapshot of the user interface for annotation in the three sessions.
(a) The L session (Chinese) (b) The M session (Chinese) (c) The ML session (Chinese)
Figure 4.1: The user interface for annotation in the three sessions. A two-dimensional V-A plane, partitioned into five discrete regions on each axis, is provided for rating the emotion. The participant can directly select a region to specify the V-A values.
Quadrant      (+, +)   (−, +)   (−, −)   (+, −)
Class         I        II       III      IV
Number        213      212      52       49
Percentage    40%      40%      10%      10%

Table 4.1: Number of songs and percentage across the four emotion classes (refer to Figure 4.2).

Figure 4.2: Four emotion classes categorized by the corresponding quadrants on the V-A plane.
4.1.5 Usage of the POP500 Dataset
Each song in the dataset is associated with an emotion rating (V-A value) that was manually annotated by pre-trained users. In our system, we categorize all the songs in the dataset into four classes according to the quadrant they belong to. Table 4.1 shows the number of songs and percentages across the four emotion classes in the dataset.
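As a concrete illustration of this categorization, the sketch below (ours, not from the thesis) maps a V-A rating to one of the four quadrant classes; ratings of exactly 0 lie on an axis, and folding them into the non-negative side is an arbitrary tie-breaking assumption.

def emotion_class(valence: float, arousal: float) -> int:
    """Map a V-A rating (each value in [-2, 2]) to quadrant classes I-IV.

    Class 1: (+, +), 2: (-, +), 3: (-, -), 4: (+, -); zeros are treated
    as non-negative, an arbitrary tie-breaking choice.
    """
    if valence >= 0:
        return 1 if arousal >= 0 else 4
    return 2 if arousal >= 0 else 3

# A pleasant, energetic song falls in class I; a gloomy, tired one in class III.
assert emotion_class(1, 2) == 1
assert emotion_class(-2, -1) == 3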
4.2 Preference Modeling with Reinforcement Learning
Reinforcement learning is adopted here to learn a personalized preference model in
the UPM for playlist generation. In the following sections, we explicitly describe the prob-
lem formulation, learning methods, and parameter tuning.
4.2.1 Modeling States and Actions
To learn the user's preference for transition patterns of music emotion, the states should represent the user's accumulated and current listening experience, i.e., the past listening sequence. One possible approach is to allow the states to contain any sequence of music emotions the user has listened to so far. Of course, with this approach we would face an infinite number of states, which would make it difficult for the learning process to converge.
In our approach, we restrict the state space in the learning process to a manageable size. For this purpose, we adopt the N-gram model, which is commonly used in natural language processing to predict the next item in a sequence. In our model, a sliding window of size m over the user's listening history is used. That is, a state contains only the last m songs' emotion classes. The assumption behind the N-gram model is that knowing only the last m songs' emotion classes gives us enough information to predict the user's next preferred music emotion. With the states defined, we define the action as choosing an emotion class. Taking an action moves the sliding window ahead one step and updates the state. Figure 4.3 illustrates an example. One tricky
issue worth considering is that if an action turns out to be a bad choice (the user does
not like the recommended song), should we move the sliding window anyway or leave
the state unchanged? In our experiment, we simply leave the state unchanged.
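A minimal sketch of this state-action encoding (our illustration): the state is the tuple of the last m emotion classes, a taken action appends a class, and a rejected recommendation leaves the state untouched.

from collections import deque
from typing import Tuple

class EmotionState:
    """Sliding-window (N-gram) state over the last m emotion classes."""

    def __init__(self, m: int, seed_class: int):
        self.window = deque([seed_class], maxlen=m)

    def as_state(self) -> Tuple[int, ...]:
        return tuple(self.window)

    def apply_action(self, emotion_class: int, accepted: bool) -> Tuple[int, ...]:
        # Only advance the window when the user accepts the recommendation;
        # a rejected song leaves the state unchanged, as in our experiment.
        if accepted:
            self.window.append(emotion_class)
        return self.as_state()

# Example with m = 2: a seed song in class 1, then an accepted class-3 song.
s = EmotionState(m=2, seed_class=1)
assert s.apply_action(3, accepted=True) == (1, 3)
assert s.apply_action(2, accepted=False) == (1, 3)  # skipped: state unchanged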
4.2.2 Designing the Reward Function
Using a reward function to measure the progress towards the goal is one characteristic
of reinforcement learning. In our APG problem, we use two reward function types:
Figure 4.3: State and action definition with window size m = 3. The initial state starts with a song with emotion a. Taking an action moves the sliding window ahead one step and changes the state.
implicit feedback (F_I) and explicit feedback (F_E). Implicit feedback includes the listening time (LT) and the number of replays (NR), both of which may reflect the user's appreciation of the recommended song. Also, users are allowed to rate each recom-
mended song and we use this user rating (UR) as the explicit feedback. The overall
reward function can be summarized as follows:
$$r_t = \delta R^E_t + (1 - \delta) R^I_t, \qquad 0 \le \delta \le 1 \qquad (4.1)$$
where R^E_t is the explicit feedback and R^I_t is the implicit feedback at time t. The parameter δ controls the relative contribution of each type of feedback. We define two components in the reward function based on the assumption that their relative significance differs: explicit feedback might provide more
accurate information, while implicit feedback might contain more noise. The individ-
ual feedback is computed as follows:
$$R^E_t = \sum_{i \in F_E} c_i f_{it}, \qquad R^I_t = \sum_{j \in F_I} c_j f_{jt}$$

where F_E = {UR} is the set of explicit feedback, F_I = {LT, NR} is the set of implicit feedback, f_it is the score of feedback i at time t, and c_i and c_j stand for the weights of explicit and implicit feedback, respectively.
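Equation 4.1 transcribes directly into code. The sketch below is ours; the feedback scores are assumed to be pre-normalized to [0, 1], and the weight values are illustrative rather than taken from the thesis.

def reward(explicit: dict, implicit: dict, weights: dict, delta: float = 0.5) -> float:
    """r_t = delta * R^E_t + (1 - delta) * R^I_t (Equation 4.1).

    explicit maps explicit-feedback names to scores, e.g. {"UR": 0.8};
    implicit likewise, e.g. {"LT": 0.9, "NR": 0.1}. weights holds the c_i.
    """
    r_explicit = sum(weights[k] * v for k, v in explicit.items())
    r_implicit = sum(weights[k] * v for k, v in implicit.items())
    return delta * r_explicit + (1 - delta) * r_implicit

# Example: a highly rated song, mostly listened through, replayed once.
r = reward({"UR": 0.8}, {"LT": 0.95, "NR": 1.0},
           weights={"UR": 1.0, "LT": 0.7, "NR": 0.3}, delta=0.5)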
4.2.3 Solving APG with Temporal-Difference Learning
Several solution methods can be used for solving the reinforcement learning problem.
Temporal-difference (TD) learning is one of the fundamental methods concerning the
notion of delayed reward. The idea of delayed reward is that taking a specific action
may affect not only the immediate reward but also the next situation and so on. More-
over, the reward might not occur in every step, and might even be unpredictable. An
analogy of the delayed reward is to prepare for a midterm exam. A student might study
hard every day for a month before the midterm. During the one-month preparation, the
student might not receive any kind of reward. The actual reward occurs one month later
(after passing the midterm). However, we cannot totally ignore the effort of studying
hard during the month. That is, all of the actions taken beforehand should somehow
share a portion of the delayed reward. Therefore, TD learning takes the mechanism of
propagating the delayed reward back to the previous states into consideration.
In addition, temporal-difference learning combines the ideas of Monte Carlo methods and
Dynamic Programming (DP). That is, it not only can directly learn from experience
without knowing the model of the environment's dynamics, but also updates estimates
based in part on other learned estimates, without waiting for a final outcome (a process
known as bootstrapping).
In the APG problem, TD learning serves as a well-suited method to solve the prob-
lem, since we do not have the exact model of the environment dynamics beforehand,
e.g., the exact reward of performing an action in a state, and we want the learning to be
as fast as possible. Algorithm 1 specifies TD learning in procedural form. Basically, it is composed of a nested loop. For each step (inner loop) in each new episode (outer loop), four major actions are performed: (1) choose an action for the state given the policy; (2) take the action; (3) observe the resulting reward and the next state; and (4) update the action-value function (Q-value).
Algorithm 1 Temporal-difference learning
 1: Initialize Q(s, a) arbitrarily, π to the policy to be evaluated
 2: repeat {for each new episode}
 3:   Initialize s
 4:   repeat {for each step in the episode}
 5:     Choose action a given by π for s
 6:     Take action a
 7:     Observe reward r and next state s′
 8:     Update Q(s, a)
 9:     s ← s′
10:   until s is terminal
11: until the end of learning
In our approach, two TD learning methods are used: Q-Learning and SARSA. Both
methods are primarily concerned with estimating the value of performing any action
in each state, known as the action-value function (Q-value). Their main difference is
that Q-Learning is an off-policy control method and SARSA is an on-policy control
method. We use the following update rules:
$$\text{SARSA:} \quad Q(s_t, a_t) \leftarrow (1 - \alpha_n) Q(s_t, a_t) + \alpha_n [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})] \qquad (4.2)$$

$$\text{Q-Learning:} \quad Q(s_t, a_t) \leftarrow (1 - \alpha_n) Q(s_t, a_t) + \alpha_n [r_{t+1} + \gamma \max_a Q(s_{t+1}, a)] \qquad (4.3)$$
Algorithm 2 APG with the SARSA method
 1: Initialize Q(s, a) arbitrarily
 2: repeat {for each episode}
 3:   Initialize s
 4:   Choose a from s using the ε-greedy policy
 5:   repeat {for each step in the episode}
 6:     Take action a, generate next song with g(a)
 7:     Observe r and next state s′
 8:     Choose a′ from s′ using ε-greedy
 9:     Update Q with (4.2)
10:     s ← s′; a ← a′
11:   until s is terminal
12: until the end of learning
where $\alpha_n = \dfrac{1}{1 + \mathrm{VisitCount}(s, a)}$
Here s_t, a_t, and r_t denote the state, action, and reward at time t, respectively; α_n is the learning rate and γ is the discount factor. VisitCount(s, a) represents the number of visits to Q(s, a). The decreasing value of α_n helps the Q-value gradually converge.
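The two update rules differ only in their bootstrap target: SARSA uses the action actually chosen next, while Q-Learning uses the greedy maximum. The tabular sketch below is ours; the discount factor value is illustrative, and VisitCount counts prior visits as in the formula above.

from collections import defaultdict

class TDLearner:
    """Tabular SARSA / Q-Learning with a visit-count-based learning rate."""

    def __init__(self, actions, gamma=0.9):
        self.Q = defaultdict(float)      # (state, action) -> action value
        self.visits = defaultdict(int)   # (state, action) -> prior visit count
        self.actions = actions
        self.gamma = gamma

    def update(self, s, a, r, s_next, a_next=None):
        """Apply Eq. 4.2 when a_next is given (SARSA), else Eq. 4.3 (Q-Learning)."""
        if a_next is not None:
            target = r + self.gamma * self.Q[(s_next, a_next)]
        else:
            target = r + self.gamma * max(self.Q[(s_next, b)] for b in self.actions)
        alpha = 1.0 / (1 + self.visits[(s, a)])  # alpha_n = 1 / (1 + VisitCount)
        self.visits[(s, a)] += 1
        self.Q[(s, a)] = (1 - alpha) * self.Q[(s, a)] + alpha * target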
The online and incremental procedure of learning and playlist generation can be
summarized in Algorithms 2 and 3:
Algorithm 2 provides the pseudocode of our APG solution utilizing the SARSA method, where s and s′ in lines 3 and 7 denote the current state and next state, a and a′ in lines 4 and 8 denote the actions, and r in line 7 denotes the received reward. Q(s, a) represents the action-value function of performing action a in state s. g(a) is the function used to select a song given an action a, which is the chosen emotion in our definition (to be elaborated later). As in all on-policy methods, we continually estimate Q^π for the
Algorithm 3 APG with the Q-Learning method
 1: Initialize Q(s, a) arbitrarily
 2: repeat {for each new episode}
 3:   Initialize s
 4:   repeat {for each step in the episode}
 5:     Choose a from s using the ε-greedy policy
 6:     Take action a, generate next song with g(a)
 7:     Observe r and next state s′
 8:     Update Q with (4.3)
 9:     s ← s′
10:   until s is terminal
11: until the end of learning
behavior policy π, and at the same time change π toward greediness with respect to Q^π.
Algorithm 3 provides the pseudocode of our APG solution utilizing the Q-Learning method, where s and s′ in lines 3 and 7 denote the current state and next state, a in line 5 denotes the action, and r in line 7 denotes the received reward. Q(s, a) represents the action-value function of performing action a in state s. g(a) is the function used to select a song given an action a, which is the chosen emotion in our definition (to be elaborated later). In this case, the learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed.
Since the action we defined is not which song should be recommended but which
song emotion should be recommended, a rule, g(a), deciding how to select a song from
an emotion category is required. In our method, we just randomly select one. Thus, the
APG system dynamically recommends the next song according to the current Q-values and the policy followed. Here we use the ε-greedy policy. That is, most of the time the action with the optimal Q-value in a state is taken, but with a small probability ε a random action is taken to help explore other states and converge. We use an exponentially decayed ε value:
$$\epsilon_t = \epsilon_0 e^{-\lambda t}, \qquad \lambda > 0 \qquad (4.4)$$
Here ε_t is the ε value at time t, ε_0 is the initial value, and λ is the decay constant. Selecting a higher ε at the beginning and reducing it over time helps find the optimal action earlier while still obtaining good rewards in the long term.
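For concreteness, an ε-greedy action selector with the decay schedule of Equation 4.4 might look as follows (our sketch; ε_0 = 0.5 is an illustrative initial value, while λ = 0.1 matches the value chosen in Section 4.2.4).

import math
import random

def epsilon(t: int, eps0: float = 0.5, lam: float = 0.1) -> float:
    """Exponentially decayed exploration rate (Equation 4.4)."""
    return eps0 * math.exp(-lam * t)

def choose_action(Q, state, actions, t):
    """Epsilon-greedy: explore with probability eps_t, otherwise act greedily.

    Q is a mapping from (state, action) to value, e.g. a defaultdict.
    """
    if random.random() < epsilon(t):
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])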
4.2.4 Parameter Selection with Simulation
Before being evaluated by real users, we first use hypothetical users (HU) to evaluate
our approach under different settings, including various window sizes m in the state
definition and varying parameters in learning methods. Two different HUs, one with
simple behavior and the other with more complicated preferences, are defined as fol-
lows:
HU-1 The user always prefers songs with the same emotion class as the seed song.
Any song with a different emotion class is skipped.
HU-2 The user usually prefers a smooth emotion transition between songs in the playlist, but with probability 10% chooses randomly. Here, a smooth emotion transition means that the emotion classes of any two contiguous songs in the playlist differ in sign along at most one dimension of the V-A plane, e.g., a transition from class I to class II or to class IV is smooth, but a transition to class III is not.
In the hypothetical cases, only the implicit feedback is considered.
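The HU-2 acceptance rule can be made precise: two quadrant classes are smooth neighbors when their V-A signs differ along at most one axis. The sketch below is ours; interpreting "choose randomly" as a fair random accept/reject is an assumption.

import random

# Quadrant class -> (valence sign, arousal sign)
SIGNS = {1: (1, 1), 2: (-1, 1), 3: (-1, -1), 4: (1, -1)}

def smooth(c1: int, c2: int) -> bool:
    """Contiguous songs may differ in sign along at most one V-A axis."""
    return sum(a != b for a, b in zip(SIGNS[c1], SIGNS[c2])) <= 1

def hu2_accepts(prev_class: int, next_class: int) -> bool:
    """HU-2 usually demands a smooth transition, but acts randomly 10% of the time."""
    if random.random() < 0.1:
        return random.choice([True, False])
    return smooth(prev_class, next_class)

# Class I -> II flips only the valence sign (smooth); I -> III flips both (not smooth).
assert smooth(1, 2) and not smooth(1, 3)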
Figure 4.4: Learning curve of HU-1 (simple case). The horizontal axis represents the number of episodes; the vertical axis represents the Normalized Root Mean Square Error (NRMSE). Q-Learning converges much faster than SARSA, and the learning curves of SARSA show a more obvious fluctuation. In addition, choosing a different window size m can affect the time to convergence.
Different Window Sizes
In the state definition, we adopt the N-gram model with a sliding window of size m over the user's listening history. The window size can be critical to our system's performance: increasing m might provide more information, but it also enlarges the state space and greatly increases the convergence time. For each HU, we automatically generate 200 listening episodes, each containing 20 songs, with the two learning methods under different window sizes m = 1, 2, 3. Each learning curve is averaged over 100 runs. Figures 4.4 and 4.5 illustrate the learning curves of HU-1 and HU-2, respectively.
The results first show that Q-Learning converges much faster than SARSA in both
HUs, and since the convergence properties of SARSA are related to the policy's depen-
dence on Q, the learning curves show a more obvious fluctuation. Second, in HU-2,
Figure 4.5: Learning curve of HU-2 (complicated case). The horizontal axis represents the number of episodes; the vertical axis represents the Normalized Root Mean Square Error (NRMSE). Q-Learning converges much faster than SARSA, and the 10% random behavior makes the NRMSE higher and the fluctuation more obvious for both learning methods. In addition, choosing a different window size m can affect the time to convergence.
a more complicated case than HU-1, the 10% probability randomness causes the Nor-
malized Root Mean Square Error (NRMSE) to become higher and the fluctuation more
obvious in both learning methods. Third, choosing different window sizes m can affect the convergence time. For example, with a tri-gram (m = 3), even for the simpler HU-1, more than 100 episodes are required for Q-Learning to converge to an NRMSE below 0.1. This might make it difficult to apply the learning algorithm in practice. Thus, choosing an appropriate window size is critical in our approach. In our experiment, we choose a window size of 2 to balance convergence time and modeling ability.
Different decay constants in the ε-greedy policy

The decay constant in Equation (4.4) controls the decreasing rate of the ε value in the ε-greedy policy. A higher value of the decay constant causes the policy to follow the
currently optimal actions sooner, but might not explore all the states enough times. On the other hand, a lower value gives the policy a higher probability of exploring all the states, but might not follow the optimal actions immediately. To determine an appropriate value of the decay constant, we gradually change the value for both HUs and learning methods, and then observe the average reward per episode.
                  λ = 0.01   0.05   0.10   0.15   0.20
HU-1  SARSA           68.9   93.4   96.1   93.9   91.5
      Q-Learning      69.0   93.4   96.2   94.9   91.5
HU-2  SARSA           76.1   79.2   81.4   82.3   81.2
      Q-Learning      76.5   80.6   81.5   82.6   82.0

Table 4.2: Average reward per episode (normalized to the range [0, 100]) with different λ for both HUs and both methods.
Table 4.2 shows that for both HUs and both methods, increasing λ initially leads to a higher average reward. However, after a peak, the average reward starts to drop, as expected. In our experiment, we choose λ = 0.1.
Chapter 5
Experimental Evaluation
To best evaluate our approach to APG, we deployed a real-user experiment for two months and compared different methods with several evaluation metrics. The
experiment details and results are elaborated in the following sections.
5.1 The Participants
There were a total of five participants in the experiment. Most of them were college students aged 20 to 30.
5.2 Experiment Design
The experiment was divided into training and testing phases. In the training phase,
participants were required to listen to at least 20 songs in each single episode, and
follow the scenario (described in Section 3.1) to give the system feedback during the
listening period. In addition, participants were asked to describe the listening context, i.e., the activity while listening to the music, of each listening episode. Both methods, Q-Learning and SARSA, were applied to learn participants' preferences for music emotion transitions.
In the testing phase, each episode started with a seed song chosen by the participant. Then one of the three methods, Shuffle, SARSA, and Q-Learning, was randomly selected to generate the next song in the playlist. Participants followed the same scenario (described in Section 3.1) as in the training phase. An episode ended when the length of the playlist reached 20. Shuffle (random generation) was chosen as the baseline approach for comparison, since it is one of the most common APG methods provided in modern music players (e.g., the iPod). Participants were not aware of the exact underlying method and listened to the music as usual. Besides providing all the usual feedback during an episode, participants were asked to give a rating (from 1 to 5) to each generated playlist for later evaluation.
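The testing-phase protocol can be summarized in the following Python sketch. The stub recommender and simulated rating are hypothetical placeholders; only the protocol itself (participant-chosen seed song, a method drawn at random per episode, 20-song playlists, an explicit 1-to-5 rating) follows the description above.

```python
import random

PLAYLIST_LENGTH = 20
METHODS = ["Shuffle", "SARSA", "Q-Learning"]

def generate_next_song(method, playlist):
    """Hypothetical stand-in for the real MEonPlay recommenders."""
    return f"song-{len(playlist)}"

def run_test_episode(seed_song):
    """One testing-phase episode; the participant never sees `method`."""
    method = random.choice(METHODS)
    playlist = [seed_song]
    while len(playlist) < PLAYLIST_LENGTH:
        playlist.append(generate_next_song(method, playlist))
        # In the real study, skip/replay feedback is logged per song here.
    rating = random.randint(1, 5)  # stub for the participant's rating
    return method, playlist, rating

method, playlist, rating = run_test_episode("seed song")
print(method, len(playlist), rating)
```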
Starting from the training phase, each participant waited until the number of training samples exceeded a certain threshold (20 in our experiment) before entering the testing phase. On average, an episode with 20 songs lasted more than one hour, and participants performed fewer than two episodes per day. Therefore, the whole user experiment took around two months to reach a preliminary result. Figure 5.1 shows the histogram of the episode count of each user in the training phase and the testing phase, respectively.
To spare participants the complexity of learning a new user interface, we implemented our experiment as a plug-in for Apple iTunes. Thus, the participants could provide more accurate information while using the familiar iTunes music player. Figure 5.2 shows a snapshot
Figure 5.1: The histogram of the episode count of each user in the training phase and the testing phase, respectively.
of the experiment interface. Two playlists, training and testing, are automatically created in the training phase and the testing phase, respectively.
5.3 Evaluation Metrics
We use the following four metrics to evaluate each approach:
5.3.1 Miss Ratio
Miss Ratio measures the percentage of unsuccessfully recommended songs. It is calculated as follows:

\[
\text{Miss Ratio} = \frac{1}{N}\sum_{i=1}^{N}\frac{skip(i)}{n} \qquad (5.1)
\]
Figure 5.2: A snapshot of the experiment interface.
where skip(i) denotes the number of skips during episode i, n is the length of the playlist (n = 20 in our experiment), and N is the total number of episodes.
5.3.2 Miss-to-Hit(k)
Miss-to-Hit(k) measures, on average, how many skips are required to obtain k successful recommendations. It is defined as follows:
\[
\text{Miss-to-Hit}(k) = \left(\frac{1}{N}\sum_{i=1}^{N}\frac{skip(i)}{n - skip(i)}\right)\cdot k \qquad (5.2)
\]
5.3.3 Listening-Time Ratio
Listening-Time Ratio is the average ratio of a user's actual listening time to the total length of a song, and is defined as follows:
\[
\text{Listening-Time Ratio} = \frac{1}{M}\sum_{m=1}^{M}\frac{LT(m)}{TT(m)} \qquad (5.3)
\]
where LT(m) denotes the listening time of song m, TT(m) denotes the total time of song m, and M is the number of played songs.
5.3.4 User Rating
User Rating is the average of the playlist ratings explicitly given by users:
\[
\text{User Rating} = \frac{1}{N}\sum_{i=1}^{N} rating(i) \qquad (5.4)
\]
where rating(i) denotes the rating given to the playlist of episode i.
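To make the four metrics concrete, the following Python sketch computes them from per-episode logs. The log layout, plain lists of skip counts, listening and total times, and ratings, is a hypothetical representation of the data the study collects.

```python
def miss_ratio(skips, n=20):
    """Eq. (5.1): average fraction of skipped songs per episode."""
    return sum(s / n for s in skips) / len(skips)

def miss_to_hit(skips, k=20, n=20):
    """Eq. (5.2): average skips needed for k successful recommendations."""
    return sum(s / (n - s) for s in skips) / len(skips) * k

def listening_time_ratio(listen_times, total_times):
    """Eq. (5.3): average ratio of listening time to song length."""
    ratios = [lt / tt for lt, tt in zip(listen_times, total_times)]
    return sum(ratios) / len(ratios)

def user_rating(ratings):
    """Eq. (5.4): average explicit playlist rating."""
    return sum(ratings) / len(ratings)

# Hypothetical logs from three episodes / a handful of songs.
print(miss_ratio([3, 5, 2]))                        # ~0.167
print(miss_to_hit([3, 5, 2], k=20))                 # ~4.14
print(listening_time_ratio([120, 30], [240, 200]))  # 0.325
print(user_rating([3, 4, 5]))                       # 4.0
```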
5.4 Evaluation Results
Table 5.1 lists the mean and standard deviation of the Listening-Time Ratio for the Shuffle, SARSA, and Q-Learning methods. The Q-Learning method outperforms the Shuffle and SARSA methods, showing a higher mean and a much lower standard deviation, while SARSA and Shuffle perform similarly. Table 5.2 lists the mean and standard deviation of the User Rating for the three methods. The result mirrors the previous one, except that SARSA achieves a slightly better mean user rating than the Shuffle
method. The reason why SARSA did not perform as well as the Q-Learning method may be its longer convergence time and its fluctuation, as described in Section 4.2.4.
       Shuffle   SARSA    Q-Learning
Mean   80.56%    79.83%   82.91%
STD    14.01%    14.31%    7.68%
Table 5.1: The mean and standard deviation of the Listening-Time Ratio of the Shuffle, SARSA, and Q-Learning methods.
       Shuffle   SARSA   Q-Learning
Mean   2.67      2.83    3.32
STD    0.94      1.09    0.65
Table 5.2: The mean and standard deviation of the User Rating of the Shuffle, SARSA, and Q-Learning methods; scoring range: [1, 5].
Figure 5.3 shows the results for the Miss Ratio. The Miss Ratio of both SARSA and Q-Learning is better than that of Shuffle; in particular, Q-Learning outperforms Shuffle by 10%. Figure 5.4 shows the result for Miss-to-Hit(20), where the advantage becomes even larger. To generate a satisfactory playlist of length 20, users have to skip 9 songs on average with Q-Learning, but nearly 15 songs with Shuffle. That is, to obtain one successfully recommended song, Shuffle users have to expend more than 1.5 times the effort of Q-Learning users (15/9 ≈ 1.67).
A major difference between a single recommendation and automatic playlist generation is that APG is concerned with listening continuity within a playlist. To see the
Figure 5.3: The Miss Ratio of Shuffle, SARSA, and Q-Learning methods.
Figure 5.4: The Miss-to-Hit(20) of Shuffle, SARSA, and Q-Learning methods.
improvement in this aspect, we count the number of non-interrupted songs in a playlist (the number of songs played continuously without an interposed skip operation) for each method. Figure 5.5 shows the distribution of run lengths of continuous songs for each method. In all three methods, runs of two continuous songs dominate. However, in the SARSA and Q-Learning methods, the percentage of longer runs is clearly higher than in the Shuffle method. Moreover, long runs (> 10 songs) occur more often with SARSA and Q-Learning than with the Shuffle
Figure 5.5: The Continuous Play of Shuffle, SARSA, and Q-Learning methods.
method. For example, with Q-Learning the maximum number of continuous songs is 13, whereas with Shuffle a run longer than 9 songs never occurs.
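Counting these runs is straightforward; this sketch derives run lengths from a hypothetical per-song event log in which True marks a song played through and False a skip.

```python
def continuous_runs(events):
    """Return lengths of runs of songs played without an interposed skip."""
    runs, current = [], 0
    for played in events:
        if played:
            current += 1
        else:
            if current > 0:
                runs.append(current)
            current = 0
    if current > 0:
        runs.append(current)
    return runs

# Example: a 2-song run, a skip, then a 3-song run.
print(continuous_runs([True, True, False, True, True, True]))  # [2, 3]
```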
The above results show the average performance of the three methods. To further examine how our approach applies to different users and contexts, we categorized the data into two groups according to the listening context given by the users. Figure 5.6 shows the Miss-to-Hit Ratio of two users under the two contexts, respectively.
The results show that, under the same context, our approach may perform differently for different users. For example, under the Working context, SARSA and Q-Learning save User 1 nearly half the effort of the Random (Shuffle) method, whereas the context makes no obvious difference for User 2. In addition, the results show that, for a specific user, our approach may perform differently under different contexts. For example, for User 2, our approach provides little improvement under the Working context, but significantly outperforms the Random method under the Leisure context, especially
Figure 5.6: The Miss-to-Hit Ratio of User 1 and User 2 under the Working and Leisure contexts, respectively.
with the Q-Learning method. Based on these observations, the results suggest that listening context is a critical, and user-dependent, factor for APG systems.
To sum up, on average our approach to personalized APG shows a clear improvement over the baseline approach. In particular, the Q-Learning method outperforms the other methods in most of the evaluation metrics, which also corresponds to the simulation results with the hypothetical users. Moreover, the observed influence of listening context may serve as a preliminary investigation for future emotion-based APG systems.
Chapter 6
Conclusion
Organizing and searching songs by emotion has become a novel and natural trend in music exploration. Inspired by this trend and the corresponding requirements of music recommendation, this thesis proposes an emotion-based adaptive preference model for automatic playlist generation (APG). We argue that it is more appropriate to consider the APG problem as a continuous optimization problem. First, we explicitly describe the user scenario and formally define the problem to be solved. MEonPlay, an emotion-based personalized APG system, is proposed as a solution. Reinforcement learning is adopted to model the emotion-based personalized APG problem, and we describe how we formulate it as a reinforcement learning problem, including the state and action definitions and the reward function. Because temporal-difference (TD) learning conforms well to the characteristics of the problem, two TD learning methods, SARSA and Q-Learning, are applied to solve it. Furthermore, we create hypothetical users to help fine-tune the learning models' parameters, including the window size in the state definition and the decay constant in the ε-greedy policy. At the
end, we conduct a two-month user study to evaluate our approach. Several evaluation metrics are defined to measure the success of our playlist recommendation. The results show that the Q-Learning approach outperforms the SARSA and Shuffle (baseline) methods in both the listening-time ratio and the user rating measurements. In the miss-ratio and miss-to-hit(k) measurements, both Q-Learning and SARSA show superior performance over Shuffle. The continuous-play measurement provides preliminary evidence that our approach effectively extends users' non-interrupted play of songs. A final observation also indicates a relationship between music emotion and listening context.
6.1 Summary of Contributions
In contrast to previous research on automatic playlist generation, our contributions can be summarized as follows. First, instead of using metadata or audio similarity, we approach the APG problem from a different perspective: following the emerging trend and its requirements, we generate playlists based on song emotions. Second, we consider the APG problem as a continuous optimization problem and propose the MEonPlay system for emotion-based personalized APG; our novel emotion-based adaptive preference model can be utilized in future personalized APG systems. Third, several evaluation metrics are defined for measuring the success of an APG system, and a real user study is conducted to validate our approach.
6.2 Future Work
This thesis presents preliminary results for a purely emotion-based APG system. The factors that determine a user's listening preferences are complicated and user-dependent, and how best to leverage these multi-dimensional factors in our approach can be further examined in the future. In addition, reducing the convergence time of learning is an important issue for applying our approach in real situations. A future study may investigate methods to speed up both the learning and the adaptation process.