
Organised Evaluation in (Music) Information Retrieval: TREC and MIREX

Anna Pienimäki

Department of Computer Science, University of Helsinki

    [email protected]

Abstract. Evaluation is an important component of the information retrieval field. In the area of text retrieval, the annual TREC evaluation contest provides a venue where evaluation can be done in an organised fashion. Following the example of TREC, both the video and music retrieval communities have started to build their own evaluation contests, addressing mostly content-based retrieval problems. In this paper we briefly present the history and development of organised evaluation in (music) information retrieval. We discuss the organisation and the main properties of TREC and introduce the MIREX evaluation contest designed for evaluating music information retrieval methods. Moreover, we discuss in more detail how such evaluation contests are organised in practice, using MIREX as an example.

    1 Introduction

Evaluation is an important but often slightly neglected component of the information retrieval (IR) field. Organised evaluation campaigns or contests for text retrieval methods have become more common within the last 10 years in the form of TREC (Text REtrieval Conference). However, multimedia retrieval communities in particular, such as the music information retrieval (MIR) community, still lack both a common methodology and evaluation data for evaluating, in an organised fashion, methods developed for answering content-based information needs.

Organised evaluation is evaluation that is done in a centralised fashion, using common test collections to evaluate and compare the results of several retrieval algorithms performing the same task. Organised evaluation has two important advantages: comparability of the evaluated methods and access to realistically sized test collections built, maintained, and funded by the central organisations. Moreover, it offers a forum for discussing how the compared methods differ and how they could be developed further.

In this paper we discuss the history and development of organised evaluation in (M)IR, concentrating on the organisational side of such contests. In Section 2 we introduce the concept of content-based MIR and how it differs from text-based IR. In Section 3 we give a brief overview of the history and development of evaluation in IR and introduce TREC and its main features. The MIR evaluation contest MIREX is discussed in more detail in Section 4. Section 5 concludes the paper.


    2 An Introduction to MIR

MIR is a relatively new research area that addresses the question of finding and retrieving music documents, such as audio files, MIDI files, and sheet music. Even though the concept of MIR literally accommodates both meta data and content-based retrieval methods, it has been widely used as a synonym for content-based music information retrieval.

In MIR the documents are searched and browsed using musical parameters as keys. Depending on the application, you may search for a song by humming its melody into a microphone connected to your retrieval system, or browse a music collection using the similarity of sound, that is, instrumentation and tempo, as your similarity measure.

MIR methods can be divided into two main categories: audio and symbolic methods. Audio retrieval mainly uses digital signal processing methods and addresses the problem of finding music based on its audio content. Common subtasks for audio retrieval methods are, for instance, beat and tempo tracking, frequency estimation, and classification based on these features.

Symbolic MIR is based on the score instead of the audio, and its most widely known application is the so-called query-by-humming retrieval method [Ghias et al. (1995)]. In query-by-humming the music is retrieved by humming a small fragment of a melody, which is then converted into symbolic form for the actual retrieval task. Thus, symbolic MIR is largely based on the similarity of the score.
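As an illustration of this conversion step, the following sketch reduces a transcribed hummed fragment to a coarse up/down/same contour string and matches it by substring search against the contours of a small collection. It is only a toy example under assumed names (to_contour, search) and invented data, not the method of Ghias et al.; a real system would start from noisy pitch estimates and use approximate rather than exact matching.

```python
# Toy query-by-humming sketch: melodies are represented as coarse
# U/D/S (up/down/same) contour strings and matched by substring search.
# All names and the collection below are illustrative assumptions.

def to_contour(pitches):
    """Convert a sequence of MIDI pitch numbers to a U/D/S contour string."""
    steps = []
    for prev, cur in zip(pitches, pitches[1:]):
        steps.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(steps)

def search(query_pitches, collection):
    """Return names of melodies whose contour contains the query contour."""
    query_contour = to_contour(query_pitches)
    return [name for name, pitches in collection.items()
            if query_contour in to_contour(pitches)]

if __name__ == "__main__":
    # A tiny made-up collection of melodies as MIDI pitch sequences.
    collection = {
        "frere_jacques": [60, 62, 64, 60, 60, 62, 64, 60, 64, 65, 67],
        "ode_to_joy":    [64, 64, 65, 67, 67, 65, 64, 62, 60, 60, 62, 64],
    }
    hummed = [52, 52, 53, 55, 55, 53]   # same shape as "ode_to_joy", sung lower
    print(search(hummed, collection))    # ['ode_to_joy']
```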

When comparing text-based IR and MIR, there are several technical differences between the retrieval tasks. Obviously, audio MIR has little to do with text-based IR when it comes to methodology. Symbolic MIR is somewhat closer to text retrieval, and thus certain text retrieval methods have been used quite successfully for symbolic MIR purposes as well. However, compared to text, music is technically more demanding as data. Firstly, as music is almost always polyphonic, it has more dimensions. Moreover, music is transposition-invariant1 and has a rhythm that needs to be taken into account as well. Therefore text-based retrieval methods have proven to be insufficient for solving most symbolic MIR problems.

1 Transposition-invariance is a property that allows a melody to be recognised as the same regardless of the key in which it is played.
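One family of symbolic MIR approaches handles both polyphony and transposition invariance by treating music as a set of (onset, pitch) points. The sketch below, with invented function names and data, looks for pitch transpositions under which a monophonic query occurs exactly inside a polyphonic document; it is a simplification for illustration rather than any particular published algorithm.

```python
# Minimal sketch: polyphonic, transposition-invariant matching by treating
# music as a set of (onset_time, pitch) points and looking for a pitch
# transposition that maps every query point onto some document point.
# Purely illustrative; real systems also handle tempo and inexact matches.

def exact_transposed_matches(query, document):
    """Return the set of pitch transpositions under which the whole
    query point set occurs inside the document point set."""
    doc = set(document)
    transpositions = set()
    q_onset0, q_pitch0 = query[0]
    for d_onset, d_pitch in document:
        shift = d_pitch - q_pitch0
        offset = d_onset - q_onset0
        if all((q_on + offset, q_pi + shift) in doc for q_on, q_pi in query):
            transpositions.add(shift)
    return transpositions

if __name__ == "__main__":
    # A two-voice fragment: (onset in beats, MIDI pitch).
    document = [(0, 60), (0, 48), (1, 64), (1, 50), (2, 67), (2, 48)]
    query = [(0, 65), (1, 69), (2, 72)]      # the melody voice, a fourth higher
    print(exact_transposed_matches(query, document))  # {-5}
```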

    3 A short history of organised evaluation in IR

    3.1 Early evaluation collections

Even though the most well-known organised evaluation framework, TREC, dates back only to 1992, the first attempts at organised evaluation were carried out as early as the late 1950s and early 1960s by Cyril Cleverdon and his colleagues at the Cranfield College of Aeronautics. These experiments are nowadays known as the Cranfield paradigm. The first round of the Cranfield experiments consisted of testing four indexing systems and attracted widespread attention. However, critical examination of the methodology showed that the research design had had at least a partial effect on the results. To answer his critics, Cleverdon devised a second set of experiments, for which he built a test collection consisting of 1,400 documents and 279 queries2 [Rasmussen (2003)]. This Cranfield II collection was subsequently used by other groups for evaluation purposes [Voorhees and Harman (2005)].

2 In some references, such as Voorhees and Harman (2005), the number of queries is given as only 225.

In the following 25 years several other test collections, such as NPL (11,429 documents) and UKCIS (27,361 documents), were built [Rasmussen (2003)]. Even though these collections were of importance for singular evaluation tasks, they lacked certain necessary features. Namely, they were rarely used in an organised fashion for comparing the results of two or more research groups, and they were rather small compared to real-life tasks and collections [Voorhees and Harman (2005)]. Moreover, the data consisted mainly of document surrogates, such as titles and abstracts, instead of full-text documents [Rasmussen (2003)].

    3.2 The birth of organised evaluation: TREC

In 1990 NIST (the National Institute of Standards and Technology) started to build a large test collection for use in evaluating text retrieval technology developed as part of the Defence Advanced Research Projects Agency (DARPA) TIPSTER project. In the following year NIST proposed that this collection should be made available to the research community as an organised evaluation contest, which became TREC. The four main goals of TREC were (and still are):

- To encourage research in text retrieval based on large test collections
- To increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas
- To speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems
- To increase the availability of appropriate evaluation techniques for use by industry and academia, including the development of new evaluation techniques more applicable to current systems

The first TREC was held in November 1992 with 2 tracks (Ad Hoc and Routing) and 25 participating groups. Even though most of the participating groups in TREC-1 concentrated mainly on the necessary rebuilding of their basic systems to handle the huge scale-up in collection size, the door was finally open for group discussions and collective decisions on what improvements were needed for future TRECs [Voorhees and Harman (2005); Harman (2002)].

TREC-2 was held in a similar fashion to TREC-1 only 9 months later, with the same tasks. Even though most of the participants of TREC-1 also participated in TREC-2 and had had time to develop their methods, the results of TREC-2 were not comparable to those of TREC-1, due to the changed topics, that is, the queries.


4.1 MIREX: something new, something borrowed

The first tentative step towards organised evaluation in the area of MIR was taken in 2004, during the 5th ISMIR (International Conference on Music Information Retrieval), in the form of the ISMIR 2004 Audio Description Contest (ADC)3, organised by the hosting organisation of ISMIR 2004, Universitat Pompeu Fabra, Barcelona. The first and only ADC consisted of 5 audio tasks, such as Genre Classification and Tempo Induction.

At the same time, Prof. Downie at the University of Illinois at Urbana-Champaign had started to work on a testbed project based on the earlier discussions. The project was called the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL), and it was first introduced in Tucson, Arizona, in a side meeting of the Joint Conference on Digital Libraries (JCDL 2004). Prof. Downie's idea was to include both symbolic and meta data tasks in the evaluation procedure. After the Audio Description Contest, Prof. Downie's IMIRSEL team started a discussion on the evaluation tasks, which led to the first MIREX contest, held at the 6th ISMIR in London and organised by the IMIRSEL team.4

3 http://ismir2004.ismir.net/ISMIR_Contest.html
4 http://www.music-ir.org/mirex2005/index.php/Main_Page

In 2006 MIREX (also organised by the IMIRSEL team) was held for the second time, at the 7th ISMIR in Victoria, Canada. The second MIREX5 closely resembled MIREX 2005, with one important exception: the statistical significance of the results was measured. This addition was suggested by one of the TREC personnel, Ellen Voorhees, who had participated in the preliminary discussions on MIREX as early as 2002. While some of the tasks have changed since the first Audio Description Contest and new tasks have been added to the task list, the core of MIREX has stayed relatively stable, even though the number of participants has dropped (Table 2).

5 http://www.music-ir.org/mirexwiki/index.php/MIREX_2006
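The paper does not state which significance test was applied in MIREX 2006. As one hedged illustration of how such a comparison could be set up, the sketch below treats each query as a repeated measure and runs a Friedman test over invented per-query scores of three hypothetical systems.

```python
# Hedged sketch: testing whether differences between several systems'
# per-query scores are statistically significant, using a Friedman test
# (queries as repeated measures). The systems and scores are invented.
from scipy.stats import friedmanchisquare

# One list of per-query scores per system (same queries, same order).
scores = {
    "system_a": [0.62, 0.55, 0.71, 0.48, 0.66],
    "system_b": [0.58, 0.51, 0.69, 0.47, 0.60],
    "system_c": [0.40, 0.43, 0.52, 0.35, 0.44],
}

statistic, p_value = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("At least one system differs significantly from the others.")
else:
    print("No significant difference detected at the 0.05 level.")
```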

In MIREX 2005 and 2006 most of the test data was provided by the task leaders, and it was hand-annotated either by the task leaders or their colleagues. In two tasks, however, the ground truth was built using human evaluators. These tasks were Audio Music Similarity and Retrieval (2006) and Symbolic Melodic Similarity (2005 and 2006). For MIREX 2006 the IMIRSEL team built an evaluation tool called Evalutron6000 (Figure 1), which will also be used in future MIREX contests.

Even though MIREX has taken TREC as its pattern, which can be seen in its organisation and the evaluation measures used, it has MIR-specific challenges that have not been solved thus far. The biggest issue is the lack of realistically sized (audio) test collections, due to copyright issues. Even though Naxos6 has kindly offered their database for the use of the research community, it contains only a certain type of music and thus cannot be used for genre classification or artist identification tasks. Moreover, since the contest is run by a single research project instead of a national institute, the lack of funding prevents buying a large collection of popular music. However, two collections, USPOP and USCRAP, have been purchased as CDs and donated to the community by research laboratories.

6 http://www.naxos.com/

In the area of symbolic MIR the situation with test collections is somewhat different. There are reasonably, though not realistically, sized MIDI collections that are copyright-free. However, the quality of the MIDI files varies dramatically, and the size of these collections grows more slowly.

4.2 Organising a MIREX task: endless hours of fun?

The author of this paper was asked to co-organise one of the MIREX 2006 tasks, namely Symbolic Melodic Similarity (SMS), with Rainer Typke (University of Utrecht, The Netherlands), based on earlier discussions with Mr. Typke and Prof. Downie about the organisation of the task and the publicity of the test data. The practical issues of organising an evaluation task for an evaluation contest are discussed below, based on the experiences gained in MIREX 2006.

Planning an evaluation task Before the task leaders can start planning an evaluation task, they need to agree on the task with the IMIRSEL team. In practice, this is done by assuring the IMIRSEL team that the leaders have access to suitable test data or are willing to build the collection as the task proceeds. When the IMIRSEL team agrees on the task, the task leaders start a wiki page on which they write down the description of the task, the requirements for the participating algorithms (including the interfaces), and an overall description of the test collection and queries to be used in the task. The task leaders also start an email list for potential participants and other interested members of the research community, through which the list members may share their opinions and give suggestions concerning the task.

In MIREX 2006 the SMS task consisted of three separate subtasks: a monophonic RISM UK collection task, a polyphonic karaoke collection task, and a polyphonic mixed collection task. In the monophonic7 subtask both the query and the data were monophonic, and the test collection consisted of 15,000 short incipits of melodies. There were 6 queries, half of which were written (and thus quantised) and the other half hummed or whistled. In both of the polyphonic tasks the query was monophonic and the data polyphonic. The karaoke collection consisted of 1,000 karaoke files and the mixed collection of 10,000 general polyphonic music files. There were 5 karaoke queries, 3 of which were hummed/whistled and 2 written. Moreover, there were 6 mixed polyphonic queries, 3 hummed/whistled and 3 written.

Compared to its counterpart, Audio Music Similarity and Retrieval, the SMS list was very passive, and the potential participants merely expressed their interest in the task in general. Therefore, the task leaders needed to formulate the task completely by themselves, with only a few suggestions from the other members of the list.

7 A song is monophonic if there is only one note playing at a time.


Preparing the data and the queries When the task was finalised, the task leaders were asked to prepare the data and the queries for the IMIRSEL team. In practice this meant approximately 80-100 hours of searching through and listening to MIDI files. To make the task sensible, each query needed to have at least two matches in the data, which complicated finding suitable queries, especially for the karaoke collection due to its very small size. When a suitable query was found, it was either written or hummed/whistled and encoded in MIDI.

Both the karaoke and RISM UK collections were used as a whole, so there was no need to prepare the data itself in any other fashion. The mixed collection was supposed to be a random selection of 10,000 MIDI files from the Internet. However, to ensure that there were matches in the collection, we needed to first add all the known matches to the collection by hand and then take a random selection to form the rest of the collection. In this phase, all duplicates were removed.
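The scripts used for this preparation are not given in the paper. The sketch below merely illustrates the described procedure, seeding the collection with the known matches, topping it up with a random sample of crawled files, and dropping duplicates; the directory names, target size, and hash-based notion of a duplicate are assumptions.

```python
# Illustrative sketch of assembling the mixed collection as described:
# known query matches are added first, the rest is a random sample of
# crawled MIDI files, and exact duplicates are dropped (here: by MD5 hash).
# Paths, sizes, and the hashing choice are assumptions, not MIREX code.
import hashlib
import random
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def build_collection(known_matches, candidate_pool, target_size=10_000, seed=0):
    collection, seen = [], set()
    random.seed(seed)
    # Hand-picked matches go in first so every query has answers in the data.
    for path in known_matches:
        h = file_hash(path)
        if h not in seen:
            seen.add(h)
            collection.append(path)
    # Fill up with randomly ordered candidates, skipping duplicates.
    for path in random.sample(candidate_pool, len(candidate_pool)):
        if len(collection) >= target_size:
            break
        h = file_hash(path)
        if h not in seen:
            seen.add(h)
            collection.append(path)
    return collection

if __name__ == "__main__":
    known = list(Path("known_matches").glob("*.mid"))   # hypothetical folder
    crawled = list(Path("crawled_midi").glob("*.mid"))  # hypothetical folder
    mixed = build_collection(known, crawled)
    print(f"Mixed collection size: {len(mixed)}")
```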

Running the algorithms When the data and queries were ready, the algorithms submitted by the participants were run for the first time, and the 10 best matches per query were pooled for the evaluation procedure. Since many of the polyphonic pieces were over two minutes long, we asked the participants to return only the exact locations of the matches. The algorithms had severe problems with the MIDI data (due to the quality of both the data and the algorithms), and therefore some of the results needed to be cut into snippets by hand. This was done by the IMIRSEL team.
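A minimal sketch of the pooling step described above: the top-10 lists returned by each submitted algorithm are merged into one deduplicated candidate pool per query for the human evaluators. Function names and the run data are invented; the actual MIREX pooling code is not shown in the paper.

```python
# Sketch of result pooling: for each query, take every system's top-N list
# and merge them into a single deduplicated pool for human evaluation.
# The run data below are invented for illustration.

def pool_results(runs, top_n=10):
    """runs: {system: {query: [doc ids ranked best-first]}} -> {query: pool}."""
    pools = {}
    for system_results in runs.values():
        for query, ranked_docs in system_results.items():
            pool = pools.setdefault(query, [])
            for doc in ranked_docs[:top_n]:
                if doc not in pool:          # keep the first occurrence only
                    pool.append(doc)
    return pools

if __name__ == "__main__":
    runs = {
        "algo_a": {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"]},
        "algo_b": {"q1": ["d7", "d4", "d3"], "q2": ["d2", "d8"]},
    }
    for query, pool in pool_results(runs, top_n=2).items():
        print(query, pool)   # q1 ['d3', 'd7', 'd4'], q2 ['d9', 'd2', 'd8']
```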

Creating the ground truth The ground truth was created by human evaluators. Because of the location of the IMIRSEL team and the quite strict rules set by the US government concerning the use of human evaluators, the number of potential evaluators was very low (21 evaluators for the SMS task). Therefore, each of the evaluators needed to evaluate 13 queries and 15 candidates per query. In practice this took 2-7 hours per evaluator.

In the evaluation procedure each evaluator was asked to compare each result in his or her result subset to the query and rate it using two rating systems. The first rating system had three options: not similar, somewhat similar, and very similar. The second rating system was a continuous scale from 0 to 10.
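Each collected judgment can be thought of as a small record pairing the broad category with the fine-grained score. The sketch below is one possible representation, not the Evalutron6000 data model.

```python
# One possible in-memory representation of a human similarity judgment:
# a broad category (NS / SS / VS) plus a fine-grained score from 0 to 10.
# This is illustrative only, not the Evalutron6000 data model.
from dataclasses import dataclass

BROAD_CATEGORIES = {"NS", "SS", "VS"}   # not / somewhat / very similar

@dataclass
class Judgment:
    query_id: str
    candidate_id: str
    evaluator_id: str
    broad: str      # one of BROAD_CATEGORIES
    fine: float     # 0.0 .. 10.0 on the continuous scale

    def __post_init__(self):
        if self.broad not in BROAD_CATEGORIES:
            raise ValueError(f"unknown broad category: {self.broad}")
        if not 0.0 <= self.fine <= 10.0:
            raise ValueError("fine score must be between 0 and 10")

example = Judgment("q1", "d7", "eval03", broad="SS", fine=6.5)
print(example)
```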

Preparing and publishing the results When the ground truth was completed, the results of the human evaluators were compared to the result lists returned by each algorithm, and 10 different evaluation measures were calculated for each algorithm:

- ADR = Average Dynamic Recall
- NRGB = Normalised Recall at Group Boundaries
- AP = Average Precision (non-interpolated)
- PND = Precision at N Documents
- Fine(1) = Sum of fine-grained human similarity decisions (0-10)
- PSum(1) = Sum of human broad similarity decisions: NS=0, SS=1, VS=2
- WCsum(1) = World Cup scoring: NS=0, SS=1, VS=3 (rewards Very Similar)
- SDsum(1) = Stephen Downie scoring: NS=0, SS=1, VS=4 (strongly rewards Very Similar)
- Greater0(1) = NS=0, SS=1, VS=1 (binary relevance judgement)
- Greater1(1) = NS=0, SS=0, VS=1 (binary relevance judgement using only Very Similar)

(1) Normalised to the range of 0 to 1.
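To make the judgment-based measures concrete, the sketch below computes the broad-category sums (PSum, WCsum, SDsum, Greater0, Greater1) and the fine-grained sum for a single result list, normalising each by the maximum value attainable over that list. The normalisation detail and the data are assumptions for illustration, and the rank-based measures (ADR, NRGB, AP, PND) are omitted.

```python
# Hedged sketch of the judgment-based MIREX measures listed above.
# 'judgments' maps each returned candidate to its broad category and fine
# score; the weighting schemes follow the list above, while the exact
# normalisation (divide by the maximum attainable sum) is an assumption.

BROAD_WEIGHTS = {
    "PSum":     {"NS": 0, "SS": 1, "VS": 2},
    "WCsum":    {"NS": 0, "SS": 1, "VS": 3},
    "SDsum":    {"NS": 0, "SS": 1, "VS": 4},
    "Greater0": {"NS": 0, "SS": 1, "VS": 1},
    "Greater1": {"NS": 0, "SS": 0, "VS": 1},
}

def judgment_measures(result_list, judgments):
    """Compute normalised judgment-based scores for one ranked result list."""
    n = len(result_list)
    scores = {}
    for name, weights in BROAD_WEIGHTS.items():
        total = sum(weights[judgments[doc]["broad"]] for doc in result_list)
        scores[name] = total / (n * max(weights.values()))
    fine_total = sum(judgments[doc]["fine"] for doc in result_list)
    scores["Fine"] = fine_total / (n * 10.0)
    return scores

if __name__ == "__main__":
    # Invented judgments for a 3-item result list.
    judgments = {
        "d3": {"broad": "VS", "fine": 8.5},
        "d7": {"broad": "SS", "fine": 5.0},
        "d4": {"broad": "NS", "fine": 1.0},
    }
    print(judgment_measures(["d3", "d7", "d4"], judgments))
```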

MIREX 2007 Already when planning MIREX 2006 there was discussion about giving up the competitive nature of the contest and making it more collaborative and comparative, instead of announcing only the winners of each task. Because of this, there were no rewards in MIREX 2006. In the MIREX discussion panel held at ISMIR 2006, the participants wanted to keep MIREX this way in 2007 as well. The main complaints concerned the enormous amount of time required for building the ground truth.

To make MIREX 2007 even more collaborative and less competitive, a couple of new ideas were presented at the panel. One of them was to combine several tasks into a continuum. In this type of task, each combination of algorithms submitted to the subtasks would be evaluated and the best combination would be announced. In this way, evaluation of individual algorithms would make way for evaluation of larger systems built from those algorithms.

    5 Conclusion

Evaluation is an important part of the (M)IR field. Unlike the text retrieval community, the multimedia IR communities still lack reasonably sized, publicly available test collections. However, the first attempts to evaluate retrieval methods in an organised fashion have been promising in the fields of video and music information retrieval.

Organising an evaluation contest is hard (and often volunteer) work with no reward other than the general acceptance of the research community. However, a completed task is a reward in itself. Even though there were minor difficulties in organising MIREX 2006, mainly the lack of communication between the IMIRSEL team and the task leaders and the quality of the data used, the contest itself proved to be successful. There are problems to be tackled in MIREX 2007 and in the field in general, such as the copyright issues and making the evaluation contest more collaborative and less competitive.


    Bibliography

Ghias, A., J. Logan, D. Chamberlin, and B. C. Smith (1995). Query by humming: musical information retrieval in an audio database. In ACM Multimedia 95, pp. 231-236.

Harman, D. K. (2002). The development and evolution of TREC and DUC. In K. Oyama, E. Ishida, and N. Kando (Eds.), Proceedings of the Third NTCIR Workshop. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/index.html

Rasmussen, E. (2003). Evaluation in information retrieval. In J. S. Downie (Ed.), The MIR/MDL Evaluation Project White Paper Collection, 3rd ed. http://www.music-ir.org/evaluation/wp.html

Voorhees, E. M. and D. K. Harman (2005). The text retrieval conference. In E. M. Voorhees and D. K. Harman (Eds.), TREC: Experiment and Evaluation in Information Retrieval, pp. 3-20. The MIT Press.