Standard Datasets in IR
By: S.J Brenda
13-11-2014 1
What is a dataset?
A collection of documents.
A standard IR dataset includes:
Document set: a collection of documents
Query set: a set of information needs, i.e. questions asking the IR system for results
Relevance judgement set: judgements of the relevance between the result set and the queries
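In code, such a test collection can be pictured as three parallel structures. A minimal sketch with hypothetical contents:

```python
# A toy test collection: documents, queries (information needs),
# and relevance judgements linking the two. All content is hypothetical.
documents = {
    "d1": "aerodynamic heating of wings at high speed",
    "d2": "boundary layer flow over a flat plate",
    "d3": "history of the printing press",
}
queries = {
    "q1": "heat transfer on aircraft wings",
}
# qrels: for each query, the set of documents judged relevant
qrels = {
    "q1": {"d1"},
}
```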
Standard dataset
A test collection consists of:
A static set of documents
A set of information needs/topics
A set of known relevant documents for each of the information needs
It should preferably be large scale, properly validated, rapidly growing, and highly diverse.
Why we need test collections
To test the system and determine how well IR systems perform
To compare the performance of an IR system with that of other systems
To compare search algorithms with each other
To compare search strategies with each other
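Comparing two systems over the same test collection often comes down to computing a metric such as precision at rank k against the relevance judgements. A minimal sketch with hypothetical run data:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked[:k]
    return sum(1 for d in top_k if d in relevant) / k

# Hypothetical judgements and ranked output from two systems for one query
relevant = {"d1", "d4", "d7"}
system_a = ["d1", "d2", "d4", "d3", "d7"]
system_b = ["d2", "d3", "d5", "d1", "d6"]

print(precision_at_k(system_a, relevant, 5))  # 0.6
print(precision_at_k(system_b, relevant, 5))  # 0.2
```

On this toy data, system A would be judged the better of the two at rank 5.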
Standard Datasets in IR
The Cranfield collection
Text Retrieval Conference (TREC)
Gov2
NII Test Collection for IR system (NTCIR)
Cross Language Evaluation Forum (CLEF)
Reuters
20Newsgroups
The Cranfield collection
This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness.
Created in the late 1960s, the Cranfield corpus is a relatively small information retrieval corpus consisting of 1,400 abstracts on aeronautical engineering topics. The documents contain a total of 136,935 terms from a vocabulary of size 4,632 (after stop word removal).
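The distinction between total terms and vocabulary size can be illustrated with a toy computation (a hypothetical stop word list and two tiny abstracts, not the actual Cranfield data):

```python
from collections import Counter

# Total term occurrences vs. distinct vocabulary, after stop word removal.
STOP_WORDS = {"the", "of", "a", "at", "on"}

abstracts = [
    "the flow of air at the leading edge of a wing",
    "the flow of heat on a flat plate surface",
]

terms = [t for doc in abstracts for t in doc.split() if t not in STOP_WORDS]
vocabulary = Counter(terms)

print(len(terms))       # total terms after stop word removal: 10
print(len(vocabulary))  # vocabulary size (distinct terms): 9
```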
The Cranfield collection
The Cranfield corpus also contains a set of 225 query strings with ground truth relevance judgements.
Source: abstracts of scientific papers from the aeronautics research field, 1945-1962
The Cranfield collection
Experimental assumptions:
Relevance = topical similarity
Static information need
All documents equally desirable
Relevance judgments are complete and representative of the user population
Drawbacks:
Lack of comparison between different systems
The collection is too small
TREC
The TREC corpus is a large data set consisting of articles taken from a variety of newswire and other sources.
It consists of 528,155 documents spanning a total of 165,363,765 terms from a vocabulary of size 629,469 (after stop word removal).
Also provided are a number of query strings consisting of three parts: a title, a description, and a narrative. Ground truth judgements are available indicating whether or not each document is relevant to each query.
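TREC topics are distributed in an SGML-like markup. A minimal sketch of extracting the three fields from one abridged, hypothetical topic (the exact layout varies across tracks):

```python
import re

# One TREC-style topic; the text here is hypothetical.
topic = """<top>
<num> Number: 301
<title> international organized crime
<desc> Description:
Identify organizations that participate in international criminal activity.
<narr> Narrative:
A relevant document must name an organization and describe its activity.
</top>"""

def parse_topic(text):
    """Extract number, title, description, and narrative from one topic."""
    num = re.search(r"<num>\s*Number:\s*(\d+)", text).group(1)
    title = re.search(r"<title>(.*)", text).group(1).strip()
    desc = re.search(r"<desc>[^\n]*\n(.*?)\n<narr>", text, re.S).group(1).strip()
    narr = re.search(r"<narr>[^\n]*\n(.*?)\n</top>", text, re.S).group(1).strip()
    return {"num": num, "title": title, "desc": desc, "narr": narr}

print(parse_topic(topic)["title"])  # international organized crime
```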
TREC
Designed for large data collections
Retrieval methods are not tied to any application:
Ad hoc query: as in a library environment
Routing query: filters the result set
Useful for IR system designers
Good for dedicated searchers, not novice searchers
TREC Documents
TREC Relevance Judgement
Three methods:
Judge all documents, for all topics
Judge a random sample of documents
Pooling:
• Divide each result set by topic
• Select the top-ranked documents
• For each topic, merge the results with those of the other systems
• Sort based on document number
• Remove duplicate documents
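The pooling steps above can be sketched as follows (hypothetical run data; real TREC pools typically go much deeper, e.g. the top 100 per run):

```python
def pool(runs, depth):
    """Build a judging pool: take the top-`depth` documents from each
    system's ranked run for a topic, merge them, and drop duplicates."""
    pooled = set()
    for run in runs:
        pooled.update(run[:depth])
    # Sort by document number so assessors see a neutral ordering
    return sorted(pooled)

# Hypothetical ranked runs from three systems for one topic
run_a = ["d12", "d03", "d07", "d21"]
run_b = ["d03", "d15", "d12", "d30"]
run_c = ["d40", "d07", "d03", "d02"]

print(pool([run_a, run_b, run_c], depth=3))
# ['d03', 'd07', 'd12', 'd15', 'd40']
```

Only the pooled documents are judged; everything outside the pool is assumed non-relevant.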
TREC tracks
Contextual Suggestion Track: To investigate search techniques for complex information needs that are highly dependent on context and user interests.
Clinical Decision Support Track: To investigate techniques for linking medical cases to information relevant for patient care.
Federated Web Search Track: To investigate techniques for the selection and combination of search results from a large number of real online web search services.
Knowledge Base Acceleration Track: To develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams.
Microblog Track: To examine the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter.
Session Track: To develop methods for measuring multiple-query sessions where information needs drift or get more or less specific over the session.
Temporal Summarization Track: To develop systems that allow users to efficiently monitor the information associated with an event over time.
Web Track: To explore information seeking behaviours common in general web search.
Chemical Track: To develop and evaluate technology for large-scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, specifically patent searchers and chemists.
Crowdsourcing Track: To provide a collaborative venue for exploring crowdsourcing methods both for evaluating search and for performing search tasks.
Genomics Track: To study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last run at TREC 2007.
Enterprise Track: To study search over the data of an organization to complete some task. Last run at TREC 2008.
Entity Track: To perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not well modeled as ad hoc document search.
Cross-Language Track: To investigate the ability of retrieval systems to find topically relevant documents regardless of source language.
FedWeb Track: To select the best resources to forward a query to, and merge the results so that the most relevant are at the top.
Filtering Track: To make a binary decision on retrieving each new incoming document, given a stable information need.
HARD Track: To achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context.
Interactive Track: To study user interaction with text retrieval systems.
Legal Track: To develop search technology that meets the needs of lawyers to engage in effective discovery in digital document collections.
Medical Records Track: To explore methods for searching unstructured information found in patient medical records.
Novelty Track: To investigate systems' abilities to locate new (i.e., non-redundant) information.
Question Answering Track: To achieve more than just document retrieval by answering factoid, list, and definition-style questions.
Robust Retrieval Track: To focus on individual topic effectiveness.
Relevance Feedback Track: To further the deep evaluation of relevance feedback processes.
Spam Track: To provide a standard evaluation of current and proposed spam filtering approaches.
Terabyte Track: To investigate whether/how the IR community can scale traditional IR test-collection-based evaluation to significantly larger collections.
Video Track: To research automatic segmentation, indexing, and content-based retrieval of digital video. In 2003, this track became its own independent evaluation named TRECVID.
Advantages:
Larger collections
Better results
Usable in many tasks: filtering, web search, video retrieval
Related collections and campaigns: CLEF, NTCIR, GOV2
Drawback: complete judgement is impossible; pooling can overcome this
Gov2
A TREC corpus consisting of 25 million documents from US government domains and government-related websites
TREC topics 701-850 are used for evaluation
One of the largest web collections easily available for research purposes
NTCIR - NII Test Collection for IR system
Built various test collections of sizes similar to the TREC collections
Focus on East Asian languages and cross-language information retrieval: queries in one language against document collections in one or more other languages
CLIR: IR/CLIR test collection
The CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, Korean, and English (CJKE), such as:
Multilingual CLIR (MLIR)
Bilingual CLIR (BLIR)
Single-language (monolingual) IR (SLIR)
CLQA (Cross-Language Q&A Test Collection)
In the CLQA task, the following subtasks were conducted:
1. Japanese to English (J-E) subtask: find answers to Japanese questions in English documents.
2. Chinese to English (C-E) subtask: find answers to Chinese questions in English documents.
3. English to Japanese (E-J) subtask: find answers to English questions in Japanese documents.
4. English to Chinese (E-C) subtask: find answers to English questions in Chinese documents.
5. Chinese to Chinese (C-C) subtask: find answers to Chinese questions in Chinese documents.
ACLIA (Advanced Cross-Lingual Information Retrieval and Question Answering)
The ACLIA test collection can be used for experiments in complex question answering and information retrieval between Chinese (Simplified (CS), Traditional (CT)), Japanese (JA), and English (EN), such as:
CCLQA (Complex Cross-Lingual Question Answering): cross-lingual question answering (EN-JA/EN-CS/EN-CT subtasks) and monolingual question answering (JA-JA, CS-CS, and CT-CT subtasks)
IR4QA (Information Retrieval for Question Answering): cross-lingual information retrieval (EN-JA/EN-CS/EN-CT subtasks) and monolingual information retrieval (JA-JA, CS-CS, and CT-CT subtasks)
CQA (Community QA Test Collection)
This test collection can be used to evaluate the quality of answers on a community QA (CQA) site. It consists of the following data:
1,500 questions extracted from Yahoo Chiebukuro data version 1
Assessment results by four assessors
ID lists, best-answer lists, category information, etc.
CROSSLINK (Cross-lingual Link Discovery)
The Crosslink test collection can be used for experiments in cross-lingual link discovery from English to CJK (Chinese, Japanese, and Korean) documents, such as:
English to Chinese CLLD (E2C) subtask
English to Japanese CLLD (E2J) subtask
English to Korean CLLD (E2K) subtask
INTENT (INTENT-1)
The INTENT (INTENT-1) test collections are the following:
(a) NTCIR-9 INTENT Chinese Subtopic Mining Test Collection
(b) NTCIR-9 INTENT Japanese Subtopic Mining Test Collection
(c) NTCIR-9 INTENT Chinese Document Ranking Test Collection
(d) NTCIR-9 INTENT Japanese Document Ranking Test Collection
Subtopic Mining Subtask: given a query, return a ranked list of "subtopic strings."
Document Ranking Subtask: given a query, return a diversified list of web pages.
1CLICK
The 1CLICK (1CLICK-1) test collection was used at the NTCIR-9 1CLICK (One Click Access) task.
The NTCIR-9 1CLICK task was: given a Japanese query, return a 500-character or 140-character textual output that aims to satisfy the user as quickly as possible. Important pieces of information should be prioritized and the amount of text the user has to read should be minimized.
Math
The NTCIR Math Task aims to explore search methods tailored to mathematical content through the design of suitable search tasks and the construction of evaluation datasets.
Math Retrieval Subtask: given a document collection, retrieve relevant mathematical formulae or documents for a given query.
Math Understanding Subtask: extract natural language descriptions of mathematical expressions in a document for their semantic interpretation, with relevance judgments.
*added Oct/14/2014*
MuST ("summary and visualization of trend information" test collection)
The MuST corpus (untagged) consists of the 581 newspaper articles (27 topics), drawn from two years of data, that were used to create the task data.
The list of articles is assumed to be obtained by information retrieval.
Articles are annotated with respect to their content: the extraction results of important sentences for summaries, and the corresponding semantic processing results.
Opinion (Opinion Analysis Task Test Collection)
There are 32 topics ranging from 1998 to 2001, each in English, Chinese, and Japanese.
The annotations assign opinion tags to sentences in the selected documents that are relevant to the topics.
The annotated documents are distributed separately in a sentence-segmented format that aligns with the sentence numbering in the CSV annotation file.
MOAT (Multilingual Opinion Analysis Test Collection)
The MOAT test collection can be used for experiments in multilingual opinion analysis in Japanese, English, and Chinese (simplified/traditional) (CstJE), such as:
Opinion judgement
Polarity (positive/negative/neutral) judgement
Opinion holder identification
Opinion target identification
Relevance judgement
PATENT (IR Test Collection)
• The collection consists of document data (Japanese patent applications 1993-1997 and Patent Abstracts of Japan 1993-1997), 101 Japanese search topics (34 of which were translated into English, Simplified and Traditional Chinese, and Korean), and relevance judgments for each search topic.
• Japanese patent applications published in 1993-1997 are used for the Retrieval task.
• Each search topic is a claim extracted from Japanese patent applications.
Q&A data Test Collection
The collection includes:
Document data: Mainichi Newspaper articles 1998-2001
Task data: questions (200, in Japanese), system outputs, human outputs, and sample answers
SpokenDoc-2 (IR for Spoken Documents)
Covers lecture speech, spoken passage retrieval, and conversation detection.
The test collection includes:
Spoken Term Detection (STD) task
Inexistent Spoken Term Detection (iSTD) task
Spoken Content Retrieval (SCR) task
Scoring tool for the STD and iSTD tasks
Scoring tool for the SCR task
IR and Term Extraction/Role Analysis Test Collections
The IR test collection includes document data (author abstracts from the Academic Conference Paper Database, 1988-1997: abstracts of papers presented at academic conferences hosted by any of 65 academic societies in Japan; about 330,000 documents, more than half of them English-Japanese paired), 83 search topics (Japanese), and relevance judgements.
The collection can be used for retrieval experiments in Japanese text retrieval and in CLIR, searching either English documents or Japanese-English documents using Japanese topics.
The term extraction test collection includes a tagged corpus based on 2,000 Japanese documents selected from the above IR test collection.
SUMM: (Text Summarization Test Collection)
The collection includes:
Document data: Japanese newspaper articles (Mainichi Newspaper, 1998-1999)
Model summaries: single-document summaries (for each of 60 documents, 7 types of single-document summaries of different lengths, prepared with different strategies by 3 analysts) and multi-document summaries (for each of 50 document collections, summaries of 2 lengths, prepared by 3 analysts)
WEB (IR Test Collection)
WEB test collection consists of "Document Data" which is a collection of text data processed from the crawled documents provided mainly on "Web servers of Japan" and "Task Data" which is a collection of search topics and the relevance judgments of the documents.
"Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi 2)'. "Document Data", named 'NW1000G-04', consists of approximately 1,400 GB of web documents, about 100 million in number.
The CLEF Test Suite
The CLEF Test Suite contains the data used for the main tracks of the CLEF campaigns carried out from 2000 to 2003:
Multilingual text retrieval
Bilingual text retrieval
Monolingual text retrieval
Domain-specific text retrieval
The CLEF Test Suite
The CLEF Test Suite is composed of:
• The multilingual document collections
• Step-by-step documentation on how to perform a system evaluation (in English)
• Tools for results computation
• Multilingual sets of topics
• Multilingual sets of relevance assessments
• Guidelines for participants (in English)
• Tables of the results obtained by the participants
• Publications
The CLEF Test Suite
Multilingual corpora:
• English
• French
• German
• Italian
• Spanish
• Dutch
• Swedish
• Finnish
• Russian
• Portuguese
CLEF AdHoc-News Test Suites (2004-2008)
The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
CLEF Domain Specific Test Suites (2004-2008)
The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
CLEF Question Answering Test Suites (2003-2008)
The CLEF Question Answering Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.
Reuters Corpora
Reuters is the largest international text and television news agency. Its editorial division produces 11,000 stories a day in 23 languages.
Stories are both distributed in real time and made available via online databases and other archival products.
Datasets:
Reuters-21578: used in text classification
RCV1
RCV2
TRC2
RCV1
In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems.
Known as "Reuters Corpus, Volume 1" or RCV1, it is significantly larger than the older Reuters-21578 collection.
RCV1 is an archive of over 800,000 manually categorized newswire stories.
RCV1 is drawn from one of those online databases. It was intended to consist of all and only the English-language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997.
RCV2
A multilingual corpus covering 1996-08-20 to 1997-08-19.
Contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish).
Thomson Reuters Text Research Collection (TRC2)
The TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 to 2009-02-28.
It was initially made available to participants of the 2009 Blog Track at the Text Retrieval Conference (TREC), to supplement the BLOGS08 corpus (which contains the results of a large blog crawl carried out at the University of Glasgow).
20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
It was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection there.
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
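A minimal sketch of the kind of text classification experiment the collection supports: a toy multinomial naive Bayes classifier, trained here on hypothetical mini-posts rather than the real corpus (a real run would use the ~11,000 training posts and all 20 group labels):

```python
import math
from collections import Counter, defaultdict

# Hypothetical two-class training data standing in for newsgroup posts.
train = [
    ("rec.autos", "new engine and wheels for my car"),
    ("rec.autos", "car engine oil change"),
    ("sci.space", "orbit of the space shuttle"),
    ("sci.space", "nasa launch orbit telemetry"),
]

word_counts = defaultdict(Counter)  # per-class word frequencies
class_counts = Counter()            # class priors
vocab = set()
for label, text in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    best, best_score = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict("car engine repair"))  # rec.autos
```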
20 Newsgroups
Class # train docs # test docs Total # docs
alt.atheism 480 319 799
comp.graphics 584 389 973
comp.os.ms-windows.misc 572 394 966
comp.sys.ibm.pc.hardware 590 392 982
comp.sys.mac.hardware 578 385 963
comp.windows.x 593 392 985
misc.forsale 585 390 975
rec.autos 594 395 989
rec.motorcycles 598 398 996
rec.sport.baseball 597 397 994
rec.sport.hockey 600 399 999
sci.crypt 595 396 991
sci.electronics 591 393 984
sci.med 594 396 990
sci.space 593 394 987
soc.religion.christian 598 398 996
talk.politics.guns 545 364 909
talk.politics.mideast 564 376 940
talk.politics.misc 465 310 775
talk.religion.misc 377 251 628
Total 11293 7528 18821