Standard Datasets in IR
By: S.J Brenda
13-11-2014 1
What is a dataset?
A collection of documents.
A standard IR dataset includes:
Document set: a collection of documents
Query set: a set of information needs, i.e. questions asking the IR system for results
Relevance judgement set: judgements of the relevance between the result set and the queries
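In code, such a test collection can be pictured as three parallel structures. A minimal sketch with hypothetical contents:

```python
# A toy test collection: documents, queries (information needs),
# and relevance judgements linking the two. All content is hypothetical.
documents = {
    "d1": "aerodynamic heating of wings at high speed",
    "d2": "boundary layer flow over a flat plate",
    "d3": "history of the printing press",
}
queries = {
    "q1": "heat transfer on aircraft wings",
}
# qrels: for each query, the set of documents judged relevant
qrels = {
    "q1": {"d1"},
}
```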
Standard dataset
A test collection consists of:
A static set of documents
A set of information needs/topics
A set of known relevant documents for each of the information needs
It should preferably be large scale, properly validated, rapidly growing, and highly diverse.
Why we need test collections
To test the system and determine how well IR systems perform
To compare the performance of an IR system with that of other systems
To compare search algorithms with each other
To compare search strategies with each other
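Comparing two systems over the same test collection often comes down to computing a metric such as precision at rank k against the relevance judgements. A minimal sketch with hypothetical run data:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked[:k]
    return sum(1 for d in top_k if d in relevant) / k

# Hypothetical judgements and ranked output from two systems for one query
relevant = {"d1", "d4", "d7"}
system_a = ["d1", "d2", "d4", "d3", "d7"]
system_b = ["d2", "d3", "d5", "d1", "d6"]

print(precision_at_k(system_a, relevant, 5))  # 0.6
print(precision_at_k(system_b, relevant, 5))  # 0.2
```

On this toy data, system A would be judged the better of the two at rank 5.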
Standard Datasets in IR
The Cranfield collection
Text Retrieval Conference (TREC)
Gov2
NII Test Collection for IR system (NTCIR)
Cross Language Evaluation Forum (CLEF)
Reuters
20Newsgroups
The Cranfield collection
This was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness.
Created in the late 1960s, the Cranfield corpus is a relatively small information retrieval corpus consisting of 1,400 abstracts on aeronautical engineering topics. The documents contain a total of 136,935 terms from a vocabulary of size 4,632 (after stop word removal).
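The distinction between total terms and vocabulary size can be illustrated with a toy computation (a hypothetical stop word list and two tiny abstracts, not the actual Cranfield data):

```python
from collections import Counter

# Total term occurrences vs. distinct vocabulary, after stop word removal.
STOP_WORDS = {"the", "of", "a", "at", "on"}

abstracts = [
    "the flow of air at the leading edge of a wing",
    "the flow of heat on a flat plate surface",
]

terms = [t for doc in abstracts for t in doc.split() if t not in STOP_WORDS]
vocabulary = Counter(terms)

print(len(terms))       # total terms after stop word removal: 10
print(len(vocabulary))  # vocabulary size (distinct terms): 9
```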
The Cranfield collection
The Cranfield corpus also contains a set of 225 query strings with ground truth relevance judgements.
Source: abstracts of scientific papers from the aeronautics research field, 1945-1962
The Cranfield collection
Experimental assumptions:
Relevance = topical similarity
Static information need
All documents equally desirable
Relevance judgments are complete and representative of the user population
Drawbacks:
Lack of comparison between different systems
The collection is too small
TREC
The TREC corpus is a large data set consisting of articles taken from a variety of newswire and other sources.
It consists of 528,155 documents spanning a total of 165,363,765 terms from a vocabulary of size 629,469 (after stop word removal).
Also provided are a number of query strings consisting of three parts: a title, a description, and a narrative. Ground truth judgements are available indicating whether or not each document is relevant to each query.
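TREC topics are distributed in an SGML-like markup. A minimal sketch of extracting the three fields from one abridged, hypothetical topic (the exact layout varies across tracks):

```python
import re

# One TREC-style topic; the text here is hypothetical.
topic = """<top>
<num> Number: 301
<title> international organized crime
<desc> Description:
Identify organizations that participate in international criminal activity.
<narr> Narrative:
A relevant document must name an organization and describe its activity.
</top>"""

def parse_topic(text):
    """Extract number, title, description, and narrative from one topic."""
    num = re.search(r"<num>\s*Number:\s*(\d+)", text).group(1)
    title = re.search(r"<title>(.*)", text).group(1).strip()
    desc = re.search(r"<desc>[^\n]*\n(.*?)\n<narr>", text, re.S).group(1).strip()
    narr = re.search(r"<narr>[^\n]*\n(.*?)\n</top>", text, re.S).group(1).strip()
    return {"num": num, "title": title, "desc": desc, "narr": narr}

print(parse_topic(topic)["title"])  # international organized crime
```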
TREC
Designed for large data collections
Retrieval methods are not tied to any application:
Ad hoc query: as in a library environment
Routing query: filters the result set
Useful for IR system designers
Good for dedicated searchers, not novice searchers
TREC Documents
TREC Relevance Judgement
Three methods:
Judge all documents, for all topics
Judge a random sample of documents
Pooling:
• Divide each result set by topic
• Select the top-ranked documents
• For each topic, merge the results with those of the other systems
• Sort based on document number
• Remove duplicate documents
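The pooling steps above can be sketched as follows (hypothetical run data; real TREC pools typically go much deeper, e.g. the top 100 per run):

```python
def pool(runs, depth):
    """Build a judging pool: take the top-`depth` documents from each
    system's ranked run for a topic, merge them, and drop duplicates."""
    pooled = set()
    for run in runs:
        pooled.update(run[:depth])
    # Sort by document number so assessors see a neutral ordering
    return sorted(pooled)

# Hypothetical ranked runs from three systems for one topic
run_a = ["d12", "d03", "d07", "d21"]
run_b = ["d03", "d15", "d12", "d30"]
run_c = ["d40", "d07", "d03", "d02"]

print(pool([run_a, run_b, run_c], depth=3))
# ['d03', 'd07', 'd12', 'd15', 'd40']
```

Only the pooled documents are judged; everything outside the pool is assumed non-relevant.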
TREC tracks
Contextual Suggestion Track: To investigate search techniques for complex information needs that are highly dependent on context and user interests.
Clinical Decision Support Track: To investigate techniques for linking medical cases to information relevant for patient care.
Federated Web Search Track: To investigate techniques for the selection and combination of search results from a large number of real online web search services.
Knowledge Base Acceleration Track: To develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams.
Microblog Track: To examine the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter.
Session Track: To develop methods for measuring multiple-query sessions where information needs drift or get more or less specific over the session.
Temporal Summarization Track: To develop systems that allow users to efficiently monitor the information associated with an event over time.
Web Track: To explore information seeking behaviours common in general web search.
Chemical Track: To develop and evaluate technology for large-scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, specifically patent searchers and chemists.
Crowdsourcing Track: To provide a collaborative venue for exploring crowdsourcing methods both for evaluating search and for performing search tasks.
Genomics Track: To study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last run at TREC 2007.
Enterprise Track: To study search over the data of an organization to complete some task. Last run at TREC 2008.
Entity Track: To perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not well modeled as ad hoc document search.
Cross-Language Track: To investigate the ability of retrieval systems to find topically relevant documents regardless of source language.
FedWeb Track: To select the best resources to forward a query to, and merge the results so that the most relevant are at the top.
Filtering Track: To make a binary decision on retrieving each new incoming document, given a stable information need.
HARD Track: To achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context.
Interactive Track: To study user interaction with text retrieval systems.
Legal Track: To develop search technology that meets the needs of lawyers to engage in effective discovery in digital document collections.
Medical Records Track: To explore methods for searching unstructured information found in patient medical records.
Novelty Track: To investigate systems' abilities to locate new (i.e., non-redundant) information.
Question Answering Track: To achieve more than just document retrieval by answering factoid, list, and definition-style questions.
Robust Retrieval Track: To focus on individual topic effectiveness.
Relevance Feedback Track: To further the deep evaluation of relevance feedback processes.
Spam Track: To provide a standard evaluation of current and proposed spam filtering approaches.
Terabyte Track: To investigate whether/how the IR community can scale traditional IR test-collection-based evaluation to significantly larger collections.
Video Track: To research automatic segmentation, indexing, and content-based retrieval of digital video. In 2003, this track became its own independent evaluation named TRECVID.
Advantages:
Larger collections
Better results
Usable in many tasks: filtering, web search, video retrieval
Related collections and campaigns: CLEF, NTCIR, GOV2
Drawback: complete judgement is impossible; pooling can overcome this
Gov2
A TREC corpus consisting of 25 million documents from US government domains and government-related websites
TREC topics 701-850 are used for evaluation
One of the largest web collections easily available for research purposes
NTCIR - NII Test Collection for IR system
Built various test collections of sizes similar to the TREC collections
Focus on East Asian languages and cross-language information retrieval: queries in one language against document collections in one or more other languages
CLIR: IR/CLIR test collection
The CLIR test collection can be used for experiments in cross-lingual information retrieval between Chinese (traditional), Japanese, Korean, and English (CJKE), such as:
Multilingual CLIR (MLIR)
Bilingual CLIR (BLIR)
Single-language (monolingual) IR (SLIR)
CLQA (Cross-Language Q&A Test Collection)
In the CLQA task, the following subtasks were conducted:
1. Japanese to English (J-E) subtask: find answers to Japanese questions in English documents.
2. Chinese to English (C-E) subtask: find answers to Chinese questions in English documents.
3. English to Japanese (E-J) subtask: find answers to English questions in Japanese documents.
4. English to Chinese (E-C) subtask: find answers to English questions in Chinese documents.
5. Chinese to Chinese (C-C) subtask: find answers to Chinese questions in Chinese documents.
ACLIA (Advanced Cross-Lingual Information Retrieval and Question Answering)
The ACLIA test collection can be used for experiments in complex question answering and information retrieval between Chinese (Simplified (CS), Traditional (CT)), Japanese (JA), and English (EN), such as:
CCLQA (Complex Cross-Lingual Question Answering): cross-lingual question answering (EN-JA/EN-CS/EN-CT subtasks) and monolingual question answering (JA-JA, CS-CS, and CT-CT subtasks)
IR4QA (Information Retrieval for Question Answering): cross-lingual information retrieval (EN-JA/EN-CS/EN-CT subtasks) and monolingual information retrieval (JA-JA, CS-CS, and CT-CT subtasks)
CQA (Community QA Test Collection)
This test collection can be used to evaluate the quality of answers on a community QA (CQA) site. It consists of the following data:
1,500 questions extracted from Yahoo Chiebukuro data version 1
Assessment results by four assessors
ID lists, best-answer lists, category information, etc.
CROSSLINK (Cross-lingual Link Discovery)
The Crosslink test collection can be used for experiments in cross-lingual link discovery from English to CJK (Chinese, Japanese, and Korean) documents, such as:
English to Chinese CLLD (E2C) subtask
English to Japanese CLLD (E2J) subtask
English to Korean CLLD (E2K) subtask
INTENT (INTENT-1)
The INTENT (INTENT-1) test collections are the following:
(a) NTCIR-9 INTENT Chinese Subtopic Mining Test Collection
(b) NTCIR-9 INTENT Japanese Subtopic Mining Test Collection
(c) NTCIR-9 INTENT Chinese Document Ranking Test Collection
(d) NTCIR-9 INTENT Japanese Document Ranking Test Collection
Subtopic Mining Subtask: given a query, return a ranked list of "subtopic strings."
Document Ranking Subtask: given a query, return a diversified list of web pages.
1CLICK
The 1CLICK (1CLICK-1) test collection was used at the NTCIR-9 1CLICK (One Click Access) task.
The NTCIR-9 1CLICK task was: given a Japanese query, return a 500-character or 140-character textual output that aims to satisfy the user as quickly as possible. Important pieces of information should be prioritized and the amount of text the user has to read should be minimized.
Math
The NTCIR Math Task aims to explore search methods tailored to mathematical content through the design of suitable search tasks and the construction of evaluation datasets.
Math Retrieval Subtask: given a document collection, retrieve relevant mathematical formulae or documents for a given query.
Math Understanding Subtask: extract natural language descriptions of mathematical expressions in a document for their semantic interpretation, with relevance judgments.
*added Oct/14/2014*
MuST ("summary and visualization of trend information" test collection)
The MuST corpus (untagged) consists of the 581 newspaper articles (27 topics), drawn from two years of data, that were used to create the task data.
The list of articles is assumed to be obtained by information retrieval.
Articles are annotated with respect to their content: the extraction results of important sentences for summaries, and the corresponding semantic processing results.
Opinion (Opinion Analysis Task Test Collection)
There are 32 topics ranging from 1998 to 2001, each in English, Chinese, and Japanese.
The annotations assign opinion tags to sentences in the selected documents that are relevant to the topics.
The annotated documents are distributed separately in a sentence-segmented format that aligns with the sentence numbering in the CSV annotation file.
MOAT (Multilingual Opinion Analysis Test Collection)
The MOAT test collection can be used for experiments in multilingual opinion analysis in Japanese, English, and Chinese (simplified/traditional) (CstJE), such as:
Opinion judgement
Polarity (positive/negative/neutral) judgement
Opinion holder identification
Opinion target identification
Relevance judgement
PATENT (IR Test Collection)
• The collection consists of document data (Japanese patent applications 1993-1997 and Patent Abstracts of Japan 1993-1997), 101 Japanese search topics (34 of which were translated into English, Simplified and Traditional Chinese, and Korean), and relevance judgments for each search topic.
• Japanese patent applications published in 1993-1997 are used for the Retrieval task.
• Each search topic is a claim extracted from Japanese patent applications.
Q&A data Test Collection
The collection includes:
Document data: Mainichi Newspaper articles 1998-2001
Task data: questions (200, in Japanese), system outputs, human outputs, and sample answers
SpokenDoc-2 (IR for Spoken Documents)
Covers lecture speech, spoken passage retrieval, and conversation detection.
The test collection includes:
Spoken Term Detection (STD) task
Inexistent Spoken Term Detection (iSTD) task
Spoken Content Retrieval (SCR) task
Scoring tool for the STD and iSTD tasks
Scoring tool for the SCR task
IR and Term Extraction/Role Analysis Test Collections
The IR test collection includes document data (author abstracts from the Academic Conference Paper Database, 1988-1997: abstracts of papers presented at academic conferences hosted by any of 65 academic societies in Japan; about 330,000 documents, more than half of them English-Japanese paired), 83 search topics (Japanese), and relevance judgements.
The collection can be used for retrieval experiments in Japanese text retrieval and in CLIR, searching either English documents or Japanese-English documents using Japanese topics.
The term extraction test collection includes a tagged corpus based on 2,000 Japanese documents selected from the above IR test collection.
SUMM: (Text Summarization Test Collection)
The collection includes:
Document data: Japanese newspaper articles (Mainichi Newspaper, 1998-1999)
Model summaries: single-document summaries (for each of 60 documents, 7 types of single-document summaries of different lengths, prepared with different strategies by 3 analysts) and multi-document summaries (for each of 50 document collections, summaries of 2 lengths, prepared by 3 analysts)
WEB (IR Test Collection)
WEB test collection consists of "Document Data" which is a collection of text data processed from the crawled documents provided mainly on "Web servers of Japan" and "Task Data" which is a collection of search topics and the relevance judgments of the documents.
"Task Data" consists of 400 mandatory topics and 841 optional topics for 'Navigational Retrieval (Navi 2)'. "Document Data", named 'NW1000G-04', consists of approximately 1,400 GB of web documents, about 100 million in number.
The CLEF Test Suite
The CLEF Test Suite contains the data used for the main tracks of the CLEF campaigns carried out from 2000 to 2003:
Multilingual text retrieval
Bilingual text retrieval
Monolingual text retrieval
Domain-specific text retrieval
The CLEF Test Suite
The CLEF Test Suite is composed of:
• The multilingual document collections
• Step-by-step documentation on how to perform a system evaluation (in English)
• Tools for results computation
• Multilingual sets of topics
• Multilingual sets of relevance assessments
• Guidelines for participants (in English)
• Tables of the results obtained by the participants
• Publications
The CLEF Test Suite
Multilingual corpora:
• English
• French
• German
• Italian
• Spanish
• Dutch
• Swedish
• Finnish
• Russian
• Portuguese
CLEF AdHoc-News Test Suites (2004-2008)
The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
CLEF Domain Specific Test Suites (2004-2008)
The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
CLEF Question Answering Test Suites (2003-2008)
The CLEF Question Answering Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008.
This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.
Reuters Corpora
Reuters is the largest international text and television news agency. Its editorial division produces 11,000 stories a day in 23 languages.
Stories are both distributed in real time and made available via online databases and other archival products.
Datasets:
Reuters-21578: used in text classification
RCV1
RCV2
TRC2
RCV1
In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems.
Known as "Reuters Corpus, Volume 1" or RCV1, it is significantly larger than the older Reuters-21578 collection.
RCV1 is an archive of over 800,000 manually categorized newswire stories.
RCV1 is drawn from one of those online databases. It was intended to consist of all and only the English-language stories produced by Reuters journalists between August 20, 1996 and August 19, 1997.
RCV2
A multilingual corpus covering 1996-08-20 to 1997-08-19.
Contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish).
Thomson Reuters Text Research Collection (TRC2)
The TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 to 2009-02-28.
It was initially made available to participants of the 2009 Blog Track at the Text Retrieval Conference (TREC), to supplement the BLOGS08 corpus (which contains the results of a large blog crawl carried out at the University of Glasgow).
20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
It was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection there.
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
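A minimal sketch of the kind of text classification experiment the collection supports: a toy multinomial naive Bayes classifier, trained here on hypothetical mini-posts rather than the real corpus (a real run would use the ~11,000 training posts and all 20 group labels):

```python
import math
from collections import Counter, defaultdict

# Hypothetical two-class training data standing in for newsgroup posts.
train = [
    ("rec.autos", "new engine and wheels for my car"),
    ("rec.autos", "car engine oil change"),
    ("sci.space", "orbit of the space shuttle"),
    ("sci.space", "nasa launch orbit telemetry"),
]

word_counts = defaultdict(Counter)  # per-class word frequencies
class_counts = Counter()            # class priors
vocab = set()
for label, text in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    best, best_score = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict("car engine repair"))  # rec.autos
```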
20 Newsgroups
Class # train docs # test docs Total # docs
alt.atheism 480 319 799
comp.graphics 584 389 973
comp.os.ms-windows.misc 572 394 966
comp.sys.ibm.pc.hardware 590 392 982
comp.sys.mac.hardware 578 385 963
comp.windows.x 593 392 985
misc.forsale 585 390 975
rec.autos 594 395 989
rec.motorcycles 598 398 996
rec.sport.baseball 597 397 994
rec.sport.hockey 600 399 999
sci.crypt 595 396 991
sci.electronics 591 393 984
sci.med 594 396 990
sci.space 593 394 987
soc.religion.christian 598 398 996
talk.politics.guns 545 364 909
talk.politics.mideast 564 376 940
talk.politics.misc 465 310 775
talk.religion.misc 377 251 628
Total 11293 7528 18821