Copyright © 2004 Sung Hyon Myaeng
Information Retrieval Tutorial
2004. 2. 13
Information & Communications University, IR & NLP Lab
http://ir.icu.ac.kr
Sung Hyon Myaeng
Outline
What is Information Retrieval (IR)?
Overview of Core IR Technology
Overall Directions
IR Expanded
  CLIR/MLIR
  Classification
  Topic Detection & Tracking
  Recommender Systems
  Summarization
  Question Answering
  Information Extraction
What is IR?
Traditional IR: Willow System
What is IR?
Google Web Search Engine
What is IR?
Ask Jeeves
IR & the Rest of the World
[Figure: Information Retrieval shown in relation to neighboring fields: AI, Cognitive Science, Linguistics, Natural Language Processing, Human Computer Interaction, Computer Science, Library & Info Science, Statistics, and DB]
Evaluation of IR Systems
effectiveness: “relevance”
  precision: A / (A + C)
  recall: A / (A + B)

               Ret    NOT Ret
  Rel           A        B
  NOT Rel       C        D

efficiency
Interactive systems? Others?
[Figure: precision (y-axis) vs recall (x-axis) curves for System A, System B, and System C; axis ticks run from 0 to 0.9]
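The two effectiveness measures can be computed directly from the contingency table. The following is a minimal sketch, assuming retrieved and relevant documents are given as collections of document IDs; the function name and the sample IDs are illustrative and not from the slides.

```python
def precision_recall(retrieved, relevant):
    """Compute precision = A/(A+C) and recall = A/(A+B) from two ID collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)  # A: relevant and retrieved
    b = len(relevant - retrieved)  # B: relevant but not retrieved
    c = len(retrieved - relevant)  # C: retrieved but not relevant
    precision = a / (a + c) if retrieved else 0.0
    recall = a / (a + b) if relevant else 0.0
    return precision, recall


# Hypothetical run: four documents retrieved, three documents actually relevant.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).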
Overview of Text Retrieval
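[Figure: text retrieval architecture. Text Processing: raw text => Text Analysis => Index. User/System Interaction: info needs => Analysis of Info Needs => Query. Search Engine: Matching (Inferencing) of Index against Query => Retrieval Result. Also shown: Knowledge Resources & Tools.]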
Text Processing (1) - Indexing
Extraction of index terms and computation of their weights
  Index terms represent document content and separate documents from one another
  “economy” vs “computer” in a news article of the Financial Times
Morphological analysis (stemming in English)
  “벨기에는” (“벨기” + “에는”?), “문서내의” (“문서” + “내의”, i.e. “within the document”)
  “information”, “informed”, “informs”, “informative”
  Rule-based vs dictionary-based
n-gram
  “정보검색시스템” (“information retrieval system”) => “_정”, “정보”, “보검”, “검색”, … (bi-gram)
  “부정사” (“infinitive”) vs “부정한 정사” (“illicit affair”) (similar enough in bi-gram!)
  Surprisingly effective in some languages
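The bi-gram indexing described above can be sketched in a few lines. This is an illustrative sketch only: the padding character, the treatment of spaces as boundaries, and the Dice coefficient used to compare two strings are assumptions, not something the slides specify.

```python
def char_ngrams(text, n=2, pad="_"):
    """Return overlapping character n-grams, marking word boundaries with `pad`."""
    padded = pad + text.replace(" ", pad) + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dice(a, b):
    """Dice coefficient between two n-gram lists (assumed similarity measure)."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))


grams = char_ngrams("정보검색시스템")
print(grams[:4])  # the first four bi-grams, as listed on the slide

# The two unrelated phrases from the slide share most of their bi-grams.
sim = dice(char_ngrams("부정사"), char_ngrams("부정한 정사"))
print(round(sim, 3))
```

The second call illustrates the caveat on the slide: every bi-gram of the shorter phrase also occurs in the longer one, so a bi-gram index treats the two as similar.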
Text Processing (2) – Storing indexing results
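[Figure: documents 1, 2, 3, 4, …, n, each shown as a set of terms (e.g. {A, B, E}, {A, C, F}, {C}, {F}, {A, D, G}, {B, G}), are mapped to a term-by-document matrix with rows A to G and columns 1 to n; a check mark (v) means the term occurs in that document. The matrix is labeled “Inverted index”.]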
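The mapping from documents to an inverted index can be sketched as follows. The document numbering is an assumption made for illustration (the figure's exact assignment is not recoverable); only the term sets come from the slide.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Term sets from the figure; the document IDs 1..6 are assumed.
docs = {
    1: ["A", "B", "E"],
    2: ["A", "C", "F"],
    3: ["C"],
    4: ["F"],
    5: ["A", "D", "G"],
    6: ["B", "G"],
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Each row of the printed result corresponds to one row of the term-by-document matrix, stored as a list of document IDs rather than as check marks.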
Text Processing (3) - Indexing
Use of various linguistic resources
  Dictionaries (noun, Josa, Eomi, bilingual, proper noun, foreign words, …)
    For extraction and weighting of index terms
  Thesaurus (e.g. WordNet)
    Controlled vocabulary indexing
    Matching similar and related words
  Tagged corpus
Most NLP technology is used for term extraction
  “Bag of words” approach
  Sense disambiguation? Word order?
Overview of Text Retrieval
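[Figure: text retrieval architecture. Text Processing: raw text => Text Analysis => Index. User/System Interaction: info needs => Analysis of Info Needs => Query. Search Engine: Matching (Inferencing) of Index against Query => Retrieval Result. Also shown: Knowledge Resources & Tools.]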
User/System Interaction – Query Models
Boolean
  AND, OR, NOT operators
    E.g.: (semi-conductor OR chip) AND stock NOT chocolate
  Adjacency, phrase operators
    E.g.: “stock exchange”, “그리고 아무 말도 하지 않았다” (“and said nothing”)
  Difficult for naïve users => visual query interface
Word list
  Vector space model system
    E.g.: (semi-conductor chip stock)
  Often interpreted as a Boolean query in search engines
    E.g.: (semi-conductor OR chip OR stock)
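Boolean operators map naturally onto set operations over posting lists. The sketch below evaluates the two example queries from this slide; the posting lists are invented for illustration, and NOT is treated as "AND NOT" (set difference), which is an assumption about how the slide's query is meant to be read.

```python
# Hypothetical posting lists: term -> set of document IDs containing it.
postings = {
    "semi-conductor": {1, 4},
    "chip": {2, 4, 5},
    "stock": {1, 2, 3, 5},
    "chocolate": {2},
}


def docs_with(term):
    """Return the posting set for a term, or an empty set if it is unindexed."""
    return postings.get(term, set())


# (semi-conductor OR chip) AND stock NOT chocolate
boolean_hits = ((docs_with("semi-conductor") | docs_with("chip"))
                & docs_with("stock")) - docs_with("chocolate")

# Word list (semi-conductor chip stock) interpreted as an OR query.
or_hits = docs_with("semi-conductor") | docs_with("chip") | docs_with("stock")

print(sorted(boolean_hits), sorted(or_hits))
```

The Boolean query is exact (a document either satisfies it or not), while the OR interpretation of the word list returns a much larger, unranked set.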
User/System Interaction – Query Models
“Natural Language” Query
  E.g.: “I want to get information about ski resorts in Kangwon-do or in the Chungcheong area.”
  Limitations in NLP
  Various tricks
Query Expansion
  To resolve mismatches between query terms and index terms for documents
  A variety of linguistic resources are used (e.g. synonyms, foreign word equivalence classes, bilingual dictionaries)
Guide users to follow step-by-step instructions for detailed queries
  “canned queries” (E.g.: “Ask Jeeves”)
  query templates
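Query expansion as described above can be sketched as a lookup in an equivalence dictionary. The dictionary contents and the function name here are hypothetical; a real system would draw the equivalents from a thesaurus, a foreign-word equivalence table, or a bilingual dictionary.

```python
def expand_query(terms, equivalents):
    """Append known equivalents after each query term, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + equivalents.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical equivalence classes (morphological variant, alternative
# romanization, and native-script form).
equivalents = {
    "resorts": ["resort"],
    "Kangwon-do": ["Gangwon-do", "강원도"],
}
expanded = expand_query(["ski", "resorts", "Kangwon-do"], equivalents)
print(expanded)
```

The expanded term list would then be matched against the index, so that a document indexed under "강원도" can still be found by a query that spells the place name "Kangwon-do".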
Ask Jeeves screen
User/System Interaction – Query Models
Relevance feedback
  “Similar Pages” in Web search engines
  From a simple query to better queries progressively
    Limited recall capability of human beings
    Recognition of a relevant document is much easier.
    Intended to ease the difficulty of grasping the statistical properties of the entire collection
  An indirect way of capturing the user needs
User profile
  To reflect the user’s interest and orientation in interpreting user queries
  Need to gather & analyze user log data and learn user models
User/System Interaction – Result Presentation
Information overload problem – too many retrieved
  A simple ranked list - title, author, URL, date, …
Method 1: Organizing the retrieved documents
  Result clustering (E.g.: Vivisimo)
  “Zoom-in” operation (E.g.: Scatter & Gather)
Method 2: Visualizing the retrieved documents
  Overview of a large amount of information
  Visual expression of document properties
  E.g.: TileBar
Scatter/Gather
Tile Bar
Result Clustering
Text Retrieval Overview
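[Figure: text retrieval architecture. Text Processing: raw text => Text Analysis => Index. User/System Interaction: info needs => Analysis of Info Needs => Query. Search Engine: Matching (Inferencing) of Index against Query => Retrieval Result. Also shown: Knowledge Resources & Tools.]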
Matching & Ranking (1)
Inverted File, …
[Figure: inverted file. A directory lists terms (가구 “furniture”, 가야 “Gaya”, …, 신라 “Silla”, …, 호랑이 “tiger”) with a weight (Wt) and a pointer for each term; the pointers lead into a posting file of document numbers, which identify the documents (Doc #1, Doc #2, Doc #5, ...) to be matched against the query.]
Matching & Ranking (2)
Ranking
Retrieval Model
  Boolean (exact) => Fuzzy Set (inexact)
  Vector Space
  Probabilistic
  Inference Net
  …
Weighting Schemes
  Index terms, query terms
  Parameters in formulas
  Document characteristics
  …
IR Model Example: Vector Space Model
<DOC 1> ... cat ... dog ... dog ... mouse ... dog ... mouse ...
Q = < cat, mouse, 0 >
Di = (di1, di2, ... , din)
Q = (q1, q2, ... , qn)
Similarity = (Di . Q) / (|Di| * |Q|)
[Figure: D1 and Q drawn as vectors in a space whose axes are cat, dog, and mouse]
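The similarity formula on this slide is the cosine of the angle between the document and query vectors. The sketch below applies it to the slide's example, assuming raw term counts as weights (cat once, dog three times, mouse twice in DOC 1) and a binary query over cat and mouse; the slides do not state which weighting is intended.

```python
import math


def cosine(d, q):
    """Similarity = (D . Q) / (|D| * |Q|) for two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Vector components are ordered (cat, dog, mouse); raw counts are an assumption.
d1 = (1, 3, 2)
q = (1, 0, 1)
score = cosine(d1, q)
print(round(score, 3))
```

Because the document mentions "dog" often and the query does not, the vectors point in noticeably different directions and the score stays well below 1.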
Matching & Ranking (3)
Techniques for efficiency
  New storage structures, esp. for new document types
  Use of accumulators for efficient generation of ranked output
  Compression/decompression of indexes
Techniques for Web search engines
  Use of hyperlinks
    Inlinks & outlinks
    Authority vs hub pages
  In conjunction with directory services (e.g. Yahoo)
  Softbot – storing terabytes of data and efficient crawling
  ...
Web document retrieval – using hyperlinks
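[Figure: a query term produces an initial retrieval set; hyperlinks lead to candidates for additional retrieval, and the documents are ranked again using the link information. A, B, and C are example documents.]
A: Hub document
B: Authority document
Increase the weight of A, B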
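One well-known way to compute hub and authority weights from link structure is an iterative scheme in the style of Kleinberg's HITS algorithm. The slides do not name the algorithm they have in mind, so the following is only a sketch under that assumption; the link graph and the extra node D are made up for illustration.

```python
import math


def hits(links, iterations=20):
    """Iteratively compute hub and authority scores for a directed link graph."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {n: sum(hub[s] for s, targets in links.items() if n in targets)
                for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # A page is a good hub if it points to good authorities.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical graph: A links to B and C; C and D also link to B.
links = {"A": ["B", "C"], "C": ["B"], "D": ["B"]}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph A ends up as the strongest hub and B as the strongest authority, matching the roles the slide assigns to documents A and B.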
Characteristics of IR - summary
Information Retrieval/Data Retrieval Spectrum (Unstructured vs Structured)

                      Information Retrieval      Data Retrieval
Retrieval Models      Probabilistic              Deterministic
Indexing              Derived from contents      Complete Items
Matching/Retrieval    Partial or “Best” Match    Exact Match
Query Types           Natural Language           Structured
Results Criteria      Relevance                  Any Match
Results Ordering      Ranked                     Arbitrary
Overall Directions (1)
Efforts to improve retrieval effectiveness (as always!)
  Retrieval model, text analysis and representation, user interactions, ...
Specialized Search: domain-specific
Context awareness (personalization, task-centered)
  profile, session logs, task models, etc.
Multi-something
  multimedia, multi-style, multilingual, …
Distributed environment with a large quantity
  Web search, meta-search, distributed retrieval (DB segmentation), meta-data retrieval, semantic Web
New functionality
  filtering, TDT, classification, summarization, QA, information extraction, ...
Cross-Language & Multilingual IR
Cross-language IR
  Using a language (mother tongue) to retrieve documents in another language
  To overcome the language differences
  [terminology] cross-lingual, translingual (DARPA)
Multilingual IR
  “retrieving relevant document in any of the languages contained in a multilingual document collection”
  (CL+ | ML) Document Retrieval
  E.g.: Using Korean queries to search a DB consisting of Korean, English and Japanese documents
CLIR
The number of documents in languages other than one's own is rapidly increasing, and so is the need for retrieval.
  The rate of annual increase for documents on the Web
    English: 50%; all other languages: 90%
Multilingual countries, organizations, enterprises, & users
The limitation of machine translation technology
  More economical to translate the necessary documents after retrieval
Not easy to construct a query in a foreign language even with the ability to comprehend written materials in the language
  reading vs. writing
CLIR is fundamental to other multilingual information access technology
CLIR Problem (example)
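Query: 현대자동차 주식 동향 (“Hyundai Motor stock trends”)
Candidate translations for each query word:
  현대자동차 => ?? (segmentation: “현 대자 동차” or “현대 자동차”)
  주식 => principal food / stocks / food and drink
  동향 => same village / eastern exposure / trend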
Retrieval of Structured Documents
XML documents, hypertext, metadata, semantic Web
Queries for structure and content
  FIND a document that INCLUDES a chapter whose title CONTAINS the term “hypertext” AND whose section CONTAINS the term “browsing”.
Queries for content and link
  FIND all documents about “information retrieval” that are referred to by a paper written by “Myaeng”.
Retrieval with ontology vs retrieval from ontology (e.g. RQL)
Classification - Motivation
Sender: [email protected]
Subject: Your Business Listing - Global Trade Index
Date: 2003-09-03 (Wed) 6:40 AM
Size: 6 KB

Dear Site Owner,
You are invited to list your site at the most important Trade Directory on the Internet. This directory system has attributes no other directory on the Internet has had do date, check us out! Manufacturers - Wholesalers - Distributors - Resellers and all businesses that are associated are welcome on our directory.
Your business will prosper from its association with our global resources! There are NO CHARGES for a listing. Just click here and enter your business details as you like. Thank you for your kind assistance in this matter,
The Team
[email protected]
Toll Free - 866 516-8412
Is this spam?
Classification - Motivation
Re: 요청하신 자료입니다. (“Re: Here is the material you requested.”)
그렇게 가면 안되지. (“You can't just leave like that.”)
Get V^iagram in the convenience of your home
Generi.c Cia.lis – Lasts 2 times longer then Via.gra!
Re,no-va”te”*”you:r ‘d*ow_nst>;airs;^-
How about these subject lines?
Classification Problem
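E-mail classification
  Given an e-mail message containing a question and/or complaint, where should it be sent in ERMS?
  Categories:
    AS, subscription/unsubscription, passwords, upgrades, usage, about-products, other questions, other complaints
A not-so-easy example (informal Korean with slang spellings):
  Subject: 아뒤랑 비번를 잊어먹었슴다. (“I forgot my ID and password.”)
  Body: 안녕하세여… 오랫만에 홈에 들렀더니… 제가 그만 아뒤랑 pass를 잊어버렸네요 음… (“Hello… I dropped by the homepage after a long while… and it seems I have forgotten my ID and pass, hmm…”)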
Problem Statement
Given:
  A description of an instance, x ∈ X, where X is the instance language or instance space.
  A fixed set of categories: C = {c1, c2, …, cn}
Determine:
  The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
How to represent text documents?
How to build categorization functions (“classifiers”)?
A Typical Example for Document Classification
Classes:
  (AI): ML, Planning
  (Programming): Semantics, Garb.Coll.
  (HCI): Multimedia, GUI
Training data:
  ML: learning, intelligence, algorithm, reinforcement, network, ...
  Planning: planning, temporal, reasoning, plan, language, ...
  Semantics: programming, semantics, language, proof, ...
  Garb.Coll.: garbage, collection, memory, optimization, region, ...
  Multimedia, GUI: ...
Testing data:
  “planning language proof intelligence”
Classification Methods
Rule-based Methods
  E.g.: assign a category if the document contains a given Boolean combination of words
  Accuracy is often very high if a profile has been carefully refined over time by a subject expert.
  Building and maintaining these profiles is expensive.
Inductive Learning Models
  Naïve Bayesian Model
  Decision Tree Model
  SVM (Support Vector Machine)
Similarity-Based Models
  K-Nearest Neighbor
  Rocchio’s Model
These require hand-classified training data, but can be built (and refined) easily.
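As one concrete instance of the inductive learning models listed above, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is trained on the word lists from the earlier document classification example, treating each list as a single training document with uniform class priors; both of these are simplifying assumptions, and the function names are illustrative.

```python
import math
from collections import Counter


def train_nb(training):
    """Estimate smoothed log P(word | class) from one word list per class."""
    vocab = {w for words in training.values() for w in words}
    model = {}
    for label, words in training.items():
        counts = Counter(words)
        denom = len(words) + len(vocab)  # add-one (Laplace) smoothing
        model[label] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
    prior = math.log(1 / len(training))  # uniform prior (assumption)
    return model, prior, vocab


def classify_nb(text, model, prior, vocab):
    """Return the class with the highest posterior log-probability."""
    tokens = [w for w in text.split() if w in vocab]
    scores = {label: prior + sum(logp[w] for w in tokens)
              for label, logp in model.items()}
    return max(scores, key=scores.get)


training = {
    "Planning": "planning temporal reasoning plan language".split(),
    "Semantics": "programming semantics language proof".split(),
    "ML": "learning intelligence algorithm reinforcement network".split(),
    "Garb.Coll.": "garbage collection memory optimization region".split(),
}
model, prior, vocab = train_nb(training)
predicted = classify_nb("planning language proof intelligence", model, prior, vocab)
print(predicted)
```

With this tiny training set the test document is genuinely ambiguous: "Planning" and "Semantics" each match two of its four words, and "Semantics" wins only by a small margin that comes from smoothing over a shorter training document. Real classifiers need far more hand-classified data.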
Topic Detection & Tracking (TDT) (1)
Event: A reported occurrence at a specific time and place, and the unavoidable consequences.
Specific elections, accidents, crimes, natural disasters. “TWA-800 airplane crash” vs. “airplane accidents”
Activity: A connected set of actions that have a common focus or purpose
campaigns, investigations, disaster relief efforts
Topic: a seminal event or activity, along with all directly related events and activities
Story: a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single topic
TDT (2) – First Story Detection
Automatically identify the first story on a new event from a stream of text
To detect the first story that discusses a topic, for all topics.
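[Figure: a timeline of stories belonging to Topic 1 and Topic 2; the earliest story of each topic is marked as a first story, and all later ones as not first stories]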
TDT (3) – More about FSD
First story detection is an unsupervised learning task. There is no supervised training.
On-line vs. Retrospective
  On-line: Flag onset of new events from live news feeds as stories come in
  Retrospective: Detection consists of identifying the first story looking back over a longer period
Lack of advance knowledge of new events, but have access to unlabeled historical data as a contrast set
Applications
  Intelligence services
  Finance: Be the first to trade a stock
TDT (4) – Other Tasks
Topic Tracking
  Once a topic has been detected, identify subsequent stories about it
  Standard text classification task
  However, very small training set (initially: 1!)
Topic Detection
  Grouping stories from an accumulated collection
  Event-based classification with multiple topics (events)
  Retrospective
TDT (5)
Characteristics of Events
  News articles on an event are temporally close to each other.
    lexical and temporal similarities
  “Similar” news articles over an extended time period => different events
    Use a time window to determine the scope of an event
  Changes in the used terms and their frequencies => new event
Use of clustering techniques (example)
  Retrospective: bottom-up clustering with a time window (TD)
  Online (Tracking)
    single-pass, incremental clustering
    incremental IDF
    use of a decaying function for old documents in computing the similarity
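The online approach above (single pass, time window, decaying similarity) can be sketched as a simple first story detector. This is an illustrative sketch, not the method of any particular TDT system: the threshold, window, and decay values are arbitrary, term weights are raw counts, and incremental IDF is left out for brevity.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def first_stories(stream, threshold=0.3, window=3, decay=0.1):
    """Single pass over (day, text) pairs; True marks a first story."""
    seen, flags = [], []
    for day, text in stream:
        vec = Counter(text.lower().split())
        best = 0.0
        for old_day, old_vec in seen:
            age = day - old_day
            if age > window:  # outside the time window: cannot be the same event
                continue
            # Older stories count less: similarity decays with age.
            best = max(best, cosine(vec, old_vec) * math.exp(-decay * age))
        flags.append(best < threshold)
        seen.append((day, vec))
    return flags


# Hypothetical news stream: (day, story text).
stream = [
    (1, "plane crash off long island"),
    (2, "investigators search plane crash site"),
    (2, "election results announced in capital"),
    (9, "plane crash in mountain region"),
]
flags = first_stories(stream)
print(flags)
```

The last story shares words with the first two but arrives well outside the time window, so it is flagged as the first story of a different event, which is the behavior the slide's time-window remark describes.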
Recommender Systems (1)
Recommender systems are a technological proxy for a social process
  We rely on recommendations from other people.
  An information discovery model where people try to find other people with similar tastes and then ask them to suggest new things
In a typical recommender system
  People provide recommendations (input)
  The system aggregates and directs to appropriate recipients.
Recommender Systems (2)
Motivations
  Can we automatically aggregate quotes like:
    "I like this book; you might be interested in it"
    "I saw this movie, you’ll like it"
    "Don’t go see that movie!"
  Finding new books, music, or movies, previously unknown to users
Applications
  Corporate Intranets
    Recommendations, finding domain experts, …
  Ecommerce
    Product recommendations – amazon, CDNOW, …
  Medical Applications
    Matching patients to doctors, clinical trials, …
  Customer Relationship Management
    Matching customer problems to internal experts in a support organization
Recommender Systems (3) - Types
Collaborative/Social-filtering system – aggregation of consumers’ preferences and recommendations to other users based on similarity in behavioral patterns
Content-based system – supervised machine learning used to induce a classifier to discriminate between interesting and uninteresting items for the user
Knowledge-based system – knowledge about users and products used to reason what meets the user’s requirements, using discrimination tree, decision support tools, case-based reasoning (CBR)
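The collaborative (social-filtering) type can be sketched as user-based filtering: find users whose past ratings resemble the target user's, then recommend what they rated highly. The ratings, the user and item names, and the choice of cosine similarity over co-rated items are all assumptions made for illustration.

```python
import math


def similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm = (math.sqrt(sum(a[i] ** 2 for i in shared))
            * math.sqrt(sum(b[i] ** 2 for i in shared)))
    return dot / norm if norm else 0.0


def recommend(user, ratings, top_n=1):
    """Rank unseen items by the similarity-weighted average of others' ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "alice": {"book_a": 5, "book_b": 1},
    "bob": {"book_a": 5, "book_b": 1, "book_c": 5, "book_d": 1},
    "carol": {"book_a": 1, "book_b": 5, "book_c": 1, "book_d": 5},
}
top = recommend("alice", ratings)
print(top)
```

Bob's ratings match Alice's exactly on the shared items, so his opinion of the unseen books dominates and the item he liked is recommended.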
Recommender Systems (4) - Example
Automatic Summarization (1)
Functionality
  indicative: to determine if the document would be of any interest
  informative: to reflect the original content as faithfully as possible under the compression rate
  evaluative: evaluation of the original document
Fluency
  fragmented vs connected text
Users
  generic vs user (query)-focused
Target Documents
  single vs multiple documents
Automatic Summarization (2)
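[Figure: summarization pipeline. Source(s) => Analysis => Selection => Condensation => Presentation => Summary, with an intermediary representation between the stages.]
  Analysis: word frequencies, clue phrases, layout, syntax, semantics, discourse, pragmatics
  Selection: word count, clue phrases, statistical, structural
  Condensation: abstraction, aggregation
  Presentation: planning, realization, layout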
Automatic Summarization (3) - Approaches
Natural language understanding / generation
  Build knowledge representation of text
  Generate sentences summarizing content
  Hard to do well
Keyword summaries
  Display most significant keywords
  Easy to do
  Hard to read, poor representation of content
Sentence extraction
  Extract key sentences
  Medium hard
  Summaries often don’t read well
  Good representation of content
Question Answering (Q/A) (1)
To provide an answer to a query, as opposed to a document
  Query: for factoid or exact answer
  Result: Ranked list of <document, answer string> pairs
    Answer string: 50-250 bytes
    Documents: to supplement the answer
Examples from TREC-9
  How much folic acid should an expectant mother get daily?
  Who invented the paper clip?
  Where is Rider College located?
  Name a film in which Jude Law acted.
Question Answering (2)
System Flow (example)
[Figure: system flow with the components Query Analyzer, Doc Analyzer, Retrieval Engine, and Answer Extractor, and the data items User Query, Query Set, Query Categories, Rule Set, Thesaurus, Document Set, Retrieved Docs, Candidates, and Answer]
Question Answering (3)
Assignment of query categories
  Query: What is the fare cost for the round trip between New York and London on Concorde?
  Rule applied: What [be] [ADJ] [NOUN] for
  Main phrase extracted: fare cost
  Categorization of the phrase: Financial loss
  => Assign a query category
Information Extraction
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Extracting Job Openings from the Web
Information Extraction (2)
Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources
Identify specific pieces of information in an unstructured or semi-structured textual document.
Transform this unstructured information into structured relations in a database/ontology.
Example (translated from Korean):
  Source text: “... Terror in Tokyo, Japan ... on the afternoon of April 11 ... the Minister of Commerce ... Tokyo ... homemade bomb ... 5 passers-by seriously injured ... 2 cars ... damaged ... In the case of Germany ...”
  Extracted template (stored in the DB):
    Event: terror
    Date/time: 4.12, afternoon
    Location: Tokyo
    Target: Minister of Commerce
    Casualties: 5 seriously injured
    Property damage: 2 cars
    ...
Information Extraction (3) – flow (example)
[Figure: document => local text analysis (lexical analysis, name recognition, partial syntactic analysis, scenario pattern matching) => discourse analysis (coreference analysis, inference) => template generation => extracted templates]
[Grishman, 1997]
Information Extraction: MUC (State of the Art – 1997)
NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production
Knowledge Extraction Vision
[Figure: knowledge extraction vision]
Components: Topic Discovery, Concept Indexing, Thread Creation, Term Translation, Document Translation, Story Segmentation, Entity Extraction, Fact Extraction
Multi-dimensional Meta-data Extraction (timeline: J F M A M J J A)
Example topic: India Bombing; sources: NY Times, Andhra Bhoomi, Dinamani, Dainik Jagran
Meta-Data
  EMPLOYEE / EMPLOYER Relationships:
    Jan Clesius works for Clesius Enterprises
    Bill Young works for InterMedia Inc.
  COMPANY / LOCATION Relationships:
    Clesius Enterprises is in New York, NY
    InterMedia Inc. is in Boston, MA
References
Belkin, N.J. & Croft, W.B. (1987). Retrieval Techniques. In: Williams, M.E. (ed.), Annual Review of Information Science and Technology 22, 109-145, New York: Elsevier & ASIS.
E. Glover, S. Lawrence, M. Gordon, W. Birmingham, C. Lee Giles (2001). Web Search – Your Way. Comm. of the ACM, 44 (12), 97-102.
Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.
Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.
James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.
Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms”, http://citeseer.nj.nec.com/sarwar01itembased.html
Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”
Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag. (See http://nlp.cs.nyu.edu/publication/index.shtml)
Andrew McCallum and Kamal Nigam (1998). “A Comparison of Event Models for Naive Bayes Text Classification.” In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.