Information Retrieval Tutorial

Sung Hyon Myaeng (맹성현)
IR & NLP Lab, Information & Communications University
http://ir.icu.ac.kr
2004. 2. 13

Copyright © 2004 Sung Hyon Myaeng


2

Outline

What is Information Retrieval (IR)?
Overview of Core IR Technology
Overall Directions
IR Expanded
  CLIR/MLIR
  Classification
  Topic Detection & Tracking
  Recommender Systems
  Summarization
  Question Answering
  Information Extraction


3

What is IR?

Traditional IR: Willow System


4

What is IR?

Google Web Search Engine


5

What is IR?

Ask Jeeves


6

IR & the Rest of the World

[Figure: Information Retrieval and its related fields: AI, Cognitive Science, Linguistics, Natural Language Processing, Human-Computer Interaction, Computer Science, Library & Info Science, Statistics, DB]


7

Evaluation of IR Systems

Effectiveness: “relevance”
  precision: A / (A + C)
  recall: A / (A + B)

             Ret    NOT Ret
  Rel         A        B
  NOT Rel     C        D

Efficiency
Interactive systems? Others?

[Figure: precision (y-axis) plotted against recall (x-axis) for System A, System B, and System C]
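A minimal sketch of the two effectiveness measures, using the cells A, B, and C of the table on this slide. The function name and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    A = relevant and retrieved, B = relevant but not retrieved,
    C = retrieved but not relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)
    b = len(relevant - retrieved)
    c = len(retrieved - relevant)
    precision = a / (a + c) if retrieved else 0.0
    recall = a / (a + b) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs: 4 documents retrieved, 5 documents relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Here A = 2, B = 3, and C = 2, so half of what was retrieved is relevant, but only two fifths of the relevant documents were found.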


8

Overview of Text Retrieval

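[Figure: text retrieval overview. Text Processing: raw text -> Text Analysis -> Index. User/System Interaction: info needs -> Analysis of Info Needs -> Query. The Search Engine performs Matching (Inferencing) between Index and Query, drawing on Knowledge Resources & Tools, and produces the Retrieval Result.]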


9

Text Processing (1) - Indexing

Extraction of index terms and computation of their weights
  Index terms: represent document content & separate documents
    “economy” vs “computer” in a news article of the Financial Times

Morphological analysis (stemming in English)
  Korean: “벨기에는” [“Belgium” plus a particle] (“벨기” + “에는”?), “문서내의” [“within the document”] (“문서” + “내의”)
  English: “information”, “informed”, “informs”, “informative”
  Rule-based vs dictionary-based

n-gram
  “정보검색시스템” [“information retrieval system”] => “_정”, “정보”, “보검”, “검색”, … (bi-gram)
  “부정사” [“infinitive”] vs “부정한 정사” [“illicit affair”] (similar enough in bi-gram!)
  Surprisingly effective in some languages
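The bi-gram idea can be sketched in a few lines. The leading pad symbol "_" follows the slide's example; dropping spaces and comparing strings with the Dice coefficient are assumptions made here for illustration, not something the slide specifies.

```python
def char_bigrams(text, pad="_"):
    """Character bi-grams of `text`, ignoring spaces, with a leading pad symbol."""
    s = pad + text.replace(" ", "")
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Dice coefficient between the bi-gram sets of two strings (an assumed measure)."""
    x, y = set(char_bigrams(a)), set(char_bigrams(b))
    return 2 * len(x & y) / (len(x) + len(y))


print(char_bigrams("정보검색시스템"))
print(dice("부정사", "부정한 정사"))  # 0.75
```

Every bi-gram of “부정사” also occurs in “부정한 정사”, which is why the two unrelated expressions look similar to a bi-gram index.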


10

Text Processing (2) – Storing indexing results

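[Figure: documents 1, 2, 3, 4, …, n, each containing some of the terms A to G, and the corresponding term-by-document matrix (rows: terms A to G; columns: documents 1 … n; “v” marks that a term occurs in a document), which is stored as an inverted index]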
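As a rough sketch, an inverted index maps each term to the documents that contain it. The toy collection below is made up for illustration and is not a reconstruction of the figure.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents and terms.
docs = {1: ["A", "B", "E"], 2: ["A", "C", "F"], 3: ["C", "F"], 4: ["A", "D", "G"]}
index = build_inverted_index(docs)
print(index["A"])  # [1, 2, 4]
```

Looking up a term is then a single dictionary access instead of a scan over all documents.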


11

Text Processing (3) - Indexing

Use of various linguistic resources
  Dictionaries (noun, Josa [particles], Eomi [endings], bilingual, proper noun, foreign words, …)
    For extraction and weighting of index terms
  Thesaurus (e.g. WordNet)
    Controlled vocabulary indexing
    Matching similar and related words
  Tagged corpus

Most NLP technology is used for term extraction
  “Bag of words” approach
  Sense disambiguation? Word order?


12

Overview of Text Retrieval


[Figure: the text retrieval overview diagram again: raw text -> text analysis -> index; info needs -> query; search engine -> retrieval result]


13

User/System Interaction – Query Models

Boolean
  AND, OR, NOT operators
    E.g.: (semi-conductor OR chip) AND stock NOT chocolate
  Adjacency, phrase operators
    E.g.: “stock exchange”, “그리고 아무 말도 하지 않았다” [“And said nothing”]
  Difficult for naïve users
  Visual query interface

Word list
  Vector space model system
    E.g.: (semi-conductor chip stock)
  Often interpreted as a Boolean query in search engines
    E.g.: (semi-conductor OR chip OR stock)
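One way to picture Boolean matching is as set operations over posting lists. The postings below are hypothetical, and reading "NOT chocolate" as set difference is an assumption about how the example query is meant.

```python
# Hypothetical postings: term -> set of IDs of documents containing it.
postings = {
    "semi-conductor": {1, 2},
    "chip": {2, 3, 5},
    "stock": {2, 3, 4},
    "chocolate": {3},
}


def docs_with(term):
    """Posting set for a term (empty if the term is not indexed)."""
    return postings.get(term, set())


# (semi-conductor OR chip) AND stock NOT chocolate
boolean_hits = ((docs_with("semi-conductor") | docs_with("chip"))
                & docs_with("stock")) - docs_with("chocolate")

# The word list "semi-conductor chip stock" read as an OR query
or_hits = docs_with("semi-conductor") | docs_with("chip") | docs_with("stock")

print(sorted(boolean_hits), sorted(or_hits))  # [2] [1, 2, 3, 4, 5]
```

The contrast between the two result sets shows why a word list interpreted as OR tends to retrieve far more than a carefully written Boolean query.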


14


“Natural Language” Query
  E.g.: “I want to get information about ski resorts in Kangwon-do or in the Chungcheong area.”
  Limitations in NLP
  Various tricks

Query Expansion
  To resolve mismatches between query terms and index terms for documents
  A variety of linguistic resources are used (e.g. synonyms, foreign word equivalence classes, bilingual dictionaries)

Guide users to follow step-by-step instructions for detailed queries
  “canned queries” (E.g.: “Ask Jeeves”)
  query templates


15

Ask Jeeves screen


16


Relevance feedback
  “Similar Pages” in Web search engines
  From a simple query to better queries progressively
    Limited recall capability of human beings
    Recognition of a relevant document is much easier.
  Intended to ease the difficulty of grasping the statistical properties of the entire collection
  An indirect way of capturing the user needs

User profile
  To reflect user’s interest and orientation in interpreting user queries
  Need to gather & analyze user log data and learn user models


17

User/System Interaction – Result Presentation

Information overload problem: too many retrieved documents
  A simple ranked list: title, author, URL, date, …

Method 1: Organizing the retrieved documents
  Result clustering (E.g.: Vivisimo)
  “Zoom-in” operation (E.g.: Scatter & Gather)

Method 2: Visualizing the retrieved documents
  Overview of a large amount of information
  Visual expression of document properties
  E.g.: TileBar


18

Scatter/Gather


19

Tile Bar


20

Result Clustering


21

Text Retrieval Overview


[Figure: the text retrieval overview diagram again: raw text -> text analysis -> index; info needs -> query; search engine -> retrieval result]


22

Matching & Ranking (1)

Inverted File, …

[Figure: inverted file structure. The directory has the columns Terms, Wt, and Pointers, with example terms 가구 (“furniture”), 가야 (“Gaya”), …, 신라 (“Silla”), …, 호랑이 (“tiger”) and example weights such as 0.7, 0.9, and 0.6. The pointers lead into a posting file that lists document numbers (Doc #1, Doc #2, Doc #5, ...). The query is looked up in the directory.]


23


Ranking
  Retrieval Model
    Boolean (exact) => Fuzzy Set (inexact)
    Vector Space
    Probabilistic
    Inference Net
    …
  Weighting Schemes
    Index terms, query terms
    Parameters in formulas
    Document characteristics
    …


24

IR Model Example: Vector Space Model

<DOC 1> ... cat ... dog ... dog ... mouse ... dog ... mouse ...

Q = < cat, mouse, 0 >

$D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$
$Q = (q_1, q_2, \ldots, q_n)$
$\mathrm{Similarity}(D_i, Q) = \dfrac{D_i \cdot Q}{|D_i| \, |Q|}$

[Figure: D1 and Q drawn as vectors in a space whose axes are dog, mouse, and cat]
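The similarity formula can be worked through on the slide's own example. The sketch assumes raw term counts as weights and the axis order (cat, mouse, dog); real systems usually apply a weighting scheme such as tf-idf instead.

```python
import math


def cosine(d, q):
    """Cosine similarity: D . Q / (|D| * |Q|)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


# Axes: (cat, mouse, dog). Counts in <DOC 1>: cat 1, mouse 2, dog 3.
d1 = (1, 2, 3)
q = (1, 1, 0)  # the query contains cat and mouse, but not dog
print(round(cosine(d1, q), 3))  # 0.567
```

The dot product is 3 and the norms are sqrt(14) and sqrt(2), so the similarity is 3 / sqrt(28). The frequent but unqueried term "dog" pulls D1 away from Q.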


25


Techniques for efficiency
  New storage structures, esp. for new document types
  Use of accumulators for efficient generation of ranked output
  Compression/decompression of indexes

Techniques for Web search engines
  Use of hyperlinks
    Inlinks & outlinks
    Authority vs hub pages
  In conjunction with Directory Services (e.g. Yahoo)
  Softbot: storing terabytes of data and efficient crawling
  ...


26

Web document retrieval – using hyperlinks

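[Figure: a query TERM retrieves an initial retrieval set; linked pages are candidates for additional retrieval; the expanded set is to be ranked again using the link information. Nodes A, B, and C are shown.]
A: Hub document
B: Authority document
Increase the weight of A, B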
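Hub and authority weights can be computed in several ways. The sketch below uses Kleinberg's HITS iteration, a well-known method that these slides do not name explicitly, so treat the choice as an assumption. The link graph is hypothetical.

```python
import math


def hits(links, iterations=20):
    """Hub and authority scores by repeated mutual reinforcement.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auth = {p: sum(hub[s] for s, ts in links.items() if p in ts) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: A links out to many pages, B is linked to by many.
links = {"A": ["B", "C", "D"], "C": ["B"], "D": ["B"]}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # A B
```

A page that points to good authorities becomes a good hub, and a page pointed to by good hubs becomes a good authority, which matches the roles of A and B on this slide.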


27

Characteristics of IR - summary

Information Retrieval/Data Retrieval Spectrum

                      Information Retrieval      Data Retrieval
  Retrieval Models    Probabilistic              Deterministic
  Indexing            Derived from contents      Complete Items
  Matching/Retrieval  Partial or “Best” Match    Exact Match
  Query Types         Natural Language           Structured
  Results Criteria    Relevance                  Any Match
  Results Ordering    Ranked                     Arbitrary

Unstructured vs Structured


28

Overall Directions (1)

Efforts to improve retrieval effectiveness (as always!)
  Retrieval model, text analysis and representation, user interactions, ...
Specialized search: domain-specific
Context awareness (personalization, task-centered)
  profile, session logs, task models, etc.
Multi-something
  multimedia, multi-style, multilingual, …
Distributed environment with a large quantity
  Web search, meta-search, distributed retrieval (DB segmentation), meta-data retrieval, semantic Web
New functionality
  filtering, TDT, classification, summarization, QA, information extraction, ...


29

Cross-Language & Multilingual IR

Cross-language IR
  Using a language (mother tongue) to retrieve documents in another language
  To overcome the language differences
  [terminology] cross-lingual, translingual (DARPA)

Multilingual IR
  “retrieving relevant document in any of the languages contained in a multilingual document collection”
  (CL+ | ML) Document Retrieval
  E.g.: Using Korean queries to search a DB consisting of Korean, English and Japanese documents


30

CLIR

The number of documents in languages other than one's own is rapidly increasing, and so is the need for retrieval.
  The rate of annual increase for documents on the Web
    English: 50%; all other languages: 90%
Multilingual countries, organizations, enterprises, & users
The limitation of machine translation technology
  More economical to translate the necessary documents after retrieval
Not easy to construct a query in a foreign language even with the ability to comprehend written materials in the language
  reading vs. writing
CLIR is fundamental to other multilingual information access technology


31

CLIR Problem (example)

Query: 현대자동차 주식 동향 [“Hyundai Motor stock trends”]
  동향: same village, eastern exposure, trend
  주식: principal food, stocks, food and drink
  현대자동차: ?? (segmentation: “현 대자 동차” vs “현대 자동차”)


32

Retrieval of Structured Documents

XML documents, hypertext, metadata, semantic Web
Queries for structure and content
  FIND a document that INCLUDES a chapter whose title CONTAINS the term “hypertext” AND whose section CONTAINS the term “browsing”.
Queries for content and link
  FIND all documents about “information retrieval” that are referred to by a paper written by “Myaeng”.
Retrieval with ontology vs retrieval from ontology (e.g. RQL)


33

Classification - Motivation

Sender: [email protected]
Subject: Your Business Listing - Global Trade Index
Date: 2003-09-03 (Wed) 6:40 AM
Size: 6 KB

Dear Site Owner,
You are invited to list your site at the most important Trade Directory on the Internet. This directory system has attributes no other directory on the Internet has had do date, check us out! Manufacturers - Wholesalers - Distributors - Resellers and all businesses that are associated are welcome on our directory.

Your business will prosper from its association with our global resources! There are NO CHARGES for a listing. Just click here and enter your business details as you like. Thank you for your kind assistance in this matter,

The Team
[email protected]
Toll Free - 866 516-8412

Is this spam?

http://www.economicgrowthnetwork.com/

http://www.economicgrowthnetwork.com/advertising.cfm


34

Classification - Motivation

Re: 요청하신 자료입니다. [“Re: Here is the material you requested.”]
그렇게 가면 안되지. [“You shouldn't go like that.”]
Get V^iagram in the convenience of your home
Generi.c Cia.lis – Lasts 2 times longer then Via.gra!
Re,no-va”te”*”you:r ‘d*ow_nst>;airs;^-

How about these subject lines?


35

Classification Problem

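E-mail classification
  Given an e-mail message containing a question and/or complaint, where should it be sent in ERMS?
  Categories: AS, subscription/unsubscription, passwords, upgrades, usage, about-products, other questions, other complaints

A not-so-easy example (Korean, written in informal Internet slang):
  제목: 아뒤랑 비번를 잊어먹었슴다.
  본문: 안녕하세여… 오랫만에 홈에 들렀더니… 제가 그만 아뒤랑 pass를 잊어버렸네요 음…
  [Subject: “I forgot my ID and password.” Body: “Hello… I dropped by the homepage after a long time… and I have gone and forgotten my ID and pass, hmm…”]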


36

Problem Statement

Given:
  A description of an instance, $x \in X$, where $X$ is the instance language or instance space.
  A fixed set of categories: $C = \{c_1, c_2, \ldots, c_n\}$

Determine:
  The category of $x$: $c(x) \in C$, where $c(x)$ is a categorization function whose domain is $X$ and whose range is $C$.

How to represent text documents?
How to build categorization functions (“classifiers”)?


37

A Typical Example for Document Classification

Classes:
  (AI): ML, Planning
  (Programming): Semantics, Garb.Coll.
  (HCI): Multimedia, GUI

Training Data:
  ML: learning, intelligence, algorithm, reinforcement, network, ...
  Planning: planning, temporal, reasoning, plan, language, ...
  Semantics: programming, semantics, language, proof, ...
  Garb.Coll.: garbage, collection, memory, optimization, region, ...
  ...

Testing Data:
  “planning language proof intelligence”


38

Classification Methods

Rule-based Methods
  E.g.: assign a category if the document contains a given Boolean combination of words
  Accuracy is often very high if a profile has been carefully refined over time by a subject expert.
  Building and maintaining these profiles is expensive.

Inductive Learning Models
  Naïve Bayesian Model
  Decision Tree Model
  SVM (Support Vector Machine)

Similarity-Based Models
  K-Nearest Neighbor
  Rocchio’s Model

These require hand-classified training data, but can be built (and refined) easily.
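To make the inductive learning idea concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing. The smoothing choice, the function names, and the one-document-per-class training set are assumptions for illustration; the training words echo the category keywords used earlier in this tutorial.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect the counts needed by multinomial Naive Bayes.

    `labeled_docs` is a list of (label, list_of_words) pairs.
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab


def classify_nb(model, words):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("ML", "learning intelligence algorithm reinforcement network".split()),
    ("Planning", "planning temporal reasoning plan language".split()),
    ("Semantics", "programming semantics language proof".split()),
    ("Garb.Coll.", "garbage collection memory optimization region".split()),
]
model = train_nb(training)
print(classify_nb(model, "memory optimization".split()))  # Garb.Coll.
```

With only one short training document per class, mixed inputs such as “planning language proof intelligence” are decided by very small count differences, which illustrates why hand-classified training data in realistic quantities matters.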


39

Topic Detection & Tracking (TDT) (1)

Event: A reported occurrence at a specific time and place, and the unavoidable consequences.

Specific elections, accidents, crimes, natural disasters. “TWA-800 airplane crash” vs. “airplane accidents”

Activity: A connected set of actions that have a common focus or purpose

campaigns, investigations, disaster relief efforts

Topic: a seminal event or activity, along with all directly related events and activities

Story: a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single topic


40

TDT (2) – First Story Detection

Automatically identify the first story on a new event from a stream of text

To detect the first story that discusses a topic, for all topics.

[Figure: stories from Topic 1 and Topic 2 along a time axis, each marked as a first story or not a first story]


41

TDT (3) – More about FSD

First story detection is an unsupervised learning task. There is no supervised training.
On-line vs. Retrospective
  On-line: flag the onset of new events from live news feeds as stories come in
  Retrospective: detection consists of identifying the first story, looking back over a longer period
Lack of advance knowledge of new events, but have access to unlabeled historical data as a contrast set
Applications
  Intelligence services
  Finance: be the first to trade a stock


42

TDT (4) – Other Tasks

Topic Tracking
  Once a topic has been detected, identify subsequent stories about it
  Standard text classification task
  However, very small training set (initially: 1!)

Topic Detection
  Grouping stories from an accumulated collection
  Event-based classification with multiple topics (events)
  Retrospective


43

TDT (5)

Characteristics of events
  News articles on an event are temporally close to each other.
    lexical and temporal similarities
  “Similar” news articles over an extended time period => different events
    Use a time window to determine the scope of an event
  Changes in the used terms and their frequencies => new event

Use of clustering techniques (example)
  Retrospective (TD): bottom-up clustering with a time window
  Online (Tracking)
    single-pass, incremental clustering
    incremental IDF
    use of a decaying function for old documents in computing the similarity
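The on-line ideas above (single pass, time window, decay for old documents) can be sketched as follows. This is a simplification: each story is compared with earlier stories rather than with cluster centroids, incremental IDF is omitted, and the threshold, window, and linear decay are arbitrary assumed values.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def first_stories(stories, threshold=0.3, window=3, decay=0.1):
    """Flag a story as 'first' if no earlier story inside the time window
    is similar enough. `stories` is a list of (day, text) in arrival order.
    """
    seen = []   # (day, term-frequency vector) of earlier stories
    flags = []
    for day, text in stories:
        vec = Counter(text.lower().split())
        best = 0.0
        for old_day, old_vec in seen:
            age = day - old_day
            if age > window:
                continue  # outside the time window
            # Older stories count less: linear decay with age.
            best = max(best, cosine(vec, old_vec) * max(0.0, 1.0 - decay * age))
        flags.append(best < threshold)
        seen.append((day, vec))
    return flags


# Hypothetical news stream.
stories = [
    (1, "plane crash near new york"),
    (2, "investigators examine plane crash wreckage"),
    (2, "election results announced in seoul"),
    (9, "plane crash in the alps"),
]
print(first_stories(stories))  # [True, False, True, True]
```

The last story is lexically similar to the first two but falls outside the time window, so it is treated as the first story of a different event.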


44

Recommender Systems (1)

Recommender systems are a technological proxy for a social process
  We rely on recommendations from other people.
  An information discovery model where people try to find other people with similar tastes and then ask them to suggest new things

In a typical recommender system
  People provide recommendations (input)
  The system aggregates and directs them to appropriate recipients.


45

Recommender Systems (2)

Motivations
  Can we automatically aggregate quotes like:
    "I like this book; you might be interested in it"
    "I saw this movie, you’ll like it"
    "Don’t go see that movie!"
  Finding new books, music, or movies, previously unknown to users

Applications
  Corporate Intranets
    Recommendations, finding domain experts, …
  Ecommerce
    Product recommendations: Amazon, CDNOW, …
  Medical Applications
    Matching patients to doctors, clinical trials, …
  Customer Relationship Management
    Matching customer problems to internal experts in a support organization


46

Recommender Systems (3) - Types

Collaborative/Social-filtering system – aggregation of consumers’ preferences and recommendations to other users based on similarity in behavioral patterns

Content-based system – supervised machine learning used to induce a classifier to discriminate between interesting and uninteresting items for the user

Knowledge-based system – knowledge about users and products used to reason what meets the user’s requirements, using discrimination tree, decision support tools, case-based reasoning (CBR)
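A collaborative filtering system can be sketched as a user-based scheme that weights other users by Pearson correlation. This particular formulation, the function names, and the ratings are assumptions for illustration; the slides do not commit to a specific algorithm.

```python
import math


def pearson(u, v):
    """Pearson correlation over the items both users rated."""
    common = [i for i in u if i in v]
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = (math.sqrt(sum((u[i] - mu) ** 2 for i in common))
           * math.sqrt(sum((v[i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0


def predict(ratings, user, item):
    """Predict `user`'s rating of `item` from positively correlated users."""
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        w = pearson(ratings[user], r)
        if w <= 0:
            continue  # ignore users with dissimilar taste
        mean_o = sum(r.values()) / len(r)
        num += w * (r[item] - mean_o)
        den += w
    return mean_u + num / den if den else mean_u


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 5, "m2": 5, "m3": 2, "m4": 5},
    "cho": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}
print(round(predict(ratings, "ann", "m4"), 2))  # 4.08
```

Only bob's taste correlates positively with ann's, so the prediction for m4 is ann's mean rating shifted by how far bob rated m4 above his own mean.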


47

Recommender Systems (4) - Example


48

Automatic Summarization (1)

Functionality
  indicative: to determine if the document would be of any interest
  informative: to reflect the original content as faithfully as possible under the compression rate
  evaluative: evaluation of the original document

Fluency
  fragmented vs connected text

Users
  generic vs user (query)-focused

Target Documents
  single vs multiple documents


49

Automatic Summarization (2)

[Figure: summarization pipeline from Source(s) to Summary through an intermediary representation, in four stages:
  Analysis: word frequencies, clue phrases, layout, syntax, semantics, discourse, pragmatics
  Selection: word count, clue phrases, statistical, structural
  Condensation: abstraction, aggregation
  Presentation: planning, realization, layout]


50

Automatic Summarization (3) - Approaches

Natural language understanding / generation
  Build knowledge representation of text
  Generate sentences summarizing content
  Hard to do well

Keyword summaries
  Display most significant keywords
  Easy to do
  Hard to read, poor representation of content

Sentence extraction
  Extract key sentences
  Medium hard
  Summaries often don’t read well
  Good representation of content


51

Question Answering (Q/A) (1)

To provide an answer to a query, as opposed to a document
  Query: for factoid or exact answer
  Result: ranked list of <document, answer string> pairs
    Answer string: 50-250 bytes
    Documents: to supplement the answer

Examples from TREC-9
  How much folic acid should an expectant mother get daily?
  Who invented the paper clip?
  Where is Rider College located?
  Name a film in which Jude Law acted.


52

Question Answering (2)

System Flow (example)

[Figure: example Q/A system flow from User Query to Answer. Components: Query Analyzer, Doc Analyzer, Retrieval Engine, Answer Extractor. Data and resources: Query Set, Query Categories, Thesaurus, Rule Set, Document Set, Retrieved Docs, Candidates.]


53

Question Answering (3)

Assignment of query categories
  Query: “What is the fare cost for the round trip between New York and London on Concorde?”
  Rule applied: What [be] [ADJ] [NOUN] for
  Main phrase extracted: fare cost
  Categorization of the phrase: Financial loss
  Assign a query category


54

Information Extraction

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Extracting Job Openings from the Web


55

Information Extraction (2)

Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources

Identify specific pieces of information in an unstructured or semi-structured textual document.

Transform this unstructured information into structured relations in a database/ontology.

Source text (news article, translated from Korean):
  ... Terror attack in Tokyo, Japan ...
  ... afternoon of April 11 ... Minister of Commerce ... Tokyo ... homemade bomb ... 5 passers-by seriously injured ... 2 cars ... damaged ...
  ... In the case of Germany ...

Extracted record (stored in the DB):
  Event: terror attack
  Date/time: 4.12, afternoon
  Place: Tokyo
  Target: Minister of Commerce
  Casualties: 5 seriously injured
  Property damage: 2 cars
  ...


56

Information Extraction (3) – flow (example)

[Figure: example IE flow from document to extracted templates. Stages: lexical analysis, name recognition, partial syntactic analysis, scenario pattern matching, coreference analysis, inference, template generation. The stages are grouped under “local text analysis” and “discourse analysis”.]

[Grishman, 1997]


57

Information Extraction: MUC (State of the Art – 1997)

NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production


58

Knowledge Extraction Vision

Multi-dimensional Meta-data Extraction

Meta-Data
  EMPLOYEE / EMPLOYER Relationships:
    Jan Clesius works for Clesius Enterprises
    Bill Young works for InterMedia Inc.
  COMPANY / LOCATION Relationships:
    Clesius Enterprises is in New York, NY
    InterMedia Inc. is in Boston, MA

[Figure: the topic “India Bombing” followed by month (J F M A M J J A) across the sources NY Times, Andhra Bhoomi, Dinamani, and Dainik Jagran. Component labels: Topic Discovery, Concept Indexing, Thread Creation, Term Translation, Document Translation, Story Segmentation, Entity Extraction, Fact Extraction.]


59

References

Belkin, N.J. & Croft, W.B. (1987). Retrieval Techniques. In: Williams, M.E. (ed.), Annual Review of Information Science and Technology, 22, 109-145, New York: Elsevier & ASIS.

E. Glover, S. Lawrence, M. Gordon, W. Birmingham, C. Lee Giles (2001). Web Search – Your Way. Comm. of the ACM, 44 (12), 97-102.

Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.

Doug Oard et al. (1999). “Multilingual Information Discovery and Access (MIDAS).” D-Lib Magazine, 5 (10), Oct.

Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.


60

References

James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.

Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.

Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html

Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.

Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”

Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag. (See http://nlp.cs.nyu.edu/publication/index.shtml)

Andrew McCallum and Kamal Nigam (1998). “A Comparison of Event Models for Naive Bayes Text Classification.” In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
