Copyright © 2004 Sung Hyon Myaeng. Information Retrieval Tutorial, 2004. 2. 13. Information & Communications University, IR & NLP Lab, http://ir.icu.ac.kr

Information Retrieval Tutorial

2004. 2. 13

Information & Communications University, IR & NLP Lab

http://ir.icu.ac.kr

Sung Hyon Myaeng (맹성현)

Outline

What is Information Retrieval (IR)?
Overview of Core IR Technology
Overall Directions
IR Expanded
  CLIR/MLIR
  Classification
  Topic Detection & Tracking
  Recommender Systems
  Summarization
  Question Answering
  Information Extraction

What is IR?

Traditional IR: Willow System

What is IR?

Google Web Search Engine

What is IR?

Ask Jeeves

IR & the Rest of the World

[Figure: Information Retrieval shown among related fields: AI, Cognitive Science, Linguistics, Natural Language Processing, Human Computer Interaction, Computer Science, Library & Info Science, Statistics, DB]

Evaluation of IR Systems

Effectiveness (“relevance”)
  precision: A / (A + C)
  recall: A / (A + B)
Efficiency
Interactive systems? Others?

            Ret    NOT Ret
Rel          A        B
NOT Rel      C        D

[Figure: precision plotted against recall for System A, System B, and System C; axis ticks run from 0 to 0.9]
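The two measures can be computed directly from the sets of retrieved and relevant documents. The sketch below is only an illustration: the function name `precision_recall` and the sample document ids are assumptions, while A, B, and C follow the contingency table on this slide.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document ids the system returned
    relevant:  set of document ids judged relevant
    """
    a = len(retrieved & relevant)   # A: relevant and retrieved
    b = len(relevant - retrieved)   # B: relevant but not retrieved
    c = len(retrieved - relevant)   # C: retrieved but not relevant
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 5 relevant, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 9})
print(p, r)
```

With these made-up sets, three of the four retrieved documents are relevant and three of the five relevant documents are found.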

Overview of Text Retrieval

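[Figure: components of a text retrieval system. Text Processing: Raw text, Text Analysis, Index. User/System Interaction: Info Needs, Analysis of Info Needs, Query. Search Engine: Matching (Inferencing) between Index and Query, yielding the Retrieval Result. Knowledge Resources & Tools support the analysis steps.]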

Text Processing (1) - Indexing

Extraction of index terms and computation of their weights
  Index terms: represent document content & separate documents
  “economy” vs “computer” in a news article of Financial Times
Morphological Analysis (stemming in English)
  “벨기에는” (“벨기” + “에는”?) [“Belgium” plus a particle], “문서내의” (“문서” + “내의”) [“within the document”]
  “information”, “informed”, “informs”, “informative”
  Rule-based vs dictionary-based
n-gram
  “정보검색시스템” [“information retrieval system”] => “_정”, “정보”, “보검”, “검색”, … (bi-gram)
  “부정사” [“infinitive”] vs “부정한 정사” [“illicit affair”] (similar enough in bi-gram!)
  Surprisingly effective in some languages
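The bi-gram example can be reproduced with a few lines of Python. This is a minimal sketch, not the tutorial's implementation: the function names are made up, and the Dice coefficient is only one possible (assumed) way to quantify "similar enough".

```python
def char_bigrams(text, pad="_"):
    """Split text into overlapping character bi-grams.

    A leading pad character marks the start, as in "_정" on the slide.
    Whitespace is dropped first, so spacing differences do not matter.
    """
    s = pad + "".join(text.split())
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Dice coefficient between the bi-gram sets of two strings."""
    x, y = set(char_bigrams(a)), set(char_bigrams(b))
    return 2 * len(x & y) / (len(x) + len(y))


print(char_bigrams("정보검색시스템"))
print(dice("부정사", "부정한 정사"))
```

The two unrelated phrases share most of their bi-grams, which is the false-match risk the slide points out.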

Text Processing (2) – Storing indexing results

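[Figure: documents 1, 2, 3, 4, …, n, each containing some of the terms A to G, and the corresponding term-by-document matrix (rows A to G, columns 1 to n, with a mark where a term occurs in a document), stored as an inverted index]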
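A minimal sketch of building such an inverted index follows. The term letters echo the slide, but the assignment of terms to documents and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection: document id -> index terms.
docs = {
    1: ["A", "B", "E"],
    2: ["A", "C", "F"],
    3: ["C", "F"],
    4: ["A", "D", "G"],
}
index = build_inverted_index(docs)
print(index["A"])
```

Storing only the marked cells of the matrix, term by term, is what keeps the structure compact when most terms occur in few documents.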

Text Processing (3) - Indexing

Use of various linguistic resources
  Dictionaries (noun, Josa, Eomi, bilingual, proper noun, foreign words, …)
    For extraction and weighting of index terms
  Thesaurus (e.g. WordNet)
    Controlled vocabulary indexing
    Matching similar and related words
  Tagged Corpus
Most NLP technology is used for term extraction
  “Bag of words” approach
  Sense disambiguation? Word order?

Overview of Text Retrieval

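[Figure: components of a text retrieval system. Text Processing: Raw text, Text Analysis, Index. User/System Interaction: Info Needs, Analysis of Info Needs, Query. Search Engine: Matching (Inferencing) between Index and Query, yielding the Retrieval Result. Knowledge Resources & Tools support the analysis steps.]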

User/System Interaction – Query Models

Boolean
  AND, OR, NOT operators
    E.g.: ((semi-conductor OR chip) AND stock NOT chocolate)
  adjacency, phrase operators
    E.g.: “stock exchange”, “그리고 아무 말도 하지 않았다” [“and said nothing”]
  Difficult for naïve users => visual query interface
Word list
  Vector space model system
    E.g.: (semi-conductor chip stock)
  Often interpreted as a Boolean query in search engines
    E.g.: (semi-conductor OR chip OR stock)
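Boolean operators map naturally onto set operations over posting sets. The sketch below assumes a tiny hypothetical `postings` table; only the two queries themselves come from the slide.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
postings = {
    "semi-conductor": {1, 2},
    "chip": {2, 3, 5},
    "stock": {1, 3, 4, 5},
    "chocolate": {5},
}


def docs_for(term):
    """Posting set for a term; unknown terms match nothing."""
    return postings.get(term, set())


def strict_query():
    """(semi-conductor OR chip) AND stock NOT chocolate"""
    either = docs_for("semi-conductor") | docs_for("chip")   # OR  -> union
    with_stock = either & docs_for("stock")                  # AND -> intersection
    return with_stock - docs_for("chocolate")                # NOT -> difference


def word_list_as_or(terms):
    """A plain word list interpreted as an OR query."""
    result = set()
    for term in terms:
        result |= docs_for(term)
    return result


print(sorted(strict_query()))
print(sorted(word_list_as_or(["semi-conductor", "chip", "stock"])))
```

The OR interpretation returns every document that mentions any term, which is why such result sets usually need ranking.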

User/System Interaction – Query Models

“Natural Language” Query
  E.g.: “I want to get information about ski resorts in Kangwon-do or in the Chungcheong area.”
  Limitations in NLP
Various tricks
  Query Expansion
    To resolve mismatches between query terms and index terms for documents
    A variety of linguistic resources are used (e.g. synonym, foreign word equivalence classes, bilingual dictionaries)
  Guide users to follow step-by-step instructions for detailed queries
    “canned queries” (E.g.: “Ask Jeeves”)
    query templates
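Query expansion with equivalence classes can be sketched as a dictionary lookup. The `equivalents` table and its entries are hypothetical; a real system would draw them from a thesaurus or bilingual dictionary as the slide suggests.

```python
def expand_query(terms, equivalents):
    """Add every listed equivalent of each query term.

    Order is preserved and duplicates are dropped.
    """
    expanded = []
    for term in terms:
        for candidate in [term] + equivalents.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical equivalence classes: synonyms plus a bilingual entry.
equivalents = {
    "ski": ["skiing", "스키"],
    "resort": ["lodge"],
}
print(expand_query(["ski", "resort", "Kangwon-do"], equivalents))
```

Terms without an entry, such as the place name here, pass through unchanged.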

Ask Jeeves screen

User/System Interaction – Query Models

Relevance feedback
  “Similar Pages” in Web search engines
  From a simple query to better queries progressively
    Limited recall capability of human beings
    Recognition of a relevant document is much easier.
  Intended to ease the difficulty of grasping the statistical properties of the entire collection
  An indirect way of capturing the user needs
User profile
  To reflect user’s interest and orientation in interpreting user queries
  Need to gather & analyze user log data and learn user models
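The slide does not name a specific feedback method. One classic formulation is Rocchio's query modification, sketched below under that assumption; the default values of `alpha`, `beta`, and `gamma` are conventional illustrative choices, not values from the tutorial.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (dict term -> weight) after feedback.

    query:       dict term -> weight
    relevant:    list of document vectors the user marked relevant
    nonrelevant: list of document vectors the user marked non-relevant
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:   # terms pushed to zero or below are dropped
            new_query[t] = weight
    return new_query


# Hypothetical feedback: one relevant and one non-relevant document.
better = rocchio({"ski": 1.0},
                 relevant=[{"ski": 1.0, "resort": 2.0}],
                 nonrelevant=[{"jump": 1.0}])
print(better)
```

The user only has to recognize a relevant document; the new term "resort" enters the query without the user having to recall it.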

User/System Interaction – Result Presentation

Information overload problem – too many retrieved
  A simple ranked list - title, author, URL, date, …
Method 1: Organizing the retrieved documents
  Result Clustering (E.g. Vivisimo)
  “Zoom-in” operation (E.g.: Scatter & Gather)
Method 2: Visualizing the retrieved documents
  Overview of a large amount of information
  Visual expression of document properties
  E.g. TileBar

Scatter/Gather

Tile Bar

Result Clustering

Text Retrieval Overview

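[Figure: components of a text retrieval system. Text Processing: Raw text, Text Analysis, Index. User/System Interaction: Info Needs, Analysis of Info Needs, Query. Search Engine: Matching (Inferencing) between Index and Query, yielding the Retrieval Result. Knowledge Resources & Tools support the analysis steps.]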

Matching & Ranking (1)

Inverted File, …

[Figure: inverted file with columns Terms, Wt, and Pointers. A Directory of terms (가구, 가야 [Gaya], …, 신라 [Silla], …, 호랑이 [tiger]) points into a Posting file, whose entries refer to documents (Doc #1, Doc #2, Doc #5, …); the Query is looked up in the Directory.]

Matching & Ranking (2)

Ranking
Retrieval Model
  Boolean (exact) => Fuzzy Set (inexact)
  Vector Space
  Probabilistic
  Inference Net
  …
Weighting Schemes
  Index terms, query terms
  Parameters in formulas
  Document characteristics
  …

IR Model Example: Vector Space Model

<DOC 1> ... cat ... dog ... dog ... mouse ... dog ... mouse ...

Q = < cat, mouse, 0 >

Di = (di1, di2, ... , din)
Q = (q1, q2, ... , qn)
Similarity(Di, Q) = (Di . Q) / (|Di| * |Q|)

[Figure: D1 and Q drawn as vectors in a space whose axes are dog, mouse, and cat]
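The similarity formula can be checked on the slide's own example. In this sketch the vector components are raw term counts, which is an assumption; the tutorial does not say which weighting the figure uses.

```python
import math


def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0


# Axes: (cat, dog, mouse). Counts read off <DOC 1> on the slide:
# cat once, dog three times, mouse twice. The query holds cat and mouse.
d1 = (1, 3, 2)
q = (1, 0, 1)
print(round(cosine(d1, q), 3))
```

The heavy "dog" component pulls D1 away from the query, so the score stays well below 1 even though both query terms occur in the document.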

Matching & Ranking (3)

Techniques for efficiency
  New storage structure esp. for new document types
  Use of accumulators for efficient generation of ranked output
  Compression/decompression of indexes
Technique for Web search engines
  Use of hyperlinks
    Inlinks & outlinks
    Authority vs hub pages
  In conjunction with Directory Services (e.g. Yahoo)
  Softbot – storing terabytes of data and efficient crawling
  ...
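The accumulator idea can be sketched as term-at-a-time scoring over an inverted index. The index layout, the weights, and the function name below are illustrative assumptions rather than the tutorial's design.

```python
import heapq
from collections import defaultdict


def ranked_retrieval(query_weights, index, k=10):
    """Score documents term by term, one accumulator per candidate.

    query_weights: dict term -> query weight
    index:         dict term -> list of (doc_id, document term weight)
    """
    accumulators = defaultdict(float)
    for term, q_w in query_weights.items():
        for doc_id, d_w in index.get(term, []):
            accumulators[doc_id] += q_w * d_w   # partial inner product
    # Only documents sharing a term with the query ever get an accumulator.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


# Hypothetical weighted postings.
index = {
    "cat": [(1, 0.7), (2, 0.9)],
    "mouse": [(2, 0.6), (5, 0.9)],
}
print(ranked_retrieval({"cat": 1.0, "mouse": 1.0}, index, k=2))
```

Because scores build up while the postings are read, the full document vectors never need to be loaded to produce the ranked list.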

Web document retrieval – using hyperlinks

[Figure labels: TERM; Initial Retrieval Set; Candidates for additional retrieval; To be ranked again using the link information; documents A, B, C]

A: Hub document
B: Authority document
Increase the weight of A, B
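Hub and authority weights are commonly computed with an iterative scheme in the style of Kleinberg's HITS algorithm. The tutorial does not name the method, so the sketch below is an assumption, as are the toy link graph and the iteration count.

```python
def hits(links, iterations=20):
    """Iterate hub and authority scores on a link graph.

    links: dict page -> list of pages it links to
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: A links to B and C; D links to B.
hub, auth = hits({"A": ["B", "C"], "D": ["B"]})
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph A comes out as the strongest hub and B as the strongest authority, matching the roles the slide assigns to A and B.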

Characteristics of IR - summary

Information Retrieval/Data Retrieval Spectrum (Unstructured vs Structured)

                      Information Retrieval        Data Retrieval
Retrieval Models      Probabilistic                Deterministic
Indexing              Derived from contents        Complete Items
Matching/Retrieval    Partial or “Best” Match      Exact Match
Query Types           Natural Language             Structured
Results Criteria      Relevance                    Any Match
Results Ordering      Ranked                       Arbitrary

Overall Directions (1)

Efforts to improve retrieval effectiveness (as always!)
  Retrieval model, text analysis and representation, user interactions, ...
Specialized Search: domain-specific
Context awareness (personalization, task-centered)
  profile, session logs, task models, etc.
Multi-something
  multimedia, multi-style, multilingual, …
Distributed Environment with a large quantity
  Web search, meta-search, distributed retrieval (DB segmentation), meta-data retrieval, semantic Web
New functionality
  filtering, TDT, classification, summarization, QA, information extraction, ...

Cross-Language & Multilingual IR

Cross-language IR
  Using a language (mother tongue) to retrieve documents in another language
  To overcome the language differences
  [terminology] cross-lingual, translingual (DARPA)
Multilingual IR
  “retrieving relevant document in any of the languages contained in a multilingual document collection”
  (CL+ | ML) Document Retrieval
  E.g.: Using Korean queries to search a DB consisting of Korean, English and Japanese documents

CLIR

The number of documents in languages other than one's own is rapidly increasing, and so is the need for retrieval.
  The rate of annual increase for documents on the Web
    English: 50%; All other languages: 90%
Multilingual countries, organizations, enterprises, & users
The limitation of machine translation technology
  More economical to translate necessary documents after retrieval
Not easy to construct a query in a foreign language even with the ability to comprehend written materials in the language
  reading vs. writing
CLIR is fundamental to other multilingual information access technology

CLIR Problem (example)

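Query: 현대자동차 주식 동향 [“Hyundai Motor stock trend”]

Dictionary lookup of each query term:
  현대자동차: ?? (segmentation is ambiguous: “현 대자 동차” vs “현대 자동차”)
  주식: principal food, stocks, food and drink
  동향: same village, eastern exposure, trend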
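The translation ambiguity in this example multiplies quickly. The sketch below enumerates every combination of the candidate translations listed on the slide; the `dictionary` structure and the rule of passing unknown terms through untranslated are assumptions for illustration.

```python
from itertools import product

# Candidate translations copied from the slide; an empty list marks a
# term that the (hypothetical) bilingual dictionary does not cover.
dictionary = {
    "현대자동차": [],
    "주식": ["principal food", "stocks", "food and drink"],
    "동향": ["same village", "eastern exposure", "trend"],
}


def candidate_translations(query_terms, dictionary):
    """Enumerate every combination of per-term dictionary translations.

    Terms without an entry are passed through untranslated.
    """
    options = [dictionary.get(term) or [term] for term in query_terms]
    return [" ".join(combo) for combo in product(*options)]


candidates = candidate_translations(["현대자동차", "주식", "동향"], dictionary)
print(len(candidates))
```

Only one of the nine combinations carries the intended sense, so a CLIR system needs some way to select or weight the candidates.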

Retrieval of Structured Documents

XML documents, hypertext, metadata, semantic Web
Queries for structure and content
  FIND a document that INCLUDES a chapter whose title CONTAINS the term “hypertext” AND whose section CONTAINS the term “browsing”.
Queries for content and link
  FIND all documents about “information retrieval” that are referred to by a paper written by “Myaeng”.
Retrieval with ontology vs retrieval from ontology (e.g. RQL)
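The structure-and-content query above can be approximated over XML with Python's standard `xml.etree.ElementTree` module. The element names `chapter`, `title`, and `section` and the sample document are assumptions, and simple substring matching stands in for real term matching.

```python
import xml.etree.ElementTree as ET


def chapter_matches(xml_text, title_term="hypertext", section_term="browsing"):
    """True if some chapter has the title term and a section with the section term."""
    root = ET.fromstring(xml_text)
    for chapter in root.iter("chapter"):
        title = (chapter.findtext("title") or "").lower()
        if title_term not in title:
            continue
        for section in chapter.iter("section"):
            if section_term in "".join(section.itertext()).lower():
                return True
    return False


# Hypothetical document.
doc = """<book>
  <chapter><title>Hypertext Systems</title>
    <section>Browsing and navigation.</section>
  </chapter>
</book>"""
print(chapter_matches(doc))
```

Both conditions must hold within the same chapter, which is what separates this from a flat keyword query.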

Classification - Motivation

Sender: [email protected]
Subject: Your Business Listing - Global Trade Index
Date: 2003-09-03 (Wed) 6:40 AM
Size: 6 KB

Dear Site Owner,

You are invited to list your site at the most important Trade Directory on the Internet. This directory system has attributes no other directory on the Internet has had do date, check us out! Manufacturers - Wholesalers - Distributors - Resellers and all businesses that are associated are welcome on our directory.

Your business will prosper from its association with our global resources! There are NO CHARGES for a listing. Just click here and enter your business details as you like. Thank you for your kind assistance in this matter,

The Team [email protected] Free - 866 516-8412

Is this spam?

Classification - Motivation

Re: 요청하신 자료입니다. [“Re: Here is the material you requested.”]
그렇게 가면 안되지. [“You can't just leave like that.”]
Get V^iagram in the convenience of your home
Generi.c Cia.lis – Lasts 2 times longer then Via.gra!
Re,no-va”te”*”you:r ‘d*ow_nst>;airs;^-

How about these subject lines?

Classification Problem

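E-mail classification
  Given an e-mail message containing question and/or complaint, where should it be sent in ERMS?
  Categories:
    AS, subscription/unsubscription, passwords, upgrades, usage, about-products, other questions, other complaints
A not-so-easy example:
  제목: 아뒤랑 비번를 잊어먹었슴다. [Subject: “I forgot my ID and password.”, written in informal Internet slang]
  본문: 안녕하세여…오랫만에 홈에 들렀더니…제가 그만 아뒤랑 pass를 잊어버렸네요 음… [Body: “Hello… I dropped by the homepage after a long while… and I've gone and forgotten my ID and pass, hmm…”]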

Problem Statement

Given:
  A description of an instance, x ∈ X, where X is the instance language or instance space.
  A fixed set of categories: C = {c1, c2, …, cn}
Determine:
  The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
How to represent text documents?
How to build categorization functions (“classifiers”)?

A Typical Example for Document Classification

Classes: (AI) (Programming) (HCI)
  AI: ML, Planning
  Programming: Semantics, Garb.Coll.
  HCI: Multimedia, GUI
Training Data:
  ML: learning, intelligence, algorithm, reinforcement, network, ...
  Planning: planning, temporal, reasoning, plan, language, ...
  Semantics: programming, semantics, language, proof, ...
  Garb.Coll.: garbage, collection, memory, optimization, region, ...
Testing Data: “planning language proof intelligence”
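A classifier for this example can be sketched with a very small Naive Bayes model. The training words are the ones shown on the slide; the uniform class prior, add-one smoothing, and the function names are assumptions made to keep the sketch short.

```python
import math
from collections import Counter

# Training words per class, as listed on the slide.
training = {
    "ML": "learning intelligence algorithm reinforcement network",
    "Planning": "planning temporal reasoning plan language",
    "Semantics": "programming semantics language proof",
    "Garb.Coll.": "garbage collection memory optimization region",
}


def train(training):
    """Count word occurrences per class and collect the vocabulary."""
    counts = {c: Counter(text.split()) for c, text in training.items()}
    vocab = {w for cnt in counts.values() for w in cnt}
    return counts, vocab


def classify(text, counts, vocab):
    """Pick the class with the highest log-likelihood.

    Uses a uniform prior and add-one smoothing; words outside the
    vocabulary are ignored.
    """
    best, best_score = None, -math.inf
    for c, cnt in counts.items():
        total = sum(cnt.values())
        score = sum(math.log((cnt[w] + 1) / (total + len(vocab)))
                    for w in text.split() if w in vocab)
        if score > best_score:
            best, best_score = c, score
    return best


counts, vocab = train(training)
print(classify("reinforcement learning network", counts, vocab))
```

The slide's own test document mixes words from several classes, so its label depends on details such as smoothing and class size; that sensitivity is part of what makes the example instructive.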

Classification Methods

Rule-based Methods
  E.g.: assign a category if document contains a given Boolean combination of words
  Accuracy is often very high if a profile has been carefully refined over time by a subject expert.
  Building and maintaining these profiles is expensive.
Inductive Learning Models
  Naïve Bayesian Model
  Decision Tree Model
  SVM (Support Vector Machine)
Similarity-Based Models
  K-Nearest Neighbor
  Rocchio’s Model
These require hand-classified training data, but can be built (and refined) easily.

Topic Detection & Tracking (TDT) (1)

Event: A reported occurrence at a specific time and place, and the unavoidable consequences.

Specific elections, accidents, crimes, natural disasters. “TWA-800 airplane crash” vs. “airplane accidents”

Activity: A connected set of actions that have a common focus or purpose

campaigns, investigations, disaster relief efforts

Topic: a seminal event or activity, along with all directly related events and activities

Story: a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single topic

TDT (2) – First Story Detection

Automatically identify the first story on a new event from a stream of text

To detect the first story that discusses a topic, for all topics.

[Figure: a stream of stories along a Time axis, marked by topic (Topic 1, Topic 2); stories are labeled as First Stories or Not First Stories]

TDT (3) – More about FSD

First story detection is an unsupervised learning task. There is no supervised training.
On-line vs. Retrospective
  On-line: Flag onset of new events from live news feeds as stories come in
  Retrospective: Detection consists of identifying first story looking back over longer period
Lack of advance knowledge of new events, but have access to unlabeled historical data as a contrast set
Applications
  Intelligence services
  Finance: Be the first to trade a stock

TDT (4) – Other Tasks

Topic Tracking
  Once a topic has been detected, identify subsequent stories about it
  Standard text classification task
  However, very small training set (initially: 1!)
Topic Detection
  Grouping stories from an accumulated collection
  Event-based classification with multiple topics (events)
  Retrospective

TDT (5)

Characteristics of Events
  News articles on an event are temporally close to each other.
    lexical and temporal similarities
  “Similar” news articles over an extended time period => Different events
    Use a time window to determine the scope of an event
  Changes in the used terms and their frequencies => new event
Use of clustering techniques (example)
  retrospective: bottom-up clustering with a time window (TD)
  Online (Tracking)
    single-pass, incremental clustering
    incremental IDF
    use of decaying function for old documents in computing the similarity
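Single-pass clustering with a time decay can be sketched as follows. The `threshold` and `half_life` values, the exponential form of the decay, and the data layout are assumptions; incremental IDF is left out to keep the example short.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(stories, threshold=0.3, half_life=7.0):
    """Assign each story to a cluster in one pass over the stream.

    stories: list of (day, term-weight dict), ordered by time.
    Returns one cluster id per story; a story that opens a new
    cluster is treated as a first story.
    """
    clusters = []   # each: {"centroid": dict, "last_day": float}
    labels = []
    for day, vec in stories:
        best, best_sim = None, 0.0
        for i, c in enumerate(clusters):
            # Older clusters count for less: similarity halves every half_life days.
            decay = 0.5 ** ((day - c["last_day"]) / half_life)
            sim = decay * cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < threshold:
            clusters.append({"centroid": dict(vec), "last_day": day})
            labels.append(len(clusters) - 1)
        else:
            centroid = clusters[best]["centroid"]
            for t, w in vec.items():
                centroid[t] = centroid.get(t, 0.0) + w
            clusters[best]["last_day"] = day
            labels.append(best)
    return labels


# Hypothetical stream: two crash stories, an election story, and a
# lexically similar crash story two months later.
stories = [
    (0, {"crash": 1, "plane": 1}),
    (1, {"crash": 1, "plane": 1, "inquiry": 1}),
    (2, {"election": 1, "vote": 1}),
    (60, {"crash": 1, "plane": 1}),
]
print(single_pass(stories))
```

The last story resembles the first cluster lexically but arrives much later, so the decay pushes it into a new cluster, in line with the slide's point that similar articles far apart in time describe different events.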

Recommender Systems (1)

Recommender systems are a technological proxy for a social process
  We rely on recommendations from other people.
  An information discovery model where people try to find other people with similar tastes and then ask them to suggest new things
In a typical recommender system
  People provide recommendations (input)
  The system aggregates and directs to appropriate recipients.

Recommender Systems (2)

Motivations
  Can we automatically aggregate quotes like:
    “I like this book; you might be interested in it”
    “I saw this movie, you’ll like it”
    “Don’t go see that movie!”
  Finding new books, music, or movies, previously unknown to users
Applications
  Corporate Intranets
    Recommendations, finding domain experts, …
  Ecommerce
    Product recommendations – amazon, CDNOW, …
  Medical Applications
    Matching patients to doctors, clinical trials, …
  Customer Relationship Management
    Matching customer problems to internal experts in a support organization

Recommender Systems (3) - Types

Collaborative/Social-filtering system – aggregation of consumers’ preferences and recommendations to other users based on similarity in behavioral patterns

Content-based system – supervised machine learning used to induce a classifier to discriminate between interesting and uninteresting items for the user

Knowledge-based system – knowledge about users and products used to reason what meets the user’s requirements, using discrimination tree, decision support tools, case-based reasoning (CBR)
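The collaborative (social-filtering) type can be illustrated with a small user-based sketch. The user names, the ratings, and the choice of cosine similarity over co-rated items are all assumptions; production systems use more careful similarity and normalization.

```python
import math


def similarity(a, b):
    """Cosine similarity over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)


def recommend(user, ratings, top_n=3):
    """Rank unseen items by the similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], their)
        if sim <= 0:
            continue
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"book1": 5, "book2": 1},
    "bob": {"book1": 5, "book2": 1, "book3": 5, "book4": 2},
    "cat": {"book1": 1, "book2": 5, "book3": 1, "book4": 5},
}
print(recommend("ann", ratings))
```

Ann's tastes match Bob's far more closely than Cat's, so Bob's favorite unseen item is ranked first for her.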

Recommender Systems (4) - Example

Automatic Summarization (1)

Functionality
  indicative: to determine if the document would be of any interest
  informative: to reflect the original content as faithfully as possible under the compression rate
  evaluative: evaluation of the original document
Fluency
  fragmented vs connected text
Users
  generic vs user (query)-focused
Target Documents
  single vs multiple documents

Automatic Summarization (2)

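[Figure: summarization stages from Source(s) to Summary: Analysis, Selection, Condensation, Presentation, with an Intermediary Representation]
  Analysis: Word Frequencies, Clue Phrases, Layout, Syntax, Semantics, Discourse, Pragmatics
  Selection: Word Count, Clue Phrases, Statistical, Structural
  Condensation: Abstraction, Aggregation
  Presentation: Planning, Realization, Layout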

Automatic Summarization (3) - Approaches

Natural language understanding / generation
  Build knowledge representation of text
  Generate sentences summarizing content
  Hard to do well
Keyword summaries
  Display most significant keywords
  Easy to do
  Hard to read, poor representation of content
Sentence extraction
  Extract key sentences
  Medium hard
  Summaries often don’t read well
  Good representation of content
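Sentence extraction can be sketched with plain word frequencies. The scoring rule (average frequency of a sentence's words) and the regular expressions are assumptions; stop-word removal and clue phrases are omitted for brevity.

```python
import re
from collections import Counter


def extract_summary(text, n_sentences=1):
    """Pick the sentences whose words are most frequent in the text.

    Scores are averaged per word so long sentences are not favored
    automatically; selected sentences keep their original order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


# Hypothetical input text.
text = ("Search engines index documents. "
        "Search engines rank documents for search queries. "
        "Cats sleep.")
print(extract_summary(text, 1))
```

Extracted sentences are copied verbatim, which explains both strengths on the slide (good content coverage) and the weakness (the result may not read smoothly).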

Question Answering (Q/A) (1)

To provide an answer to a query, as opposed to a document
  Query: for factoid or exact answer
  Result: Ranked list of <document, answer string> pairs
    Answer string: 50-250 bytes
    Documents: to supplement the answer
Example from TREC-9
  How much folic acid should an expectant mother get daily?
  Who invented the paper clip?
  Where is Rider College located?
  Name a film in which Jude Law acted.

Question Answering (2)

System Flow (example)

[Figure: example Q/A system flow. Components: Query Analyzer, Doc Analyzer, Retrieval Engine, Answer Extractor. Data and resources: User Query, Query Set, Query Categories, Thesaurus, Rule Set, Document Set, Retrieved Docs, Candidates, Answer.]

Question Answering (3)

Assignment of query categories

Query: What is the fare cost for the round trip between New York and London on Concorde?
Rule applied: What [be] [ADJ] [NOUN] for
Main phrase extracted: fare cost
Categorization of the phrase: Financial loss
Assign a query category

Information Extraction

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Extracting Job Openings from the Web

Information Extraction (2)

Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources

Identify specific pieces of information in an unstructured or semi-structured textual document.

Transform this unstructured information into structured relations in a database/ontology.

[Figure: a news text is turned into a database (DB) record]
Source text (excerpt): “... terror attack in Tokyo, Japan ... afternoon of April 11 ... Minister of Commerce ... Tokyo ... homemade bomb ... 5 passers-by seriously injured ... 2 cars ... damaged ... the case in Germany ...”
Extracted record:
  Event: terror attack
  Date/time: 4.12, afternoon
  Location: Tokyo
  Target: Minister of Commerce
  Casualties: 5 seriously injured
  Property damage: 2 cars
  ...

Information Extraction (3) – flow (example)

document
  local text analysis
    lexical analysis
    name recognition
    partial syntactic analysis
    scenario pattern matching
  discourse analysis
    coreference analysis
    inference
  template generation
extracted templates

[Grishman, 1997]

Information Extraction: MUC (State of the Art – 1997)

NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production

Knowledge Extraction Vision

[Figure: knowledge extraction vision. Functions shown: Topic Discovery, Concept Indexing, Thread Creation, Term Translation, Document Translation, Story Segmentation, Entity Extraction, Fact Extraction, Multi-dimensional Meta-data Extraction. A timeline is labeled by month (J F M A M J J A); an “India Bombing” topic appears with the sources NY Times, Andhra Bhoomi, Dinamani, Dainik Jagran.]

Meta-Data
  EMPLOYEE / EMPLOYER Relationships:
    Jan Clesius works for Clesius Enterprises
    Bill Young works for InterMedia Inc.
  COMPANY / LOCATION Relationships:
    Clesius Enterprises is in New York, NY
    InterMedia Inc. is in Boston, MA

References

N.J. Belkin & W.B. Croft (1987). “Retrieval Techniques.” In Annual Review of Information Science and Technology 22 (ed: Williams), 109-145, New York: Elsevier & ASIS.

E. Glover, S. Lawrence, M. Gordon, W. Birmingham, C. Lee Giles (2001). “Web Search – Your Way.” Comm. of the ACM, 44 (12), 97-102.

Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.

Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.

Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.

References

James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.

Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.

Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html

Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.

Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”

Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag. (See http://nlp.cs.nyu.edu/publication/index.shtml)

Andrew McCallum and Kamal Nigam (1998). “A Comparison of Event Models for Naive Bayes Text Classification.” In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.