Information Retrieval (13) Prof. Dragomir R. Radev [email protected]


Page 1: Information Retrieval (13) Prof. Dragomir R. Radev radev@umich.edu

Information Retrieval (13)

Prof. Dragomir R. Radev

[email protected]

Page 2

IR Winter 2010

…23. Text summarization…

Page 3

Separate presentation (SIGIR 2004 tutorial)

Page 4

IR Winter 2010

…24. Collaborative filtering. Recommendation systems.…

Page 5

Examples

• http://www.netflix.com
  – Given “Pulp Fiction”, it recommends:
    • Apocalypse Now
    • Reservoir Dogs
    • Kill Bill: Vol. 1
    • Kill Bill: Vol. 2
    • American Beauty

• http://www.amazon.com
  – Given Philip Ball’s “Critical Mass”, here are Amazon’s recommendations:
    • The Wisdom of Crowds by James Surowiecki
    • The Paradox of Choice: Why More Is Less by Barry Schwartz
    • Why Life Speeds Up As You Get Older: How Memory Shapes our Past by Douwe Draaisma
    • Origin of Wealth: Evolution, Complexity, and the Radical Remaking of Economics by Eric D. Beinhocker
    • Freakonomics [Revised and Expanded]: A Rogue Economist Explores the Hidden Side of Everything by Steven D. Levitt

Page 6

Examples

• http://www.pandora.com/
• http://www.google.com/search?hl=en&q=related:www.umich.edu/

• Main approaches:
  – Vector-based: represent each user as a vector
  – Graph-based: using random walks on bipartite graphs
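The vector-based approach can be sketched in a few lines: each user becomes a vector of item ratings, and neighbors are found by cosine similarity. The users, films, and ratings below are hypothetical, purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two user rating vectors (dicts: item -> rating)."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical 1-5 ratings for a few films
alice = {"Pulp Fiction": 5, "Reservoir Dogs": 4, "American Beauty": 2}
bob   = {"Pulp Fiction": 4, "Reservoir Dogs": 5, "Kill Bill: Vol. 1": 5}
carol = {"American Beauty": 5, "Titanic": 4}

# Items liked by Alice's most similar neighbor become candidate recommendations
sims = sorted([("bob", cosine(alice, bob)), ("carol", cosine(alice, carol))],
              key=lambda t: -t[1])
```

A real recommender would also mean-center ratings and weight neighbors, but the core similarity computation is the same.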

Page 7

IR Winter 2010

…25. Burstiness. Self-triggerability…

Slides by Zhuoran Chen

Page 8

Burstiness

• Given the average per-document frequency of a word in a collection, can we predict how many times it will appear in a document?

• Church example: how many instances of “Noriega” will we see in a document?

• The first occurrence depends on DF, but the second does not!

• The adaptive language model
• The degree of adaptation depends on lexical content – it is independent of frequency.

“word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph” -- Church & Gale

Page 9

The 2-Poisson Model – Bookstein and Swanson

• Intuition: content-bearing words clustered in relevant documents; non-content words occur randomly.

• Method: a linear combination of Poisson distributions
• The two-Poisson model, surprisingly, could account for the occupancy distribution of most words.

f(k) = α · e^(−λ₁) λ₁^k / k!  +  (1 − α) · e^(−λ₂) λ₂^k / k!
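The two-Poisson mixture can be written directly in code. A minimal sketch, with illustrative parameters α, λ₁, λ₂ that are not taken from the lecture:

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k occurrences at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def two_poisson(k, alpha, lam1, lam2):
    """Mixture of two Poissons: alpha is the share of 'elite' (topical)
    documents with high rate lam1; the rest use low rate lam2."""
    return alpha * poisson_pmf(k, lam1) + (1 - alpha) * poisson_pmf(k, lam2)

# Illustrative content word: frequent in the few documents about its topic,
# nearly absent elsewhere (alpha, lam1, lam2 are assumptions)
probs = [two_poisson(k, alpha=0.1, lam1=5.0, lam2=0.1) for k in range(10)]
```

The mixture puts most mass at k = 0 but keeps a heavy tail at larger k, which a single Poisson cannot do.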

Page 10

Term Burstiness

• The definitions of word frequency
  – Term frequency or TF: count of occurrences in a given document
  – Document frequency or DF: count of documents in a corpus in which a word occurs
  – Generalized document frequency or DFj: like DF, but a word must occur at least j times
• DF/N: given a word, the chance we will see it in a document (the p in Church 2000)
• ∑TF/N: given a word, the average number of times we will see it in a document
• Given we have seen a word in one document, what’s the chance that we will see it again?

Page 11

Adaptive model

• Church’s formulas
  – Cache model
    Pr(w) = λ Pr_local(w) + (1 − λ) Pr_global(w)
  – History-test division; positive and negative adaptation
    Pr(+adapt) = Pr(w in test | w in history)
    Pr(−adapt) = Pr(w in test | w not in history)
    Observation: Pr(+adapt) >> Pr(prior) > Pr(−adapt)
  – Generalized DF
    df_j = number of documents with j or more instances of w

Pr(+adapt) ≈ Pr(k ≥ 2 | k ≥ 1) = df₂ / df₁
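Church's adaptation estimate can be reproduced on a toy corpus. The documents below are invented; the point is only that df₂/df₁ (the chance of a repeat, given one occurrence) can far exceed the prior df₁/N:

```python
def df_j(docs, word, j):
    """Generalized document frequency: documents containing `word` >= j times."""
    return sum(1 for d in docs if d.count(word) >= j)

# Toy corpus (hypothetical): 'noriega' is bursty -- absent from most
# documents, but repeated whenever it appears at all.
docs = [
    "noriega noriega noriega trial".split(),
    "noriega panama noriega".split(),
    "weather sunny today".split(),
    "market stocks fell".split(),
]

df1 = df_j(docs, "noriega", 1)
df2 = df_j(docs, "noriega", 2)
p_prior = df1 / len(docs)   # chance of seeing the word in a document at all
p_adapt = df2 / df1         # Pr(k >= 2 | k >= 1), Church's Pr(+adapt)
```

Here p_adapt is 1.0 while the prior is only 0.5: having seen the word once, a repeat is far more likely than the first occurrence was.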

Page 12

IR Winter 2010

…26. Information Extraction. Hidden Markov Models. …

Page 13

Information Extraction

• Extracting database records from unstructured and semi-structured inputs

• Examples:
  – Recognizing names of people in text
  – Extracting prices from tables
  – Linking companies with products
  – Identifying positive vs. negative opinions

• Main steps:
  – Segmentation
  – Classification
  – Association
  – Clustering

Page 14

FDA expands pet food recall

The nationwide pet food recall was expanded Wednesday to include products containing rice protein laced with melamine, a toxic agent, the Food and Drug Administration said.

Before this latest announcement, the FDA attributed pet illness and deaths to recalled pet food with wheat gluten found to contain melamine, a component of fertilizers and plastic utensils.

Also on Wednesday, Menu Foods, the company that recalled more than 60 million cans and pouches of wet cat and dog food on March 15, added one of its Natural Life brand products to its recall list. It added two product dates to eight of its already recalled pet foods.

The FDA has recorded 16 animal deaths related to the wheat gluten-pet food recall. However, other organizations have put the death toll in the thousands.

After consumer complaints to Natural Balance of Pacoima, California, reporting kidney failure in several cats and dogs after eating the company's venison products, the firm issued a nationwide recall of its venison and brown rice canned and bagged dog foods and treats, and venison and green pea dry cat food, the FDA said.

FDA – organization
Food and Drug Administration – organization
Menu Foods – company
Natural Life – brand
Natural Balance – company
Pacoima – location
California – location

Page 15

Landscape of IE Techniques

Lexicons

Alabama, Alaska, …, Wisconsin, Wyoming

Abraham Lincoln was born in Kentucky.

member?

Classify Pre-segmented Candidates

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Sliding Window

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Try alternate window sizes:

Boundary Models

Abraham Lincoln was born in Kentucky.

Classifier

which class?

BEGIN END BEGIN END

BEGIN

Context Free Grammars

Abraham Lincoln was born in Kentucky.

(parse-tree residue: POS tags NNP NNP V V P NP with nonterminals NP, PP, VP, S)

Most likely parse?

Finite State Machines

Abraham Lincoln was born in Kentucky.

Most likely state sequence?

Our Focus today!

Slide by William Cohen

Page 16

Markov Property

(figure: three-state diagram with transition probabilities 1/2, 1/2, 1/3, 2/3, 1)

The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, …, q_1, q_0} given q_t.

In other words, the current state determines the probability distribution for the next state.

S1: rain, S2: cloud, S3: sun

Slide by Yunyao Li

Page 17

Markov Property

(figure: three-state diagram — S1: rain, S2: cloud, S3: sun — with transition probabilities 1/2, 1/2, 1/3, 2/3, 1)

State-transition probabilities:

A = ⎡ 0.5   0.5   0    ⎤
    ⎢ 0.67  0     0.33 ⎥
    ⎣ 0     0     1    ⎦

Q: given today is sunny (i.e., q1 = 3), what is the probability of “sun-cloud” with the model?

Slide by Yunyao Li
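The Markov property makes the probability of any state sequence a product of one-step transition probabilities. A sketch; the matrix row order below is one plausible reading of the slide's figure, not a verified transcription:

```python
# States: 0 = rain, 1 = cloud, 2 = sun. Row order is an assumption;
# the entries come from the slide's transition probabilities.
A = [
    [0.5,  0.5,  0.0],   # from rain
    [0.67, 0.0,  0.33],  # from cloud
    [0.0,  0.0,  1.0],   # from sun
]

def sequence_prob(states, A):
    """Probability of a state sequence given its first state: by the
    Markov property it factors into one-step transition probabilities."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

p_rain_cloud_rain = sequence_prob([0, 1, 0], A)  # rain -> cloud -> rain
```

Each row of A sums to 1, since some next state must be chosen from every state.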

Page 18

Hidden Markov Model

S1: rain, S2: cloud, S3: sun

(figure: the three-state chain, now hidden, emits observations; emission probabilities include 4/5, 1/10, 7/10, 1/5, 3/10, 9/10)

state sequences → observations O1 O2 O3 O4 O5

Slide by Yunyao Li

Page 19

IE with Hidden Markov Model

Given a sequence of observations:

  CS 6998 is held weekly at IPB.

and a trained HMM with states {course name, location name, person name, background},

find the most likely state sequence (Viterbi):

  s* = argmax_s P(s, o)

Any words said to be generated by the designated “course name” state are extracted as a course name:

  Course name: CS 6998

Slide by Yunyao Li
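Viterbi decoding can be sketched in plain Python. The two-state tagger and all probabilities below are invented for illustration; they are not the trained HMM from the slide:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence."""
    # V[t][s] = (best probability of reaching s at time t, best path to s)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (V[-1][ps][0] * trans_p[ps].get(s, 0.0) * emit_p[s].get(o, 0.0),
                 V[-1][ps][1] + [s])
                for ps in states
            )
        V.append(row)
    return max(V[-1].values())[1]

# Hypothetical two-state tagger: 'course' (course-name words) vs 'bg' (background)
states = ["course", "bg"]
start_p = {"course": 0.3, "bg": 0.7}
trans_p = {"course": {"course": 0.7, "bg": 0.3},
           "bg":     {"course": 0.2, "bg": 0.8}}
emit_p  = {"course": {"CS": 0.5, "6998": 0.4, "is": 0.01},
           "bg":     {"CS": 0.01, "6998": 0.01, "is": 0.4}}

path = viterbi(["CS", "6998", "is"], states, start_p, trans_p, emit_p)
```

Words assigned the "course" state would then be extracted as the course name. Real implementations work in log space to avoid underflow on long sequences.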

Page 20

Named Entity Extraction [Bikel et al., 1998]

Hidden states: start-of-sentence → {Person, Org, (five other name classes), Other} → end-of-sentence

Slide by Yunyao Li

Page 21

Named Entity Extraction

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1})  or  P(o_t | s_t, o_{t-1})

(1) Generating the first word of a name class
(2) Generating the rest of the words in the name class
(3) Generating “+end+” in a name class

Slide by Yunyao Li

Page 22

Training: Estimating Probabilities

Slide by Yunyao Li

Page 23

Back-Off

“Unknown words” and insufficient training data.

Transition probabilities: P(s_t | s_{t-1}) → P(s_t)
Observation probabilities: P(o_t | s_t) → P(o_t)

Slide by Yunyao Li
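One simple way to realize this back-off is linear interpolation between the specific estimate and the backed-off one. This sketches the idea only; it is not Bikel et al.'s exact weighting scheme, and the counts below are invented:

```python
def interpolated(word, state, emit_counts, state_totals, word_counts, n, lam=0.8):
    """P(o_t | s_t) backed off to the state-independent P(o_t):
    a linear-interpolation sketch of back-off smoothing."""
    p_ws = (emit_counts.get((word, state), 0) / state_totals[state]
            if state_totals.get(state) else 0.0)   # specific estimate
    p_w = word_counts.get(word, 0) / n              # backed-off estimate
    return lam * p_ws + (1 - lam) * p_w

# Toy counts (hypothetical)
emit_counts = {("Lincoln", "PERSON"): 3, ("Kentucky", "LOC"): 2, ("born", "OTHER"): 5}
state_totals = {"PERSON": 3, "LOC": 2, "OTHER": 5}
word_counts = {"Lincoln": 3, "Kentucky": 2, "born": 5}
n = 10

p_seen = interpolated("Lincoln", "PERSON", emit_counts, state_totals, word_counts, n)
p_backed = interpolated("born", "PERSON", emit_counts, state_totals, word_counts, n)
```

The pair ("born", "PERSON") was never observed, yet it still gets nonzero probability via the unigram back-off, which is the whole point of the scheme.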

Page 24

HMM: Experimental Results

Trained on ~500k words of newswire text.

Results:

Slide by Yunyao Li

Page 25

IR Winter 2010

…27. Probabilistic models of IR. Language models…

Slides by Manning, Schuetze, Raghavan

Page 26

Probability Ranking Principle

Let x be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance.

p(R|x) = p(x|R) p(R) / p(x)

p(NR|x) = p(x|NR) p(NR) / p(x)

p(x|R), p(x|NR) – probability that if a relevant (non-relevant) document is retrieved, it is x.

p(R), p(NR) – prior probability of retrieving a (non-)relevant document.

Need to find p(R|x) – the probability that a document x is relevant.

p(R|x) + p(NR|x) = 1

R = {0,1} vs. NR/R

Page 27

Binary Independence Model

• Traditionally used in conjunction with the PRP
• “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):
  – x = (x₁, …, x_n)
  – x_i = 1 iff term i is present in document x
• “Independence”: terms occur in documents independently
• Different documents can be modeled as the same vector
• Bernoulli Naive Bayes model (cf. text categorization!)
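Building a binary incidence vector is a one-liner; note that repeated occurrences of a term collapse to a single 1, since counts are discarded. The vocabulary and document below are invented:

```python
def incidence_vector(doc_tokens, vocab):
    """Binary term-incidence vector: x_i = 1 iff term i occurs in the document."""
    terms = set(doc_tokens)
    return [1 if t in terms else 0 for t in vocab]

# Hypothetical four-term vocabulary and a toy document
vocab = ["antony", "brutus", "caesar", "cleopatra"]
doc = "caesar was ambitious and brutus stabbed caesar".split()
x = incidence_vector(doc, vocab)
# 'caesar' appears twice but still contributes a single 1
```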

Page 28

Binary Independence Model

• Queries: binary term-incidence vectors
• Given query q,
  – for each document d, need to compute p(R|q,d)
  – replace with computing p(R|q,x), where x is the binary term-incidence vector representing d
  – interested only in ranking
• Will use odds and Bayes’ Rule:

O(R|q,x) = p(R|q,x) / p(NR|q,x) = [p(R|q) · p(x|R,q)] / [p(NR|q) · p(x|NR,q)]

Page 29

Binary Independence Model

• Using the independence assumption:

p(x|R,q) / p(x|NR,q) = ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

• So:

O(R|q,x) = [p(R|q) / p(NR|q)] · ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)
         = O(R|q) · ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

O(R|q) is constant for a given query; the product needs estimation.

Page 30

Binary Independence Model

O(R|q,x) = O(R|q) · ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

• Since x_i is either 0 or 1:

O(R|q,x) = O(R|q) · ∏_{x_i=1} p(x_i=1|R,q) / p(x_i=1|NR,q) · ∏_{x_i=0} p(x_i=0|R,q) / p(x_i=0|NR,q)

• Let p_i = p(x_i=1|R,q);  r_i = p(x_i=1|NR,q)
• Assume, for all terms not occurring in the query (q_i=0), p_i = r_i
  (this can be changed, e.g., in relevance feedback)

Then...

Page 31

Binary Independence Model

O(R|q,x) = O(R|q) · ∏_{x_i=1, q_i=1} p_i / r_i · ∏_{x_i=0, q_i=1} (1 − p_i) / (1 − r_i)
                     [all matching terms]         [non-matching query terms]

         = O(R|q) · ∏_{x_i=1, q_i=1} p_i (1 − r_i) / [r_i (1 − p_i)] · ∏_{q_i=1} (1 − p_i) / (1 − r_i)
                     [all matching terms]                               [all query terms]

Page 32

Binary Independence Model

O(R|q,x) = O(R|q) · ∏_{x_i=1, q_i=1} p_i (1 − r_i) / [r_i (1 − p_i)] · ∏_{q_i=1} (1 − p_i) / (1 − r_i)

O(R|q) and the last product are constant for each query; the middle product is the only quantity to be estimated for rankings.

• Retrieval Status Value:

RSV = log ∏_{x_i=1, q_i=1} p_i (1 − r_i) / [r_i (1 − p_i)] = ∑_{x_i=1, q_i=1} log [p_i (1 − r_i)] / [r_i (1 − p_i)]

Page 33

Binary Independence Model

• All boils down to computing RSV:

RSV = ∑_{x_i=1, q_i=1} log [p_i (1 − r_i)] / [r_i (1 − p_i)]

RSV = ∑_{x_i=1, q_i=1} c_i ;   c_i = log [p_i (1 − r_i)] / [r_i (1 − p_i)]

So, how do we compute the c_i’s from our data?

Page 34

Binary Independence Model

• Estimating RSV coefficients
• For each term i, look at this table of document counts:

  Documents    Relevant    Non-Relevant    Total
  x_i = 1      s           n − s           n
  x_i = 0      S − s       N − n − S + s   N − n
  Total        S           N − S           N

• Estimates (for now, assume no zero terms; more in MSR12):

  p_i ≈ s / S
  r_i ≈ (n − s) / (N − S)

  c_i = K(N, n, S, s) = log [ s (N − n − S + s) ] / [ (n − s) (S − s) ]
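The coefficient c_i is a few lines of code given the contingency counts. The document counts below are hypothetical, chosen so that no table cell is zero, as the slide assumes:

```python
import math

def c_i(N, n, S, s):
    """RSV term weight from the contingency counts: N docs total, S relevant;
    the term occurs in n docs, s of them relevant. Assumes no zero cells
    (in practice 0.5 is often added to each cell to smooth)."""
    p = s / S                 # p_i = P(x_i = 1 | R)
    r = (n - s) / (N - S)     # r_i = P(x_i = 1 | NR)
    return math.log(p * (1 - r) / (r * (1 - p)))

# Hypothetical counts: 100 docs, 10 relevant; term occurs in 20 docs, 8 relevant
weight = c_i(N=100, n=20, S=10, s=8)
```

This agrees with the closed form on the slide: log[s(N − n − S + s)] / [(n − s)(S − s)] = log(8·78 / (12·2)) = log 26 for these counts.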

Page 35

IR Winter 2010

…28. Adversarial IR. Spamming and anti-spamming methods. …

Page 36

Adversarial IR

• We looked at spamming in the context of Naïve Bayes
• Let’s now consider spamming of hyperlinked IR
• The main idea: artificially increase your in-degree
• Link farms: groups of pages that point to each other
• Google penalizes sites that belong to link farms

Page 37

IR Winter 2010

…29. Human behavior on the Web. …

Page 38

Sample tasks

• Identifying sessions in query logs

• Predicting accesses to a given page (e.g., for caching)

• Recognizing human vs. automated queries

Page 39

Analysis of Search Engine Query Logs

Study                       | Sample size               | Source SE | Time period
Lau & Horvitz               | 4,690 of 1 million        | Excite    | Sep 1997
Silverstein et al.          | 1 billion                 | AltaVista | 6 weeks in Aug & Sep 1998
Spink et al. (series)       | 1 million per time period | Excite    | Sep 1997, Dec 1999, May 2001
Xie & O’Hallaron            | 110,000                   | Vivisimo  | 35 days in Jan & Feb 2001
                            | 1.9 million               | Excite    | 8 hrs in a day, Dec 1999

This slide is from Pierre Baldi

Page 40

Main Results

• The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
• The most common number of terms in a query is 2
• The majority of users don’t refine their query
  – The number of users who viewed only a single page increased from 29% (1997) to 51% (2001) (Excite)
  – 85% of users viewed only the first page of search results (AltaVista)
• 45% of queries (2001) are about Commerce, Travel, Economy, or People (up from 20% in 1997)
  – Queries about adult content or entertainment decreased from around 20% (1997) to around 7% (2001)

This slide is from Pierre Baldi

Page 41

Main Results

• All four studies produced a generally consistent set of findings about user behavior in a search-engine context:
  – most users view relatively few pages per query
  – most users don’t use advanced search features

(figure: query-length distributions as bars, with a fitted Poisson model as dots and lines)

This slide is from Pierre Baldi
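The Poisson fit in the figure can be reproduced by estimating the rate as the sample mean of query lengths and comparing observed counts with Poisson-expected counts. The query-length sample below is invented:

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Poisson probability of exactly k at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical sample of query lengths (terms per query)
lengths = [1, 2, 2, 2, 3, 2, 1, 3, 4, 2, 2, 3, 1, 2, 5]

lam = sum(lengths) / len(lengths)   # maximum-likelihood Poisson rate
observed = Counter(lengths)
expected = {k: poisson_pmf(k, lam) * len(lengths) for k in range(1, 6)}
```

The sample mean (about 2.3 here) and mode (2) are consistent with the statistics on the previous slide; a bar-vs-dots plot of `observed` against `expected` reproduces the figure's comparison.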

Page 42

Power-law Characteristics

• Frequency f(r) of queries with rank r
  – 110,000 queries from Vivisimo
  – 1.9 million queries from Excite
• There are strong regularities in the patterns of behavior in how we search the Web

(figure: the power law appears as a straight line in log-log space)

This slide is from Pierre Baldi
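A power law f(r) = C · r^(−a) is a straight line of slope −a in log-log space, so the exponent can be fit by least squares on the logs. The frequencies below are synthetic, generated from a known exponent to check the fit:

```python
import math

# Synthetic Zipf-like query frequencies: f(r) = C * r^(-a) with a = 1.2
ranks = list(range(1, 101))
freqs = [1000.0 * r ** -1.2 for r in ranks]

# log f = log C - a * log r: a straight line with slope -a in log-log space,
# fit here by ordinary least squares
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
```

On real query logs the points are noisy, but the straight-line fit in log-log space recovers the exponent the same way.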