Information Retrieval (13) Prof. Dragomir R. Radev [email protected]


Page 1: Information Retrieval (13) Prof. Dragomir R. Radev radev@umich.edu

Information Retrieval (13)

Prof. Dragomir R. Radev

[email protected]

Page 2

IR Winter 2010

…23. Text summarization…

Page 3

Separate presentation (SIGIR 2004 tutorial)

Page 4

IR Winter 2010

…24. Collaborative filtering. Recommendation systems.…

Page 5

Examples

• http://www.netflix.com
  – Given “Pulp Fiction”, it recommends:
    • Apocalypse Now
    • Reservoir Dogs
    • Kill Bill: Vol. 1
    • Kill Bill: Vol. 2
    • American Beauty

• http://www.amazon.com
  – Given Philip Ball’s “Critical Mass”, here are Amazon’s recommendations:
    • The Wisdom of Crowds by James Surowiecki
    • The Paradox of Choice: Why More Is Less by Barry Schwartz
    • Why Life Speeds Up As You Get Older: How Memory Shapes our Past by Douwe Draaisma
    • Origin of Wealth: Evolution, Complexity, and the Radical Remaking of Economics by Eric D. Beinhocker
    • Freakonomics [Revised and Expanded]: A Rogue Economist Explores the Hidden Side of Everything by Steven D. Levitt

Page 6

Examples

• http://www.pandora.com/
• http://www.google.com/search?hl=en&q=related:www.umich.edu/

• Main approaches:
  – Vector-based: represent each user as a vector
  – Graph-based: using random walks on bipartite graphs
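The vector-based approach can be sketched in a few lines: each user becomes a vector of item ratings, and neighbors are found by cosine similarity. The users, films, and ratings below are hypothetical, purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two user rating vectors (dicts: item -> rating)."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical 1-5 ratings for a few films
alice = {"Pulp Fiction": 5, "Reservoir Dogs": 4, "American Beauty": 2}
bob   = {"Pulp Fiction": 4, "Reservoir Dogs": 5, "Kill Bill: Vol. 1": 5}
carol = {"American Beauty": 5, "Titanic": 4}

# Items liked by Alice's most similar neighbor become candidate recommendations
sims = sorted([("bob", cosine(alice, bob)), ("carol", cosine(alice, carol))],
              key=lambda t: -t[1])
```

A real recommender would also mean-center ratings and weight neighbors, but the core similarity computation is the same.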

Page 7

IR Winter 2010

…25. Burstiness. Self-triggerability…

Slides by Zhuoran Chen

Page 8

Burstiness

• Given the average per-document frequency of a word in a collection, can we predict how many times it will appear in a document?

• Church example: how many instances of “Noriega” will we see in a document?

• The first occurrence depends on DF, but the second does not!

• The adaptive language model
• The degree of adaptation depends on lexical content – it is independent of frequency.

“word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph” -- Church & Gale

Page 9

The 2-Poisson Model – Bookstein and Swanson

• Intuition: content-bearing words clustered in relevant documents; non-content words occur randomly.

• Method: a linear combination of Poisson distributions
• The two-Poisson model, surprisingly, could account for the occupancy distribution of most words.

f(k) = α · e^(−λ₁) λ₁^k / k!  +  (1 − α) · e^(−λ₂) λ₂^k / k!
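The two-Poisson mixture can be written directly in code. A minimal sketch, with illustrative parameters α, λ₁, λ₂ that are not taken from the lecture:

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k occurrences at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def two_poisson(k, alpha, lam1, lam2):
    """Mixture of two Poissons: alpha is the share of 'elite' (topical)
    documents with high rate lam1; the rest use low rate lam2."""
    return alpha * poisson_pmf(k, lam1) + (1 - alpha) * poisson_pmf(k, lam2)

# Illustrative content word: frequent in the few documents about its topic,
# nearly absent elsewhere (alpha, lam1, lam2 are assumptions)
probs = [two_poisson(k, alpha=0.1, lam1=5.0, lam2=0.1) for k in range(10)]
```

The mixture puts most mass at k = 0 but keeps a heavy tail at larger k, which a single Poisson cannot do.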

Page 10

Term Burstiness

• The definitions of word frequency
  – Term frequency or TF: count of occurrences in a given document
  – Document frequency or DF: count of documents in a corpus in which a word occurs
  – Generalized document frequency or DFj: like DF, but a word must occur at least j times
• DF/N: given a word, the chance we will see it in a document (the p in Church 2000)
• ∑TF/N: given a word, the average number of times we will see it in a document
• Given we have seen a word in one document, what’s the chance that we will see it again?

Page 11

Adaptive model

• Church’s formulas
  – Cache model
    Pr(w) = λ Pr_local(w) + (1 − λ) Pr_global(w)
  – History-test division; positive and negative adaptation
    Pr(+adapt) = Pr(w in test | w in history)
    Pr(−adapt) = Pr(w in test | w not in history)
    Observation: Pr(+adapt) >> Pr(prior) > Pr(−adapt)
  – Generalized DF
    df_j = number of documents with j or more instances of w

Pr(+adapt) ≈ Pr(k ≥ 2 | k ≥ 1) = df₂ / df₁
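Church's adaptation estimate can be reproduced on a toy corpus. The documents below are invented; the point is only that df₂/df₁ (the chance of a repeat, given one occurrence) can far exceed the prior df₁/N:

```python
def df_j(docs, word, j):
    """Generalized document frequency: documents containing `word` >= j times."""
    return sum(1 for d in docs if d.count(word) >= j)

# Toy corpus (hypothetical): 'noriega' is bursty -- absent from most
# documents, but repeated whenever it appears at all.
docs = [
    "noriega noriega noriega trial".split(),
    "noriega panama noriega".split(),
    "weather sunny today".split(),
    "market stocks fell".split(),
]

df1 = df_j(docs, "noriega", 1)
df2 = df_j(docs, "noriega", 2)
p_prior = df1 / len(docs)   # chance of seeing the word in a document at all
p_adapt = df2 / df1         # Pr(k >= 2 | k >= 1), Church's Pr(+adapt)
```

Here p_adapt is 1.0 while the prior is only 0.5: having seen the word once, a repeat is far more likely than the first occurrence was.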

Page 12

IR Winter 2010

…26. Information Extraction. Hidden Markov Models. …

Page 13

Information Extraction

• Extracting database records from unstructured and semi-structured inputs

• Examples:
  – Recognizing names of people in text
  – Extracting prices from tables
  – Linking companies with products
  – Identifying positive vs. negative opinions

• Main steps:
  – Segmentation
  – Classification
  – Association
  – Clustering

Page 14

FDA expands pet food recall

The nationwide pet food recall was expanded Wednesday to include products containing rice protein laced with melamine, a toxic agent, the Food and Drug Administration said.

Before this latest announcement, the FDA attributed pet illness and deaths to recalled pet food with wheat gluten found to contain melamine, a component of fertilizers and plastic utensils.

Also on Wednesday, Menu Foods, the company that recalled more than 60 million cans and pouches of wet cat and dog food on March 15, added one of its Natural Life brand products to its recall list. It added two product dates to eight of its already recalled pet foods.

The FDA has recorded 16 animal deaths related to the wheat gluten-pet food recall. However, other organizations have put the death toll in the thousands.

After consumer complaints to Natural Balance of Pacoima, California, reporting kidney failure in several cats and dogs after eating the company's venison products, the firm issued a nationwide recall of its venison and brown rice canned and bagged dog foods and treats, and venison and green pea dry cat food, the FDA said.

FDA – organization
Food and Drug Administration – organization
Menu Foods – company
Natural Life – brand
Natural Balance – company
Pacoima – location
California – location

Page 15

Landscape of IE Techniques

Lexicons

Alabama, Alaska, …, Wisconsin, Wyoming

Abraham Lincoln was born in Kentucky.

member?

Classify Pre-segmented Candidates

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Sliding Window

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Try alternate window sizes:

Boundary Models

Abraham Lincoln was born in Kentucky.

Classifier

which class?

BEGIN END BEGIN END

BEGIN

Context Free Grammars

Abraham Lincoln was born in Kentucky.

(parse-tree residue: POS tags NNP NNP V V P NP with nonterminals NP, PP, VP, S)

Most likely parse?

Finite State Machines

Abraham Lincoln was born in Kentucky.

Most likely state sequence?

Our Focus today!

Slide by William Cohen

Page 16

Markov Property

(figure: three-state diagram with transition probabilities 1/2, 1/2, 1/3, 2/3, 1)

The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, …, q_1, q_0} given q_t.

In other words, the current state determines the probability distribution for the next state.

S1: rain, S2: cloud, S3: sun

Slide by Yunyao Li

Page 17

Markov Property

(figure: three-state diagram — S1: rain, S2: cloud, S3: sun — with transition probabilities 1/2, 1/2, 1/3, 2/3, 1)

State-transition probabilities:

A = ⎡ 0.5   0.5   0    ⎤
    ⎢ 0.67  0     0.33 ⎥
    ⎣ 0     0     1    ⎦

Q: given today is sunny (i.e., q1 = 3), what is the probability of “sun-cloud” with the model?

Slide by Yunyao Li
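The Markov property makes the probability of any state sequence a product of one-step transition probabilities. A sketch; the matrix row order below is one plausible reading of the slide's figure, not a verified transcription:

```python
# States: 0 = rain, 1 = cloud, 2 = sun. Row order is an assumption;
# the entries come from the slide's transition probabilities.
A = [
    [0.5,  0.5,  0.0],   # from rain
    [0.67, 0.0,  0.33],  # from cloud
    [0.0,  0.0,  1.0],   # from sun
]

def sequence_prob(states, A):
    """Probability of a state sequence given its first state: by the
    Markov property it factors into one-step transition probabilities."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

p_rain_cloud_rain = sequence_prob([0, 1, 0], A)  # rain -> cloud -> rain
```

Each row of A sums to 1, since some next state must be chosen from every state.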

Page 18

Hidden Markov Model

S1: rain, S2: cloud, S3: sun

(figure: the three-state chain, now hidden, emits observations; emission probabilities include 4/5, 1/10, 7/10, 1/5, 3/10, 9/10)

state sequences → observations O1 O2 O3 O4 O5

Slide by Yunyao Li

Page 19

IE with Hidden Markov Model

Given a sequence of observations:

  CS 6998 is held weekly at IPB.

and a trained HMM with states {course name, location name, person name, background},

find the most likely state sequence (Viterbi):

  s* = argmax_s P(s, o)

Any words said to be generated by the designated “course name” state are extracted as a course name:

  Course name: CS 6998

Slide by Yunyao Li
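Viterbi decoding can be sketched in plain Python. The two-state tagger and all probabilities below are invented for illustration; they are not the trained HMM from the slide:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence."""
    # V[t][s] = (best probability of reaching s at time t, best path to s)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (V[-1][ps][0] * trans_p[ps].get(s, 0.0) * emit_p[s].get(o, 0.0),
                 V[-1][ps][1] + [s])
                for ps in states
            )
        V.append(row)
    return max(V[-1].values())[1]

# Hypothetical two-state tagger: 'course' (course-name words) vs 'bg' (background)
states = ["course", "bg"]
start_p = {"course": 0.3, "bg": 0.7}
trans_p = {"course": {"course": 0.7, "bg": 0.3},
           "bg":     {"course": 0.2, "bg": 0.8}}
emit_p  = {"course": {"CS": 0.5, "6998": 0.4, "is": 0.01},
           "bg":     {"CS": 0.01, "6998": 0.01, "is": 0.4}}

path = viterbi(["CS", "6998", "is"], states, start_p, trans_p, emit_p)
```

Words assigned the "course" state would then be extracted as the course name. Real implementations work in log space to avoid underflow on long sequences.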

Page 20

Named Entity Extraction [Bikel et al., 1998]

Hidden states: start-of-sentence → {Person, Org, (five other name classes), Other} → end-of-sentence

Slide by Yunyao Li

Page 21

Named Entity Extraction

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1})  or  P(o_t | s_t, o_{t-1})

(1) Generating the first word of a name class
(2) Generating the rest of the words in the name class
(3) Generating “+end+” in a name class

Slide by Yunyao Li

Page 22

Training: Estimating Probabilities

Slide by Yunyao Li

Page 23

Back-Off

“Unknown words” and insufficient training data.

Transition probabilities: P(s_t | s_{t-1}) → P(s_t)
Observation probabilities: P(o_t | s_t) → P(o_t)

Slide by Yunyao Li
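One simple way to realize this back-off is linear interpolation between the specific estimate and the backed-off one. This sketches the idea only; it is not Bikel et al.'s exact weighting scheme, and the counts below are invented:

```python
def interpolated(word, state, emit_counts, state_totals, word_counts, n, lam=0.8):
    """P(o_t | s_t) backed off to the state-independent P(o_t):
    a linear-interpolation sketch of back-off smoothing."""
    p_ws = (emit_counts.get((word, state), 0) / state_totals[state]
            if state_totals.get(state) else 0.0)   # specific estimate
    p_w = word_counts.get(word, 0) / n              # backed-off estimate
    return lam * p_ws + (1 - lam) * p_w

# Toy counts (hypothetical)
emit_counts = {("Lincoln", "PERSON"): 3, ("Kentucky", "LOC"): 2, ("born", "OTHER"): 5}
state_totals = {"PERSON": 3, "LOC": 2, "OTHER": 5}
word_counts = {"Lincoln": 3, "Kentucky": 2, "born": 5}
n = 10

p_seen = interpolated("Lincoln", "PERSON", emit_counts, state_totals, word_counts, n)
p_backed = interpolated("born", "PERSON", emit_counts, state_totals, word_counts, n)
```

The pair ("born", "PERSON") was never observed, yet it still gets nonzero probability via the unigram back-off, which is the whole point of the scheme.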

Page 24

HMM: Experimental Results

Trained on ~500k words of newswire text.

Results:

Slide by Yunyao Li

Page 25

IR Winter 2010

…27. Probabilistic models of IR. Language models…

Slides by Manning, Schuetze, Raghavan

Page 26

Probability Ranking Principle

Let x be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance.

p(R|x) = p(x|R) p(R) / p(x)

p(NR|x) = p(x|NR) p(NR) / p(x)

p(x|R), p(x|NR) – probability that if a relevant (non-relevant) document is retrieved, it is x.

p(R), p(NR) – prior probability of retrieving a (non-)relevant document.

Need to find p(R|x) – the probability that a document x is relevant.

p(R|x) + p(NR|x) = 1

R = {0,1} vs. NR/R

Page 27

Binary Independence Model

• Traditionally used in conjunction with the PRP
• “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):
  – x = (x₁, …, x_n)
  – x_i = 1 iff term i is present in document x
• “Independence”: terms occur in documents independently
• Different documents can be modeled as the same vector
• Bernoulli Naive Bayes model (cf. text categorization!)
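Building a binary incidence vector is a one-liner; note that repeated occurrences of a term collapse to a single 1, since counts are discarded. The vocabulary and document below are invented:

```python
def incidence_vector(doc_tokens, vocab):
    """Binary term-incidence vector: x_i = 1 iff term i occurs in the document."""
    terms = set(doc_tokens)
    return [1 if t in terms else 0 for t in vocab]

# Hypothetical four-term vocabulary and a toy document
vocab = ["antony", "brutus", "caesar", "cleopatra"]
doc = "caesar was ambitious and brutus stabbed caesar".split()
x = incidence_vector(doc, vocab)
# 'caesar' appears twice but still contributes a single 1
```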

Page 28

Binary Independence Model

• Queries: binary term-incidence vectors
• Given query q,
  – for each document d, need to compute p(R|q,d)
  – replace with computing p(R|q,x), where x is the binary term-incidence vector representing d
  – interested only in ranking
• Will use odds and Bayes’ Rule:

O(R|q,x) = p(R|q,x) / p(NR|q,x) = [p(R|q) · p(x|R,q)] / [p(NR|q) · p(x|NR,q)]

Page 29

Binary Independence Model

• Using the independence assumption:

p(x|R,q) / p(x|NR,q) = ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

• So:

O(R|q,x) = [p(R|q) / p(NR|q)] · ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)
         = O(R|q) · ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

O(R|q) is constant for a given query; the product needs estimation.

Page 30

Binary Independence Model

O(R|q,x) = O(R|q) · ∏_{i=1..n} p(x_i|R,q) / p(x_i|NR,q)

• Since x_i is either 0 or 1:

O(R|q,x) = O(R|q) · ∏_{x_i=1} p(x_i=1|R,q) / p(x_i=1|NR,q) · ∏_{x_i=0} p(x_i=0|R,q) / p(x_i=0|NR,q)

• Let p_i = p(x_i=1|R,q);  r_i = p(x_i=1|NR,q)
• Assume, for all terms not occurring in the query (q_i=0), p_i = r_i
  (this can be changed, e.g., in relevance feedback)

Then...

Page 31

Binary Independence Model

O(R|q,x) = O(R|q) · ∏_{x_i=1, q_i=1} p_i / r_i · ∏_{x_i=0, q_i=1} (1 − p_i) / (1 − r_i)
                     [all matching terms]         [non-matching query terms]

         = O(R|q) · ∏_{x_i=1, q_i=1} p_i (1 − r_i) / [r_i (1 − p_i)] · ∏_{q_i=1} (1 − p_i) / (1 − r_i)
                     [all matching terms]                               [all query terms]

Page 32

Binary Independence Model

O(R|q,x) = O(R|q) · ∏_{x_i=1, q_i=1} p_i (1 − r_i) / [r_i (1 − p_i)] · ∏_{q_i=1} (1 − p_i) / (1 − r_i)

O(R|q) and the last product are constant for each query; the middle product is the only quantity to be estimated for rankings.

• Retrieval Status Value:

RSV = log ∏_{x_i=1, q_i=1} p_i (1 − r_i) / [r_i (1 − p_i)] = ∑_{x_i=1, q_i=1} log [p_i (1 − r_i)] / [r_i (1 − p_i)]

Page 33

Binary Independence Model

• All boils down to computing RSV:

RSV = ∑_{x_i=1, q_i=1} log [p_i (1 − r_i)] / [r_i (1 − p_i)]

RSV = ∑_{x_i=1, q_i=1} c_i ;   c_i = log [p_i (1 − r_i)] / [r_i (1 − p_i)]

So, how do we compute the c_i’s from our data?

Page 34

Binary Independence Model

• Estimating RSV coefficients
• For each term i, look at this table of document counts:

  Documents    Relevant    Non-Relevant    Total
  x_i = 1      s           n − s           n
  x_i = 0      S − s       N − n − S + s   N − n
  Total        S           N − S           N

• Estimates (for now, assume no zero terms; more in MSR12):

  p_i ≈ s / S
  r_i ≈ (n − s) / (N − S)

  c_i = K(N, n, S, s) = log [ s (N − n − S + s) ] / [ (n − s) (S − s) ]
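The coefficient c_i is a few lines of code given the contingency counts. The document counts below are hypothetical, chosen so that no table cell is zero, as the slide assumes:

```python
import math

def c_i(N, n, S, s):
    """RSV term weight from the contingency counts: N docs total, S relevant;
    the term occurs in n docs, s of them relevant. Assumes no zero cells
    (in practice 0.5 is often added to each cell to smooth)."""
    p = s / S                 # p_i = P(x_i = 1 | R)
    r = (n - s) / (N - S)     # r_i = P(x_i = 1 | NR)
    return math.log(p * (1 - r) / (r * (1 - p)))

# Hypothetical counts: 100 docs, 10 relevant; term occurs in 20 docs, 8 relevant
weight = c_i(N=100, n=20, S=10, s=8)
```

This agrees with the closed form on the slide: log[s(N − n − S + s)] / [(n − s)(S − s)] = log(8·78 / (12·2)) = log 26 for these counts.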

Page 35

IR Winter 2010

…28. Adversarial IR. Spamming and anti-spamming methods. …

Page 36

Adversarial IR

• We looked at spamming in the context of Naïve Bayes
• Let’s now consider spamming of hyperlinked IR
• The main idea: artificially increase your in-degree
• Link farms: groups of pages that point to each other
• Google penalizes sites that belong to link farms

Page 37

IR Winter 2010

…29. Human behavior on the Web. …

Page 38

Sample tasks

• Identifying sessions in query logs

• Predicting accesses to a given page (e.g., for caching)

• Recognizing human vs. automated queries

Page 39

Analysis of Search Engine Query Logs

Study                       | Sample size               | Source SE | Time period
Lau & Horvitz               | 4,690 of 1 million        | Excite    | Sep 1997
Silverstein et al.          | 1 billion                 | AltaVista | 6 weeks in Aug & Sep 1998
Spink et al. (series)       | 1 million per time period | Excite    | Sep 1997, Dec 1999, May 2001
Xie & O’Hallaron            | 110,000                   | Vivisimo  | 35 days in Jan & Feb 2001
                            | 1.9 million               | Excite    | 8 hrs in a day, Dec 1999

This slide is from Pierre Baldi

Page 40

Main Results

• The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
• The most common number of terms in a query is 2
• The majority of users don’t refine their query
  – The number of users who viewed only a single page increased from 29% (1997) to 51% (2001) (Excite)
  – 85% of users viewed only the first page of search results (AltaVista)
• 45% of queries (2001) are about Commerce, Travel, Economy, or People (up from 20% in 1997)
  – Queries about adult content or entertainment decreased from around 20% (1997) to around 7% (2001)

This slide is from Pierre Baldi

Page 41

Main Results

• All four studies produced a generally consistent set of findings about user behavior in a search-engine context:
  – most users view relatively few pages per query
  – most users don’t use advanced search features

(figure: query-length distributions as bars, with a fitted Poisson model as dots and lines)

This slide is from Pierre Baldi
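The Poisson fit in the figure can be reproduced by estimating the rate as the sample mean of query lengths and comparing observed counts with Poisson-expected counts. The query-length sample below is invented:

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Poisson probability of exactly k at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical sample of query lengths (terms per query)
lengths = [1, 2, 2, 2, 3, 2, 1, 3, 4, 2, 2, 3, 1, 2, 5]

lam = sum(lengths) / len(lengths)   # maximum-likelihood Poisson rate
observed = Counter(lengths)
expected = {k: poisson_pmf(k, lam) * len(lengths) for k in range(1, 6)}
```

The sample mean (about 2.3 here) and mode (2) are consistent with the statistics on the previous slide; a bar-vs-dots plot of `observed` against `expected` reproduces the figure's comparison.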

Page 42

Power-law Characteristics

• Frequency f(r) of queries with rank r
  – 110,000 queries from Vivisimo
  – 1.9 million queries from Excite
• There are strong regularities in the patterns of behavior in how we search the Web

(figure: the power law appears as a straight line in log-log space)

This slide is from Pierre Baldi
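A power law f(r) = C · r^(−a) is a straight line of slope −a in log-log space, so the exponent can be fit by least squares on the logs. The frequencies below are synthetic, generated from a known exponent to check the fit:

```python
import math

# Synthetic Zipf-like query frequencies: f(r) = C * r^(-a) with a = 1.2
ranks = list(range(1, 101))
freqs = [1000.0 * r ** -1.2 for r in ranks]

# log f = log C - a * log r: a straight line with slope -a in log-log space,
# fit here by ordinary least squares
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
```

On real query logs the points are noisy, but the straight-line fit in log-log space recovers the exponent the same way.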