IR Winter 2010
23. Text summarization
Separate presentation (SIGIR 2004 tutorial)
IR Winter 2010
24. Collaborative filtering. Recommendation systems.
Examples
• http://www.netflix.com
  – Given “Pulp Fiction”, it recommends:
    • Apocalypse Now
    • Reservoir Dogs
    • Kill Bill: Vol. 1
    • Kill Bill: Vol. 2
    • American Beauty
• http://www.amazon.com
  – Given Philip Ball’s “Critical Mass”, here are Amazon’s recommendations:
    • The Wisdom of Crowds by James Surowiecki
    • The Paradox of Choice: Why More Is Less by Barry Schwartz
    • Why Life Speeds Up As You Get Older: How Memory Shapes our Past by Douwe Draaisma
    • Origin of Wealth: Evolution, Complexity, and the Radical Remaking of Economics by Eric D. Beinhocker
    • Freakonomics [Revised and Expanded]: A Rogue Economist Explores the Hidden Side of Everything by Steven D. Levitt
Examples
• http://www.pandora.com/
• http://www.google.com/search?hl=en&q=related:www.umich.edu/

• Main approaches:
  – Vector-based: represent each user as a vector
  – Graph-based: using random walks on bipartite graphs
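The vector-based approach can be sketched in a few lines: each user is a vector of item ratings, users are compared with cosine similarity, and unseen items are scored by similarity-weighted ratings from other users. A minimal sketch; all user names, items, and ratings below are invented for illustration.

```python
import math

def cosine(u, v):
    # u, v: dicts mapping item -> rating (sparse user vectors)
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    du = math.sqrt(sum(x * x for x in u.values()))
    dv = math.sqrt(sum(x * x for x in v.values()))
    return num / (du * dv) if du and dv else 0.0

def recommend(target, users, k=1):
    # Score each item the target has not rated by the similarity-weighted
    # ratings of the other users; return the top-k items.
    scores = {}
    for other in users:
        if other is target:
            continue
        w = cosine(target, other)
        for item, rating in other.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + w * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Real systems add rating normalization and neighborhood truncation, but the core "similar users like similar items" logic is just this.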
IR Winter 2010
25. Burstiness. Self-triggerability.
Slides by Zhuoran Chen
Burstiness
• Given the average per-document frequency of a word in a collection, can we predict how many times it will appear in a document?
• Church example: how many instances of “Noriega” will we see in a document?
• The first occurrence depends on DF, but the second does not!
• The adaptive language model
• The degree of adaptation depends on lexical content, independent of the frequency.
“word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph” -- Church&Gale
The 2-Poisson Model – Bookstein and Swanson
• Intuition: content-bearing words cluster in relevant documents; non-content words occur randomly.
• Method: a linear combination (mixture) of Poisson distributions.
• The two-Poisson model, surprisingly, can account for the occupancy distribution of most words.
f(k) = α · e^(−λ₁) λ₁^k / k! + (1 − α) · e^(−λ₂) λ₂^k / k!
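The mixture above is straightforward to compute. A minimal sketch; the parameter values used below are illustrative, not fitted to any corpus.

```python
import math

def poisson_pmf(k, lam):
    # Standard Poisson: e^(-lam) * lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

def two_poisson_pmf(k, alpha, lam1, lam2):
    # Two-Poisson mixture: with probability alpha the document is "elite"
    # for the word (high rate lam1), otherwise non-elite (low rate lam2).
    return alpha * poisson_pmf(k, lam1) + (1 - alpha) * poisson_pmf(k, lam2)
```

Because each component is a proper Poisson distribution, the mixture sums to 1 over k for any alpha in [0, 1].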
Term Burstiness
• Definitions of word frequency:
  – Term frequency (TF): count of occurrences in a given document
  – Document frequency (DF): count of documents in a corpus in which a word occurs
  – Generalized document frequency (DFj): like DF, but the word must occur at least j times
• DF/N: given a word, the chance we will see it in a document (the p in Church 2000).
• ∑TF/N: given a word, the average number of times we will see it in a document.
• Given that we have seen a word in one document, what’s the chance that we will see it again?
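The DF-based quantities above are easy to compute from a corpus. A minimal sketch; the example documents in the usage are invented.

```python
from collections import Counter

def df_j(docs, word, j=1):
    # Generalized document frequency: number of documents in which
    # `word` occurs at least j times (df_1 is the ordinary DF).
    return sum(1 for d in docs if Counter(d)[word] >= j)

def p_adapt(docs, word):
    # Pr(k >= 2 | k >= 1): having seen the word once in a document,
    # the chance of seeing it again, estimated as df_2 / df_1.
    df1 = df_j(docs, word, 1)
    return df_j(docs, word, 2) / df1 if df1 else 0.0
```

For bursty words like "Noriega", p_adapt is far larger than the per-document prior DF/N, which is exactly Church's observation.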
Adaptive model
• Church’s formulas:
  – Cache model: Pr(w) = λ·Pr_local(w) + (1−λ)·Pr_global(w)
  – History–test division; positive and negative adaptation:
    Pr(+adapt) = Pr(w in test | w in history)
    Pr(−adapt) = Pr(w in test | w not in history)
    Observation: Pr(+adapt) >> Pr(prior) > Pr(−adapt)
  – Generalized DF: df_j = number of documents with j or more instances of w.
    Pr(+adapt) ≈ Pr(k ≥ 2 | k ≥ 1) ≈ df₂ / df₁
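Church's cache model is a simple linear interpolation between a local (recent-history) estimate and a global (corpus) estimate. A hedged sketch; the function name and the λ = 0.5 default are ours, not Church's.

```python
def cache_prob(word, history, global_prob, lam=0.5):
    # Pr(w) = lam * Pr_local(w) + (1 - lam) * Pr_global(w), where the
    # local estimate is the word's relative frequency in the recent
    # history (the "cache"). lam = 0.5 is an illustrative choice.
    local = history.count(word) / len(history) if history else 0.0
    return lam * local + (1 - lam) * global_prob.get(word, 0.0)
```

A word just seen in the history gets a large boost over its global probability, which models self-triggerability.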
IR Winter 2010
26. Information Extraction. Hidden Markov Models.
Information Extraction
• Extracting database records from unstructured and semi-structured inputs
• Examples:
  – Recognizing names of people in text
  – Extracting prices from tables
  – Linking companies with products
  – Identifying positive vs. negative opinions
• Main steps:
  – Segmentation
  – Classification
  – Association
  – Clustering
FDA expands pet food recall
The nationwide pet food recall was expanded Wednesday to include products containing rice protein laced with melamine, a toxic agent, the Food and Drug Administration said.

Before this latest announcement, the FDA attributed pet illness and deaths to recalled pet food with wheat gluten found to contain melamine, a component of fertilizers and plastic utensils.

Also on Wednesday, Menu Foods, the company that recalled more than 60 million cans and pouches of wet cat and dog food on March 15, added one of its Natural Life brand products to its recall list. It added two product dates to eight of its already recalled pet foods.

The FDA has recorded 16 animal deaths related to the wheat gluten-pet food recall. However, other organizations have put the death toll in the thousands. After consumer complaints to Natural Balance of Pacoima, California, reporting kidney failure in several cats and dogs after eating the company's venison products, the firm issued a nationwide recall of its venison and brown rice canned and bagged dog foods and treats, and venison and green pea dry cat food, the FDA said.
FDA – organization
Food and Drug Administration – organization
Menu Foods – company
Natural Life – brand
Natural Balance – company
Pacoima – location
California – location
Landscape of IE Techniques
Lexicons
Alabama, Alaska, …, Wisconsin, Wyoming
Abraham Lincoln was born in Kentucky.
member?
Classify Pre-segmented Candidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternate window sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN END BEGIN END
BEGIN
Context Free Grammars
Abraham Lincoln was born in Kentucky.
NNP V P NPVNNP
NP
PP
VP
VP
S
Most likely parse?
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
Our Focus today!
Slide by William Cohen
Markov Property
[State-transition diagram over states S1, S2, S3 with edge probabilities 1/2, 1/2, 1/3, 2/3, and 1]

The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t−1}, q_{t−2}, …, q_1, q_0} given q_t.

In other words, the current state determines the probability distribution for the next state.

S1: rain, S2: cloud, S3: sun
Slide by Yunyao Li
Markov Property
[State-transition diagram over S1 (rain), S2 (cloud), S3 (sun) with edge probabilities 1/2, 1/2, 1/3, 2/3, and 1]

State-transition probabilities A = (a_ij), where a_ij = Pr(q_{t+1} = S_j | q_t = S_i); the recoverable entries are 0.5, 0.5, 0.67, 0.33, and 1 (each row sums to 1; remaining entries are 0).

S1: rain, S2: cloud, S3: sun
Q: given today is sunny (i.e., q1 = 3), what is the probability of “sun-cloud” with the model?
Slide by Yunyao Li
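The quiz above is answered by multiplying one-step transition probabilities. A sketch; since the slide's matrix did not survive extraction, the entries of A below are an assumption for illustration, and only the multiplication logic matters.

```python
# Toy weather chain. The transition matrix A is an ASSUMED example
# (rows = from-state, cols = to-state, order rain/cloud/sun); each row
# is a probability distribution over next states.
STATES = ["rain", "cloud", "sun"]
A = [
    [0.5,  0.5,  0.0],   # from rain (assumed)
    [0.33, 0.0,  0.67],  # from cloud (assumed)
    [0.0,  0.5,  0.5],   # from sun (assumed)
]

def seq_prob(seq, start):
    # P(q2, ..., qn | q1 = start): by the Markov property, multiply
    # one-step transition probabilities, conditioning only on the
    # previous state.
    p, prev = 1.0, STATES.index(start)
    for s in seq:
        cur = STATES.index(s)
        p *= A[prev][cur]
        prev = cur
    return p
```

Under these assumed numbers, "sun then cloud" given a sunny day is a(sun→sun) · a(sun→cloud); with the real matrix the same two-factor product applies.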
Hidden Markov Model

S1: rain, S2: cloud, S3: sun

[Diagram: the same state-transition structure (probabilities 1/2, 1/2, 1/3, 2/3, 1), now with observation (emission) probabilities 4/5, 1/10, 7/10, 1/5, 3/10, 9/10 attached to the hidden states]

state sequences (hidden) generate observations O1 O2 O3 O4 O5
Slide by Yunyao Li
IE with Hidden Markov Model
Given a sequence of observations:
  CS 6998 is held weekly at IPB.
and a trained HMM, find the most likely state sequence (Viterbi):
  s* = argmax_s P(s, o)
Any words said to be generated by the designated “course name” state are extracted as a course name:
  Course name: CS 6998
Slide by Yunyao Li
course name
location name
background
person name
Named Entity Extraction [Bikel et al 1998]
Hidden states: start-of-sentence, Person, Org, Other (plus five other name classes), end-of-sentence
Slide by Yunyao Li
Named Entity Extraction
Transition probabilities: P(s_t | s_{t−1}, o_{t−1})
Observation probabilities: P(o_t | s_t, s_{t−1}) or P(o_t | s_t, o_{t−1})

(1) Generating the first word of a name-class
(2) Generating the rest of the words in the name-class
(3) Generating “+end+” in a name-class
Slide by Yunyao Li
Training: Estimating Probabilities
Slide by Yunyao Li
Back-Off
“unknown words” and insufficient training data

Transition probabilities: P(s_t | s_{t−1}) → P(s_t)
Observation probabilities: P(o_t | s_t) → P(o_t)
Slide by Yunyao Li
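Backing off from the full observation model toward P(o) can be sketched as linear interpolation. A sketch; the fixed weight λ below is an illustrative stand-in for the count-based back-off weights the original model estimates from data.

```python
def backoff_emit(o, s, emit, unigram, lam=0.9):
    # P_hat(o | s) = lam * P(o | s) + (1 - lam) * P(o).
    # Words unseen with state s fall back to the state-independent
    # unigram estimate P(o), so unknown words never get probability 0
    # as long as they have some unigram mass.
    return lam * emit.get(s, {}).get(o, 0.0) + (1 - lam) * unigram.get(o, 0.0)
```

The same interpolation shape applies one level down (P(o) backed off toward a uniform distribution) to cover truly unseen words.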
HMM-Experimental Results
Train on ~500k words of news wire text.
Results: [table not preserved in this transcript]
Slide by Yunyao Li
IR Winter 2010
27. Probabilistic models of IR. Language models.
Slides by Manning, Schuetze, Raghavan
Probability Ranking Principle
Let x be a document in the collection. Let R represent relevance of a document w.r.t. given (fixed) query and let NR represent non-relevance.
p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)

p(x|R), p(x|NR) – probability that if a relevant (non-relevant) document is retrieved, it is x.
p(R), p(NR) – prior probability of retrieving a (non-)relevant document.
Need to find p(R|x) – the probability that a document x is relevant.

p(R|x) + p(NR|x) = 1
R={0,1} vs. NR/R
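The Bayes inversion above combines the likelihoods and the prior. A numeric sketch (the likelihood and prior values are invented for illustration):

```python
def posterior_relevance(p_x_given_r, p_x_given_nr, p_r):
    # Bayes' rule: p(R|x) = p(x|R) p(R) / p(x), where
    # p(x) = p(x|R) p(R) + p(x|NR) p(NR) and p(NR) = 1 - p(R).
    p_x = p_x_given_r * p_r + p_x_given_nr * (1.0 - p_r)
    return p_x_given_r * p_r / p_x
```

Swapping the two likelihoods yields p(NR|x), and the two posteriors sum to 1 as required.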
Binary Independence Model
• Traditionally used in conjunction with the PRP
• “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1):
  x = (x₁, …, x_n), with x_i = 1 iff term i is present in document x.
• “Independence”: terms occur in documents independently
• Different documents can be modeled as the same vector
• Bernoulli Naive Bayes model (cf. text categorization!)
Binary Independence Model
• Queries: binary term incidence vectors
• Given query q,
  – for each document d we need to compute p(R|q,d)
  – replace with computing p(R|q,x), where x is the binary term incidence vector representing d
  – we are interested only in ranking
• Will use odds and Bayes’ Rule:

  O(R|q,x) = p(R|q,x) / p(NR|q,x) = [ p(R|q) · p(x|R,q) ] / [ p(NR|q) · p(x|NR,q) ]
Binary Independence Model
• Using the independence assumption:

  p(x|R,q) / p(x|NR,q) = ∏_{i=1..n} [ p(x_i|R,q) / p(x_i|NR,q) ]

In O(R|q,x) = [ p(R|q) / p(NR|q) ] · [ p(x|R,q) / p(x|NR,q) ], the first factor is constant for a given query; only the second needs estimation.

• So:

  O(R|q,d) = O(R|q) · ∏_{i=1..n} [ p(x_i|R,q) / p(x_i|NR,q) ]
Binary Independence Model
  O(R|q,d) = O(R|q) · ∏_{i=1..n} [ p(x_i|R,q) / p(x_i|NR,q) ]

• Since x_i is either 0 or 1:

  O(R|q,d) = O(R|q) · ∏_{x_i=1} [ p(x_i=1|R,q) / p(x_i=1|NR,q) ] · ∏_{x_i=0} [ p(x_i=0|R,q) / p(x_i=0|NR,q) ]

• Let p_i = p(x_i=1|R,q) and r_i = p(x_i=1|NR,q).
• Assume p_i = r_i for all terms not occurring in the query (q_i = 0); this can be changed (e.g., in relevance feedback). The remaining factors then run over all matching terms and the non-matching query terms.
Binary Independence Model
Splitting the products over matching terms (x_i = q_i = 1) and non-matching query terms (x_i = 0, q_i = 1):

  O(R|q,x) = O(R|q) · ∏_{x_i=q_i=1} (p_i / r_i) · ∏_{x_i=0, q_i=1} [ (1−p_i) / (1−r_i) ]

           = O(R|q) · ∏_{x_i=q_i=1} [ p_i(1−r_i) / (r_i(1−p_i)) ] · ∏_{q_i=1} [ (1−p_i) / (1−r_i) ]

(The first product is over all matching terms; the last is over all query terms.)
Binary Independence Model
  O(R|q,x) = O(R|q) · ∏_{x_i=q_i=1} [ p_i(1−r_i) / (r_i(1−p_i)) ] · ∏_{q_i=1} [ (1−p_i) / (1−r_i) ]

The last product is constant for each query; the first is the only quantity that must be estimated for ranking.

• Retrieval Status Value:

  RSV = log ∏_{x_i=q_i=1} [ p_i(1−r_i) / (r_i(1−p_i)) ] = ∑_{x_i=q_i=1} log [ p_i(1−r_i) / (r_i(1−p_i)) ]
Binary Independence Model
• It all boils down to computing the RSV:

  RSV = ∑_{x_i=q_i=1} log [ p_i(1−r_i) / (r_i(1−p_i)) ] = ∑_{x_i=q_i=1} c_i,  where c_i = log [ p_i(1−r_i) / (r_i(1−p_i)) ]

So, how do we compute the c_i's from our data?
Binary Independence Model
• Estimating RSV coefficients.
• For each term i, look at this table of document counts:

  Documents      Relevant    Non-relevant       Total
  x_i = 1        s           n − s              n
  x_i = 0        S − s       N − n − S + s      N − n
  Total          S           N − S              N

• Estimates (for now, assume no zero counts; more in MSR12):

  p_i ≈ s / S
  r_i ≈ (n − s) / (N − S)
  c_i ≈ K(N, n, S, s) = log [ s(N − n − S + s) / ((S − s)(n − s)) ]
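The c_i's follow directly from the table of counts. A sketch; the add-0.5 smoothing below is our choice (one common way to avoid the zero counts the slide sets aside), not part of the slide's estimates.

```python
import math

def c_i(N, n, S, s, eps=0.5):
    # c_i = log [ p_i (1 - r_i) / (r_i (1 - p_i)) ], with the table's
    # estimates p_i ~ s/S and r_i ~ (n - s)/(N - S). Adding eps = 0.5
    # to each cell (an assumption) keeps the log finite when counts
    # are zero.
    p = (s + eps) / (S + 2 * eps)
    r = (n - s + eps) / (N - S + 2 * eps)
    return math.log(p * (1 - r) / (r * (1 - p)))
```

Terms that are relatively more frequent in relevant documents get positive weights; terms concentrated in non-relevant documents get negative ones.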
IR Winter 2010
28. Adversarial IR. Spamming and anti-spamming methods.
Adversarial IR
• We looked at spamming in the context of Naïve Bayes
• Let’s now consider spamming of hyperlinked IR
• The main idea: artificially increase your in-degree
• Link farms: groups of pages that point to each other
• Google penalizes sites that belong to link farms
IR Winter 2010
29. Human behavior on the Web.
Sample tasks
• Identifying sessions in query logs
• Predicting accesses to a given page (e.g., for caching)
• Recognizing human vs. automated queries
Analysis of Search Engine Query Logs
Study                     Sample size            Source SE   Time period
Lau & Horvitz             4,690 of 1 million     Excite      Sep 1997
Silverstein et al.        1 billion              AltaVista   6 weeks in Aug & Sep 1998
Spink et al. (series)     1 million per period   Excite      Sep 1997; Dec 1999; May 2001
Xie & O’Hallaron          110,000                Vivisimo    35 days, Jan & Feb 2001
                          1.9 million            Excite      8 hrs in a day, Dec 1999
This slide is from Pierre Baldi
Main Results
• The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
• The most common number of terms in a query is 2
• The majority of users don’t refine their query
  – The share of users who viewed only a single page increased from 29% (1997) to 51% (2001) (Excite)
  – 85% of users viewed only the first page of search results (AltaVista)
• 45% (2001) of queries are about Commerce, Travel, Economy, or People (vs. 20% in 1997)
  – Queries about adult content or entertainment decreased from around 20% (1997) to around 7% (2001)
This slide is from Pierre Baldi
Main Results
• All four studies produced a generally consistent set of findings about user behavior in a search-engine context:
  – most users view relatively few pages per query
  – most users don’t use advanced search features

[Figure: query length distributions (bars) with Poisson model fit (dots & lines)]
This slide is from Pierre Baldi
Power-law Characteristics
• Frequency f(r) of queries with rank r
  – 110,000 queries from Vivisimo
  – 1.9 million queries from Excite
• There are strong regularities in the patterns of behavior in how we search the Web

[Figure: the power law appears as a straight line in log-log space]
This slide is from Pierre Baldi
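A power law f(r) = C·r^(−β) is a straight line in log-log space, so its exponent can be estimated by ordinary least squares on the points (log r, log f(r)). A minimal sketch:

```python
import math

def powerlaw_fit(freqs):
    # freqs[r-1] = f(r), frequencies sorted by rank r = 1, 2, ...
    # Fit log f(r) = log C - beta * log r by least squares in log-log
    # space; return (C, beta).
    pts = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1)]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return math.exp(my - slope * mx), -slope
```

On real query logs the fit is usually run on binned data or the tail only, since the head ranks deviate from a pure power law.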