2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 1
Dragon Star Program Course (龙星计划课程): Information Retrieval
Statistical Language Models for IR
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
Outline
• More about statistical language models in general
• Systematic review of language models for IR
  – The basic language modeling approach
  – Advanced language models
  – KL-divergence retrieval model and feedback
  – Language models for special retrieval tasks
More about statistical language models in general
What is a Statistical LM?
• A probability distribution over word sequences
  – p(“Today is Wednesday”) ≈ 0.001
  – p(“Today Wednesday is”) ≈ 0.0000000000001
  – p(“The eigenvalue is positive”) ≈ 0.00001
• Context/topic dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, and is thus also called a “generative” model
Why is a LM Useful?
• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
  – Given that we see “John” and “feels”, how likely are we to see “happy” as opposed to “habit” as the next word? (speech recognition)
  – Given that we observe “baseball” three times and “game” once in a news article, how likely is it that the article is about “sports”? (text categorization, information retrieval)
  – Given that a user is interested in sports news, how likely is the user to use “baseball” in a query? (information retrieval)
Source-Channel Framework (Model of Communication System [Shannon 48])

[Figure: Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination. The source emits X with probability P(X), the noisy channel turns X into Y with probability P(Y|X), and the receiver outputs X’ from Y. P(X|Y) = ?]

$$\hat{X} = \arg\max_X p(X|Y) = \arg\max_X p(Y|X)\,p(X) \quad \text{(Bayes Rule)}$$

When X is text, p(X) is a language model.

Many examples:
• Speech recognition: X = Word sequence, Y = Speech signal
• Machine translation: X = English sentence, Y = Chinese sentence
• OCR error correction: X = Correct word, Y = Erroneous word
• Information retrieval: X = Document, Y = Query
• Summarization: X = Summary, Y = Document
Basic Issues
• Define the probabilistic model
  – Events, random variables, joint/conditional probabilities
  – $p(w_1 w_2 \ldots w_n) = f(\theta_1, \theta_2, \ldots, \theta_m)$
• Estimate model parameters
  – Tune the model to best fit the data and our prior knowledge
  – $\theta_i = ?$
• Apply the model to a particular task
  – Many applications
The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word independently
• Thus, $p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2) \cdots p(w_n)$
• Parameters: $\{p(w_i)\}$, with $p(w_1) + \ldots + p(w_N) = 1$ (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM

[Figure: a (unigram) language model θ, p(w|θ), generates a document d by sampling.]
• Topic 1 (text mining): … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001 … → text mining paper
• Topic 2 (health): … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 … → food nutrition paper

Given θ, p(d|θ) varies according to d
Estimation of Unigram LM

[Figure: document → estimation → (unigram) language model θ, p(w|θ) = ?]
• Document word counts (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
• Estimated model: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100, …

How good is the estimated model?
It gives our document sample the highest probability, but it doesn’t generalize well. More about this later…
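
The estimation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `mle_unigram` and the filler token `<other>` (standing in for the 75 words the slide elides) are assumptions made here.

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood unigram LM: p(w|theta) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Counts copied from the slide; "<other>" is a hypothetical filler token
# that pads the document to the stated length of 100 words.
slide_counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
                "algorithm": 2, "query": 1, "efficient": 1, "<other>": 75}
document = [w for w, c in slide_counts.items() for _ in range(c)]
theta = mle_unigram(document)

print(theta["text"])                  # relative frequency of "text"
print(theta.get("clustering", 0.0))   # an unseen word gets probability zero
```

The zero probability for an unseen word is what the slide means by "doesn't generalize well"; smoothing, covered later, addresses it.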
Empirical Distribution of Words
• There are stable language-independent patterns in how people use natural languages
• A few words occur very frequently; most occur rarely. E.g., in news articles,
  – Top 4 words: 10~15% of word occurrences
  – Top 50 words: 35~40% of word occurrences
• The most frequent word in one corpus may be rare in another
Zipf’s Law
• rank × frequency ≈ constant:
$$F(w) = \frac{C}{r(w)^{\alpha}}, \quad \alpha \approx 1,\ C \approx 0.1$$
• Generalized Zipf’s law (applicable in many domains):
$$F(w) = \frac{C}{[r(w) + B]^{\alpha}}$$

[Figure: word frequency plotted against word rank (by frequency). Annotations: the highest-frequency words (stop words) make up the biggest data structure; the most useful words (Luhn 57) lie in the middle; for the low-frequency tail, is “too rare” a problem?]
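
A rank-frequency table makes the "rank × frequency ≈ constant" claim easy to check on any token list. The sketch below is illustrative only; the helper name `rank_frequency` and the toy sample (built so that counts are proportional to 1/rank) are assumptions, not data from the lecture.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, relative frequency, rank * relative frequency) rows."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        freq = count / total
        rows.append((rank, word, freq, rank * freq))
    return rows

# Hypothetical toy sample with counts 6, 3, 2, i.e. proportional to 1/rank.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)   # the last column stays constant for an ideal Zipfian sample
```

On a real corpus the last column only stays roughly constant, which is why the slide writes ≈ rather than =.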
More Sophisticated LMs
• N-gram language models
  – In general, $p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2|w_1) \cdots p(w_n|w_1 \ldots w_{n-1})$
  – n-gram: conditioned only on the past n-1 words
  – E.g., bigram: $p(w_1 \ldots w_n) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_2) \cdots p(w_n|w_{n-1})$
• Remote-dependence language models (e.g., Maximum Entropy model)
• Structured language models (e.g., probabilistic context-free grammar)
• These will not be covered in detail in this course. If interested, read [Jelinek 98, Manning & Schutze 99, Rosenfeld 00]
Why Just Unigram Models?
• Difficulty in moving toward more complex models
  – They involve more parameters, so they need more data to estimate (a document is an extremely small sample)
  – They increase the computational complexity significantly, both in time and space
• Capturing word order or structure may not add much value for “topical inference”
• But using more sophisticated models can still be expected to improve performance ...
Evaluation of SLMs
• Direct evaluation criterion: How well does the model fit the data to be modeled?
  – Example measures: data likelihood, perplexity, cross entropy, Kullback-Leibler divergence (mostly equivalent)
• Indirect evaluation criterion: Does the model help improve the performance of the task?
  – The specific measure is task dependent
  – For retrieval, we look at whether a model helps improve retrieval accuracy
  – We hope more “reasonable” LMs would achieve better retrieval performance
What You Should Know
• How the source-channel framework can model many different problems
• Why unigram LMs seem to be sufficient for IR
• Zipf’s law
Systematic Review of Language Models for IR
Representative LMs for IR (up to 2006)

[Figure: timeline (1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005 -) of representative work, with regions labeled “Basic LM (Query Likelihood)”, “Improved Basic LM”, “Query/Rel Model & Feedback”, “Special IR tasks”, and “Dissertations”. Entries recoverable from the figure:]
• Query likelihood scoring: Ponte & Croft 98; Hiemstra & Kraaij 99; Miller et al. 99
• Beyond unigram: Song & Croft 99
• Translation model: Berger & Lafferty 99
• Parameter sensitivity: Ng 00
• Smoothing examined: Zhai & Lafferty 01a
• Theoretical justification: Lafferty & Zhai 01a, 01b
• Relevance LM: Lavrenko & Croft 01
• Markov-chain query model: Lafferty & Zhai 01b
• Model-based FB: Zhai & Lafferty 01b
• Two-stage LMs: Zhai & Lafferty 02
• URL prior: Kraaij et al. 02
• Title LM: Jin et al. 02
• Term-specific smoothing: Hiemstra 02
• Bayesian query likelihood: Zaragoza et al. 03
• Concept likelihood: Srikanth & Srihari 03
• Time prior: Li & Croft 03
• Rel. query FB: Nallapati et al. 03
• Parsimonious LM: Hiemstra et al. 04
• Cluster smoothing: Liu & Croft 04; Tao et al. 06
• Cluster LM: Kurland & Lee 04
• Dependency LM: Gao et al. 04
• Thesauri: Cao et al. 05
• Query expansion: Bai et al. 05
• Pseudo query: Kurland et al. 05
• Robust est.: Tao & Zhai 06
• Dissertations: Ponte 98; Hiemstra 01; Berger 01; Zhai 02; Kraaij 04; Lavrenko 04
• Entries shown without a topic label: Xu & Croft 99; Xu et al. 01; Zhang et al. 02; Cronen-Townsend et al. 02; Si et al. 02; Lavrenko et al. 02; Ogilvie & Callan 03; Zhai et al. 03; Srikanth 04; Shen et al. 05; Kurland & Lee 05; Tan et al. 06; Tao 06; Kurland 06
Ponte & Croft’s Pioneering Work [Ponte & Croft 98]
• Contribution 1:
  – A new “query likelihood” scoring method: p(Q|D)
  – [Maron and Kuhns 60] had the idea of query likelihood, but didn’t work out how to estimate p(Q|D)
• Contribution 2:
  – Connecting LMs with text representation and weighting in IR
  – [Wong & Yao 89] had the idea of representing text with a multinomial distribution (relative frequency), but didn’t study the estimation problem
• Good performance is reported using the simple query likelihood method
Early Work (1998-1999)
• At about the same time as SIGIR 98, in TREC 7, two groups explored similar ideas independently: BBN [Miller et al., 99] & Univ. of Twente [Hiemstra & Kraaij 99]
• In TREC-8, Ng from MIT motivated the same query likelihood method in a different way [Ng 99]
• All follow the simple query likelihood method; the methods differ in how the model is estimated and in the event model for the query
• All show promising empirical results
• Main problems:
  – Feedback is explored heuristically
  – Lack of understanding of why the method works…
Later Work (1999-)
• Attempt to understand why LMs work [Zhai & Lafferty 01a, Lafferty & Zhai 01a, Ponte 01, Greiff & Morgan 03, Sparck Jones et al. 03, Lavrenko 04]
• Further extend/improve the basic LMs [Song & Croft 99, Berger & Lafferty 99, Jin et al. 02, Nallapati & Allan 02, Hiemstra 02, Zaragoza et al. 03, Srikanth & Srihari 03, Nallapati et al. 03, Li & Croft 03, Gao et al. 04, Liu & Croft 04, Kurland & Lee 04, Hiemstra et al. 04, Cao et al. 05, Tao et al. 06]
• Explore alternative ways of using LMs for retrieval (mostly query/relevance model estimation) [Xu & Croft 99, Lavrenko & Croft 01, Lafferty & Zhai 01a, Zhai & Lafferty 01b, Lavrenko 04, Kurland et al. 05, Bai et al. 05, Tao & Zhai 06]
• Explore the use of SLMs for special retrieval tasks [Xu & Croft 99, Xu et al. 01, Lavrenko et al. 02, Cronen-Townsend et al. 02, Zhang et al. 02, Ogilvie & Callan 03, Zhai et al. 03, Kurland & Lee 05, Shen et al. 05, Balog et al. 06, Fang & Zhai 07]
Review of LM for IR:
Part 1. Basic Language Modeling Approach
The Basic LM Approach [Ponte & Croft 98]

[Figure: each document has its own language model with unknown word probabilities.]
• Text mining paper → language model: … text ?, mining ?, association ?, clustering ?, … food ? …
• Food nutrition paper → language model: … food ?, nutrition ?, healthy ?, diet ? …

Query = “data mining algorithms”
Which model would most likely have generated this query?
Ranking Docs by Query Likelihood

[Figure: each document d1, d2, …, dN is mapped to a doc LM θ_d1, θ_d2, …, θ_dN; the query q is scored against each model by its query likelihood p(q|θ_d1), p(q|θ_d2), …, p(q|θ_dN).]
Modeling Queries: Different Assumptions
• Multi-Bernoulli: Modeling word presence/absence
  – $q = (x_1, \ldots, x_{|V|})$, $x_i = 1$ for presence of word $w_i$; $x_i = 0$ for absence
  – Parameters: $\{p(w_i=1|d),\ p(w_i=0|d)\}$, with $p(w_i=1|d) + p(w_i=0|d) = 1$
$$p(q=(x_1,\ldots,x_{|V|})\,|\,d) = \prod_{i=1}^{|V|} p(w_i = x_i|d) = \prod_{i=1,\,x_i=1}^{|V|} p(w_i=1|d) \prod_{i=1,\,x_i=0}^{|V|} p(w_i=0|d)$$
• Multinomial (Unigram LM): Modeling word frequency
  – $q = q_1 \ldots q_m$, where $q_j$ is a query word
  – $c(w_i,q)$ is the count of word $w_i$ in query $q$
  – Parameters: $\{p(w_i|d)\}$, with $p(w_1|d) + \ldots + p(w_{|V|}|d) = 1$
$$p(q=q_1 \ldots q_m\,|\,d) = \prod_{j=1}^{m} p(q_j|d) = \prod_{i=1}^{|V|} p(w_i|d)^{c(w_i,q)}$$

[Ponte & Croft 98] uses Multi-Bernoulli; most other work uses multinomial. Multinomial seems to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04]
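
To make the difference between the two event models concrete, here is a small sketch that scores the same query under each. The function names and all probability values are hypothetical and chosen only for illustration; the parameters are assumed to be already smoothed so that no logarithm of zero occurs.

```python
import math
from collections import Counter

def multinomial_loglik(query, p_w):
    """log p(q|d) = sum_i c(w_i, q) * log p(w_i|d)."""
    return sum(c * math.log(p_w[w]) for w, c in Counter(query).items())

def multi_bernoulli_loglik(query, p_present):
    """log p(q|d) over the presence/absence of every vocabulary word."""
    present = set(query)
    return sum(math.log(p if w in present else 1.0 - p)
               for w, p in p_present.items())

# Hypothetical parameters for one document over a three-word vocabulary.
p_w = {"data": 0.3, "mining": 0.5, "food": 0.2}          # multinomial, sums to 1
p_present = {"data": 0.6, "mining": 0.8, "food": 0.1}    # one Bernoulli per word

query = ["data", "mining", "mining"]
print(multinomial_loglik(query, p_w))            # the repeated "mining" counts twice
print(multi_bernoulli_loglik(query, p_present))  # repetition is ignored, absence of "food" counts
```

The multinomial model rewards repeated query words, while the multi-Bernoulli model only sees which vocabulary words are present and also charges for the absent ones.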
Retrieval as LM Estimation
• Document ranking based on query likelihood:
$$\log p(q|d) = \sum_{i=1}^{m} \log p(q_i|d) = \sum_{i=1}^{|V|} c(w_i,q)\,\log p(w_i|d), \quad \text{where } q = q_1 q_2 \ldots q_m$$
  Here $p(w_i|d)$ is the document language model.
• Retrieval problem ≈ estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
• Many smoothing methods are available
Which smoothing method is the best?

It depends on the data and the task!

Cross validation is generally used to choose the best method and/or set the smoothing parameters…

For retrieval, Dirichlet prior performs well…

Backoff smoothing [Katz 87] doesn’t work well due to a lack of 2nd-stage smoothing…

Note that many other smoothing methods exist. See [Chen & Goodman 98] and other publications in speech recognition…
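
As a rough sketch of how a smoothing method plugs into query likelihood ranking, the code below implements two of the methods named in this lecture in their commonly used forms: Jelinek-Mercer, (1-λ)·c(w,d)/|d| + λ·p(w|C), and Dirichlet prior, (c(w,d) + μ·p(w|C))/(|d| + μ). The slides do not spell these formulas out at this point, so treat them, the function names, the toy documents, and the parameter values as assumptions for illustration.

```python
import math
from collections import Counter

def jelinek_mercer(word, doc_counts, doc_len, p_coll, lam=0.1):
    """(1 - lam) * c(w,d)/|d| + lam * p(w|C); lam is an illustrative value."""
    return (1 - lam) * doc_counts[word] / doc_len + lam * p_coll[word]

def dirichlet_prior(word, doc_counts, doc_len, p_coll, mu=2000):
    """(c(w,d) + mu * p(w|C)) / (|d| + mu); mu is an illustrative value."""
    return (doc_counts[word] + mu * p_coll[word]) / (doc_len + mu)

def query_log_likelihood(query, doc, p_coll, smooth, **params):
    """log p(q|d) = sum over query words of log p_smoothed(w|d)."""
    doc_counts = Counter(doc)
    return sum(math.log(smooth(w, doc_counts, len(doc), p_coll, **params))
               for w in query)

# Hypothetical two-document collection.
d1 = "text mining algorithms for text data".split()
d2 = "food nutrition data for healthy diet".split()
coll_counts = Counter(d1 + d2)
p_coll = {w: c / (len(d1) + len(d2)) for w, c in coll_counts.items()}

query = ["data", "mining"]
for name, smooth in [("JM", jelinek_mercer), ("Dirichlet", dirichlet_prior)]:
    scores = [query_log_likelihood(query, d, p_coll, smooth) for d in (d1, d2)]
    print(name, scores)   # d2 lacks "mining" but still gets a finite score
```

Which method and which parameter value work best is an empirical question, as the slide says; cross validation over λ or μ is the usual way to choose.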
Comparison of Three Methods [Zhai & Lafferty 01a]

Query Type   Jelinek-Mercer   Dirichlet   Abs. Discounting
Title        0.228            0.256       0.237
Long         0.278            0.276       0.260

[Figure: bar chart “Relative performance of JM, Dir. and AD”; x-axis: method (JM, DIR, AD); y-axis: precision (0 to 0.3); series: title query, long query.]

Comparison is performed on a variety of test collections
The Dual-Role of Smoothing [Zhai & Lafferty 02]

[Figure: two panels, one for keyword queries and one for verbose queries, each with one curve for short queries and one for long queries.]

Why does query type affect smoothing sensitivity?
Another Reason for Smoothing

Query = “the algorithms for data mining”

               the     algorithms   for     data    mining
pDML(w|d1):    0.04    0.001        0.02    0.002   0.003
pDML(w|d2):    0.02    0.001        0.01    0.003   0.004

Content words: p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), p(“mining”|d1) < p(“mining”|d2)

Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…

So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps achieve this goal…

After smoothing with $p(w|d) = 0.1\,p_{DML}(w|d) + 0.9\,p(w|REF)$, we get p(q|d1) < p(q|d2)!

                     the     algorithms   for     data       mining
P(w|REF):            0.2     0.00001      0.2     0.00001    0.00001
Smoothed p(w|d1):    0.184   0.000109     0.182   0.000209   0.000309
Smoothed p(w|d2):    0.182   0.000109     0.181   0.000309   0.000409
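
The numbers on this slide can be checked directly. The short script below uses only the probabilities given on the slide; the variable and function names are chosen here for illustration.

```python
import math

# Word order: the, algorithms, for, data, mining (values from the slide).
p_ml = {"d1": [0.04, 0.001, 0.02, 0.002, 0.003],
        "d2": [0.02, 0.001, 0.01, 0.003, 0.004]}
p_ref = [0.2, 0.00001, 0.2, 0.00001, 0.00001]

def smooth(p_doc, p_background, doc_weight=0.1):
    """p(w|d) = 0.1 * p_DML(w|d) + 0.9 * p(w|REF), as on the slide."""
    return [doc_weight * pd + (1 - doc_weight) * pb
            for pd, pb in zip(p_doc, p_background)]

unsmoothed = {d: math.prod(p) for d, p in p_ml.items()}
smoothed = {d: math.prod(smooth(p, p_ref)) for d, p in p_ml.items()}

print(unsmoothed["d1"] > unsmoothed["d2"])   # d1 wins before smoothing
print(smoothed["d1"] < smoothed["d2"])       # d2 wins after smoothing
```

Before smoothing, d1 wins only because it contains more occurrences of “the” and “for”. After smoothing, those two probabilities are almost identical across documents, so the content words decide the ranking.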
Two-stage Smoothing [Zhai & Lafferty 02]

$$P(w|d) = (1-\lambda)\,\frac{c(w,d) + \mu\,p(w|C)}{|d| + \mu} + \lambda\,p(w|U)$$

• Stage-1 (parameter μ, collection LM p(w|C))
  – Explain unseen words
  – Dirichlet prior (Bayesian)
• Stage-2 (parameter λ, user background model p(w|U))
  – Explain noise in query
  – 2-component mixture
  – The user background model can be approximated by p(w|C)
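
The two-stage formula can be written as a single function. This is a minimal sketch under stated assumptions: the name `two_stage`, the toy counts, and the values of μ and λ are illustrative, and the user background model is approximated by the collection model, as the slide suggests is possible.

```python
def two_stage(word, doc_counts, doc_len, p_coll, p_user, mu, lam):
    """(1 - lam) * (c(w,d) + mu * p(w|C)) / (|d| + mu) + lam * p(w|U)."""
    stage1 = (doc_counts.get(word, 0) + mu * p_coll[word]) / (doc_len + mu)
    return (1 - lam) * stage1 + lam * p_user[word]

# Hypothetical toy values.
doc_counts = {"text": 3, "mining": 1}
doc_len = 4
p_coll = {"text": 0.2, "mining": 0.1, "food": 0.7}
p_user = p_coll   # approximation p(w|U) ~ p(w|C)

for w in p_coll:
    print(w, two_stage(w, doc_counts, doc_len, p_coll, p_user, mu=10, lam=0.3))
```

Setting `lam=0` recovers plain Dirichlet prior smoothing, and the smoothed probabilities still sum to one over the vocabulary.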
Estimating μ using leave-one-out [Zhai & Lafferty 02]

[Figure: leave-one-out. Each word w1, w2, …, wn of a document is removed in turn and predicted from the rest of the document: P(w1|d - w1), P(w2|d - w2), …, P(wn|d - wn).]

Leave-one-out log-likelihood:
$$l_{-1}(\mu|C) = \sum_{i=1}^{N} \sum_{w \in V} c(w,d_i)\,\log\left(\frac{c(w,d_i) - 1 + \mu\,p(w|C)}{|d_i| - 1 + \mu}\right)$$

Maximum likelihood estimator (computed with Newton’s method):
$$\hat{\mu} = \arg\max_{\mu}\ l_{-1}(\mu|C)$$
Why would “leave-one-out” work?

20 words by author1:
abc abc ab c d d abc cd d d
abd ab ab ab ab cd d e cd e

20 words by author2:
abc abc ab c d d abe cb e f
acf fb ef aff abef cdc db ge f s

Suppose we keep sampling and get 10 more words. Which author is likely to “write” more new words?

Now, suppose we leave “e” out…

$$p_{ml}(\text{“e”}|author1) = \frac{1}{19}, \qquad p_{smooth}(\text{“e”}|author1) = \frac{20}{20+\mu}\cdot\frac{1}{19} + \frac{\mu}{20+\mu}\,p(\text{“e”}|REF)$$
μ doesn’t have to be big

$$p_{ml}(\text{“e”}|author2) = \frac{0}{19}, \qquad p_{smooth}(\text{“e”}|author2) = \frac{20}{20+\mu}\cdot\frac{0}{19} + \frac{\mu}{20+\mu}\,p(\text{“e”}|REF)$$
μ must be big! (more smoothing)

The amount of smoothing is closely related to the underlying vocabulary size
Estimating λ using Mixture Model [Zhai & Lafferty 02]

[Figure: Stage-1 gives each document d1, …, dN a smoothed model P(w|d1), …, P(w|dN). Stage-2 mixes each with the user background model, (1-λ)p(w|d1) + λp(w|U), …, (1-λ)p(w|dN) + λp(w|U), to generate the query Q = q1…qm. λ is set by a maximum likelihood estimator computed with the Expectation-Maximization (EM) algorithm.]

The document models use the $\hat{\mu}$ estimated in stage-1:
$$p(q_j|d_i) = \frac{c(q_j,d_i) + \hat{\mu}\,p(q_j|C)}{|d_i| + \hat{\mu}}$$
Automatic 2-stage results ≈ Optimal 1-stage results [Zhai & Lafferty 02]

Collection   Query   Optimal-JM   Optimal-Dir   Auto-2stage
AP88-89      SK      20.3%        23.0%         22.2%*
AP88-89      LK      36.8%        37.6%         37.4%
AP88-89      SV      18.8%        20.9%         20.4%
AP88-89      LV      28.8%        29.8%         29.2%
WSJ87-92     SK      19.4%        22.3%         21.8%*
WSJ87-92     LK      34.8%        35.3%         35.8%
WSJ87-92     SV      17.2%        19.6%         19.9%
WSJ87-92     LV      27.7%        28.2%         28.8%*
ZIFF1-2      SK      17.9%        21.5%         20.0%
ZIFF1-2      LK      32.6%        32.6%         32.2%
ZIFF1-2      SV      15.6%        18.5%         18.1%
ZIFF1-2      LV      26.7%        27.9%         27.9%*

Average precision (3 DB’s + 4 query types, 150 topics)
* Indicates significant difference

Completely automatic tuning of parameters IS POSSIBLE!
Variants of the Basic LM Approach
• Different smoothing strategies
  – Hidden Markov Models (essentially linear interpolation) [Miller et al. 99]
  – Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99]
  – Performance tends to be similar to the basic LM approach
  – Many other possibilities for smoothing [Chen & Goodman 98]
• Different priors
  – Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02]
  – Time as prior [Li & Croft 03]
  – PageRank as prior [Kurland & Lee 05]
• Passage retrieval [Liu & Croft 02]
Review of LM for IR:
Part 2. Advanced Language Modeling Approaches
Improving the Basic LM Approach
• Capturing limited dependencies
  – Bigrams/Trigrams [Song & Croft 99]; Grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04]
  – Generally insignificant improvement as compared with other extensions such as feedback
• Full Bayesian query likelihood [Zaragoza et al. 03]
  – Performance similar to the basic LM approach
• Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05]
  – Addresses polysemy and synonyms; improves over the basic LM methods, but computationally expensive
• Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06]
  – Improves over the basic LM, but computationally expensive
• Parsimonious LMs [Hiemstra et al. 04]
  – Using a mixture model to “factor out” non-discriminative words
Translation Models
• Directly modeling the “translation” relationship between words in the query and words in a doc
• When relevance judgments are available, (q,d) serves as data to train the translation model
• Without relevance judgments, we can use synthetic data [Berger & Lafferty 99], <title, body> [Jin et al. 02], or thesauri [Cao et al. 05]

Basic translation model:
$$p(Q|D,R) = \prod_{i=1}^{m} \sum_{w_j \in V} p_t(q_i|w_j)\,p(w_j|D)$$
where $p_t(q_i|w_j)$ is the translation model and $p(w_j|D)$ is the regular doc LM.
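
The basic translation model above can be sketched as follows. The function name, the dictionary layout `(query_word, doc_word) -> probability`, and all probability values are assumptions for illustration; how the translation table is trained is outside the scope of this sketch.

```python
import math

def translation_log_likelihood(query, p_doc, p_trans):
    """log p(Q|D) = sum_i log sum_w p_t(q_i|w) * p(w|D)."""
    total = 0.0
    for q in query:
        total += math.log(sum(p_trans.get((q, w), 0.0) * p_w
                              for w, p_w in p_doc.items()))
    return total

# Hypothetical document LM and translation table (for each document word,
# the translation probabilities sum to 1 over query words).
p_doc = {"car": 0.6, "engine": 0.4}
p_trans = {("car", "car"): 0.7, ("automobile", "car"): 0.3,
           ("engine", "engine"): 0.9, ("motor", "engine"): 0.1}

# "automobile" never occurs in the document but still gets a nonzero probability.
print(translation_log_likelihood(["automobile"], p_doc, p_trans))
```

With an identity translation table (each word translates only to itself with probability 1), the score reduces to ordinary query likelihood. The double loop over query words and document words is where the extra computational cost comes from.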
Cluster-based Smoothing/Scoring
• Cluster-based smoothing: Smooth a document LM with a cluster of similar documents [Liu & Croft 04]: improves over the basic LM, but insignificantly
• Document expansion smoothing: Smooth a document LM with the neighboring documents (essentially one cluster per document) [Tao et al. 06]: improves over the basic LM more significantly
• Cluster-based query likelihood: Similar to the translation model, but “translate” the whole document to the query through a set of clusters [Kurland & Lee 04]
$$p(Q|D,R) = \sum_{C \in Clusters} p(Q|C)\,p(C|D)$$
  – $p(Q|C)$: likelihood of Q given C
  – $p(C|D)$: how likely doc D belongs to cluster C
  – Only effective when interpolated with the basic LM scores
Feedback and Doc/Query Generation

Classic prob. model (doc generation):
$$O(R=1|Q,D) \propto \frac{P(D|Q,R=1)}{P(D|Q,R=0)}$$
Query likelihood (“language model”, query generation):
$$O(R=1|Q,D) \propto P(Q|D,R=1)$$

Parameter estimation from relevance data such as (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0):
• Rel. doc model P(D|Q,R=1) and NonRel. doc model P(D|Q,R=0): doc-based feedback
• “Rel. query” model P(Q|D,R=1): query-based feedback

Initial retrieval:
– query as rel doc vs. doc as rel query
– P(Q|D,R=1) is more accurate

Feedback:
– P(D|Q,R=1) can be improved for the current query and future doc
– P(Q|D,R=1) can also be improved, but for current doc and future query
Overview of Feedback Techniques
• Feedback as machine learning: many possibilities
  – Standard ML: Given examples of relevant (and non-relevant) documents, learn how to classify a new document as either “relevant” or “non-relevant”.
  – “Modified” ML: Given a query and examples of relevant (and non-relevant) documents, learn how to rank new documents based on relevance
  – Challenges:
    • Sparse data
    • Censored sample
    • How to deal with query?
  – Modeling noise in pseudo feedback (as semi-supervised learning)
• Feedback as query expansion: traditional IR
  – Step 1: Term selection
  – Step 2: Query expansion
  – Step 3: Query term re-weighting
• Traditional IR is still robust (Rocchio), but ML approaches can potentially be more accurate
Difficulty in Feedback with Query Likelihood
• Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99]
  – Improvement is reported, but there is a conceptual inconsistency
  – What’s an expanded query, a piece of text or a set of terms?
• Avoid expansion
  – Query term reweighting [Hiemstra 01, Hiemstra 02]
  – Translation models [Berger & Lafferty 99, Jin et al. 02]
  – Only achieving limited feedback
• Doing relevant query expansion instead [Nallapati et al. 03]
• The difficulty is due to the lack of a query/relevance model
• The difficulty can be overcome with alternative ways of using LMs for retrieval (e.g., relevance model [Lavrenko & Croft 01], query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b])
Two Alternative Ways of Using LMs
• Classic Probabilistic Model: Doc-generation as opposed to query-generation
$$O(R=1|Q,D) \propto \frac{P(D|Q,R=1)}{P(D|Q,R=0)} \approx \frac{P(D|Q,R=1)}{P(D)}$$
  – Natural for relevance feedback
  – Challenge: Estimate p(D|Q,R=1) without relevance feedback; the relevance model [Lavrenko & Croft 01] provides a good solution
• Probabilistic Distance Model: Similar to the vector-space model, but with LMs as opposed to TF-IDF weight vectors
  – A popular distance function: Kullback-Leibler (KL) divergence, covering query likelihood as a special case
$$score(Q,D) = -D(\theta_Q \,\|\, \theta_D), \quad \text{essentially} \sum_{w \in V} p(w|\theta_Q)\,\log p(w|\theta_D)$$
  – Retrieval is now to estimate query & doc models, and feedback is treated as query LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b]

Both methods outperform the basic LM significantly
Relevance Model Estimation [Lavrenko & Croft 01]
• Question: How to estimate P(D|Q,R) (or p(w|Q,R)) without relevant documents?
• Key idea:
  – Treat query as observations about p(w|Q,R)
  – Approximate the model space with document models
• Two methods for decomposing p(w,Q)
  – Independent sampling (Bayesian model averaging):
$$p(w|Q,R) = \int p(w|\theta_D)\,p(\theta_D|Q,R)\,d\theta_D \propto \int p(w|\theta_D)\,p(\theta_D|R)\,p(Q|\theta_D)\,d\theta_D$$
$$\approx \sum_{D \in C} p(w|\theta_D)\,p(\theta_D|R)\,p(Q|\theta_D) \propto \sum_{D \in C} p(w|\theta_D) \prod_{j=1}^{m} p(q_j|\theta_D)$$
  – Conditional sampling: p(w,Q) = p(w)p(Q|w)
$$p(w|Q,R=1) \propto p(w)\,p(Q|w) = p(w) \prod_{i=1}^{m} \sum_{D \in C} p(q_i|D)\,p(D|w)$$
$$p(w) = \sum_{D \in C} p(w|D)\,p(D), \qquad p(D|w) = \frac{p(w|D)\,p(D)}{p(w)}$$
    Original formula in [Lavrenko & Croft 01]:
$$p(D|w) = \frac{p(w|D)\,p(w)}{p(D)}$$
Query Model Estimation [Lafferty & Zhai 01b, Zhai & Lafferty 01b]
• Question: How to estimate a better query model than the ML estimate based on the original query?
• “Massive feedback”: Improve a query model through co-occurrence patterns learned from
  – A document-term Markov chain that outputs the query [Lafferty & Zhai 01b]
  – Thesauri, corpus [Bai et al. 05, Collins-Thompson & Callan 05]
• Model-based feedback: Improve the estimate of the query model by exploiting pseudo-relevance feedback
  – Update the query model by interpolating the original query model with a learned feedback model [Zhai & Lafferty 01b]
  – Estimate a more integrated mixture model using pseudo-feedback documents [Tao & Zhai 06]
Review of LM for IR:
Part 3. KL-divergence retrieval model and feedback
Kullback-Leibler (KL) Divergence Retrieval Model
• Unigram similarity model
$$Sim(d;q) = -D(\hat{\theta}_Q \,\|\, \hat{\theta}_D) = \sum_{w} p(w|\hat{\theta}_Q)\,\log p(w|\hat{\theta}_D) + \Big(-\sum_{w} p(w|\hat{\theta}_Q)\,\log p(w|\hat{\theta}_Q)\Big)$$
  The second term is the query entropy (ignored for ranking).
• Retrieval ≈ estimation of $\theta_Q$ and $\theta_D$
$$sim(q;d) = \sum_{w_i \in d,\ p(w_i|\hat{\theta}_Q) > 0} \Big[\,p(w_i|\hat{\theta}_Q)\,\log\frac{p_{seen}(w_i|d)}{\alpha_d\,p(w_i|C)}\Big] + \log \alpha_d$$
• Special case: $\hat{\theta}_Q$ = empirical distribution of q recovers “query-likelihood”
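
The special case on this slide can be verified numerically: with the empirical query model, the negative KL divergence differs from the length-normalized query log-likelihood only by the query entropy, which is the same for every document. The sketch below uses hypothetical, already-smoothed document models; all names and values are illustrative.

```python
import math
from collections import Counter

def neg_kl(theta_q, theta_d):
    """-D(theta_Q || theta_D), summed over words with p(w|theta_Q) > 0."""
    return sum(p * (math.log(theta_d[w]) - math.log(p))
               for w, p in theta_q.items() if p > 0)

def empirical_model(tokens):
    """Empirical (relative frequency) distribution of a token list."""
    return {w: c / len(tokens) for w, c in Counter(tokens).items()}

def query_log_likelihood(query, theta_d):
    return sum(math.log(theta_d[w]) for w in query)

query = ["data", "mining", "mining"]
theta_q = empirical_model(query)

# Hypothetical smoothed document models.
theta_d1 = {"data": 0.2, "mining": 0.3, "food": 0.5}
theta_d2 = {"data": 0.3, "mining": 0.1, "food": 0.6}

for theta_d in (theta_d1, theta_d2):
    print(neg_kl(theta_q, theta_d), query_log_likelihood(query, theta_d))
```

Both columns rank d1 above d2. The advantage of the KL form is that `theta_q` no longer has to be the empirical distribution, which is what makes feedback as query model updating possible.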
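Feedback as Model Interpolation (Rocchio for Language Models)

[Figure: query Q → query model θ_Q; document D → document model θ_D; results are ranked by D(θ_Q || θ_D). Feedback docs F = {d1, d2, …, dn} taken from the results are fed to a generative model that yields θ_F, which is interpolated into the query model.]

$$\theta_{Q'} = (1-\alpha)\,\theta_Q + \alpha\,\theta_F$$
• α = 0: no feedback, $\theta_{Q'} = \theta_Q$
• α = 1: full feedback, $\theta_{Q'} = \theta_F$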
Generative Mixture Model

[Figure: each word w in F = {d1, …, dn} is generated from one of two sources chosen with probability P(source): the background model P(w|C) with probability λ (background words), or the topic model P(w|θ) with probability 1-λ (topic words).]

$$\log p(F|\theta) = \sum_{i} \sum_{w} c(w; d_i)\,\log[(1-\lambda)\,p(w|\theta) + \lambda\,p(w|C)]$$
Maximum likelihood: $\theta_F = \arg\max_{\theta} \log p(F|\theta)$

λ = noise in feedback documents
How to Estimate θ_F?

[Figure: the observed doc(s) are generated by a mixture of two models; the figure labels the two components with weights 0.7 and 0.3.]
• Known background p(w|C): the 0.2, a 0.1, we 0.01, to 0.02, … text 0.0001, mining 0.00005, …
• Unknown query topic p(w|θ_F) = ? (“Text mining”): … text = ?, mining = ?, association = ?, word = ?, …

Suppose we know the identity of each word ... (then the ML estimator applies directly)
Can We Guess the Identity?

Identity (“hidden”) variable: $z_i \in \{1\ (\text{background}),\ 0\ (\text{topic})\}$

Word:  the  paper  presents  a  text  mining  algorithm  the  paper  ...
z_i:   1    1      1         1  0     0       0          1    0      ...

Suppose the parameters are all known, what’s a reasonable guess of z_i?
– depends on λ (why?)
– depends on p(w|C) and p(w|θ_F) (how?)

E-step:
$$p(z_i=1|w_i) = \frac{p(z_i=1)\,p(w_i|z_i=1)}{p(z_i=1)\,p(w_i|z_i=1) + p(z_i=0)\,p(w_i|z_i=0)} = \frac{\lambda\,p(w_i|C)}{\lambda\,p(w_i|C) + (1-\lambda)\,p(w_i|\theta_F)}$$

M-step:
$$p^{new}(w_i|\theta_F) = \frac{c(w_i,F)\,(1 - p^{(n)}(z_i=1|w_i))}{\sum_{w_j \in vocabulary} c(w_j,F)\,(1 - p^{(n)}(z_j=1|w_j))}$$

Initially, set p(w|θ_F) to some random value, then iterate …
An Example of EM Computation

Assume λ = 0.5

                          Iteration 1          Iteration 2          Iteration 3
Word     #    P(w|C)      P(w|θ_F)  P(z=1)     P(w|θ_F)  P(z=1)     P(w|θ_F)  P(z=1)
The      4    0.5         0.25      0.67       0.20      0.71       0.18      0.74
Paper    2    0.3         0.25      0.55       0.14      0.68       0.10      0.75
Text     4    0.1         0.25      0.29       0.44      0.19       0.50      0.17
Mining   2    0.1         0.25      0.29       0.22      0.31       0.22      0.31
Log-Likelihood            -16.96               -16.13               -16.02

Expectation-Step: augmenting data by guessing hidden variables
$$p^{(n)}(z_i=1|w_i) = \frac{\lambda\,p(w_i|C)}{\lambda\,p(w_i|C) + (1-\lambda)\,p^{(n)}(w_i|\theta_F)}$$

Maximization-Step: with the “augmented data”, estimate parameters using maximum likelihood
$$p^{(n+1)}(w_i|\theta_F) = \frac{c(w_i,F)\,(1 - p^{(n)}(z_i=1|w_i))}{\sum_{w_j \in vocabulary} c(w_j,F)\,(1 - p^{(n)}(z_j=1|w_j))}$$
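
The E-step and M-step above translate almost line for line into code. The sketch below runs them on the counts and background probabilities from the example table, with λ = 0.5 and a uniform starting point of 0.25; the function name and the `history` structure are illustrative choices, not part of the lecture.

```python
import math

def em_feedback_model(counts, p_bg, lam=0.5, iterations=3):
    """EM for the two-component mixture with a fixed background model.

    Returns the final p(w|theta_F) and, per iteration, a tuple of
    (p(w|theta_F) at the start, p(z=1|w) from the E-step, log-likelihood).
    """
    vocab = list(counts)
    p_topic = {w: 1.0 / len(vocab) for w in vocab}   # uniform start (0.25 here)
    history = []
    for _ in range(iterations):
        # E-step: probability that each word came from the background model.
        p_z1 = {w: lam * p_bg[w] / (lam * p_bg[w] + (1 - lam) * p_topic[w])
                for w in vocab}
        loglik = sum(counts[w] * math.log(lam * p_bg[w] + (1 - lam) * p_topic[w])
                     for w in vocab)
        history.append((dict(p_topic), p_z1, loglik))
        # M-step: re-estimate the topic model from counts discounted by p(z=1|w).
        weighted = {w: counts[w] * (1 - p_z1[w]) for w in vocab}
        total = sum(weighted.values())
        p_topic = {w: weighted[w] / total for w in vocab}
    return p_topic, history

counts = {"The": 4, "Paper": 2, "Text": 4, "Mining": 2}
p_bg = {"The": 0.5, "Paper": 0.3, "Text": 0.1, "Mining": 0.1}
final_model, history = em_feedback_model(counts, p_bg)

for p_topic, p_z1, loglik in history:
    print({w: round(p, 2) for w, p in p_topic.items()},
          {w: round(p, 2) for w, p in p_z1.items()},
          round(loglik, 2))
```

The log-likelihood uses the natural logarithm, which is what reproduces the values in the table, and it increases at every iteration as EM guarantees.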
Example of Feedback Query Model

TREC topic 412: “airport security”. Mixture model approach, Web database, top 10 docs.

λ = 0.9                          λ = 0.7
W                p(W|θ_F)        W           p(W|θ_F)
security         0.0558          the         0.0405
airport          0.0546          security    0.0377
beverage         0.0488          airport     0.0342
alcohol          0.0474          beverage    0.0305
bomb             0.0236          alcohol     0.0304
terrorist        0.0217          to          0.0268
author           0.0206          of          0.0241
license          0.0188          and         0.0214
bond             0.0186          author      0.0156
counter-terror   0.0173          bomb        0.0150
terror           0.0142          terrorist   0.0137
newsnet          0.0129          in          0.0135
attack           0.0124          license     0.0127
operation        0.0121          state       0.0127
headline         0.0121          by          0.0125
Model-based Feedback Improves over Simple LM [Zhai & Lafferty 01b]

Collection   Measure   Simple LM   Mixture     Improv.   Div.Min.    Improv.
AP88-89      AvgPr     0.21        0.296       +41%      0.295       +40%
AP88-89      InitPr    0.617       0.591       -4%       0.617       +0%
AP88-89      Recall    3067/4805   3888/4805   +27%      3665/4805   +19%
TREC8        AvgPr     0.256       0.282       +10%      0.269       +5%
TREC8        InitPr    0.729       0.707       -3%       0.705       -3%
TREC8        Recall    2853/4728   3160/4728   +11%      3129/4728   +10%
WEB          AvgPr     0.281       0.306       +9%       0.312       +11%
WEB          InitPr    0.742       0.732       -1%       0.728       -2%
WEB          Recall    1755/2279   1758/2279   +0%       1798/2279   +2%

Translation models, relevance models, and feedback-based query models have all been shown to improve performance significantly over the simple LMs. (Parameter tuning is necessary in many cases, but see [Tao & Zhai 06] for “parameter-free” pseudo feedback.)
What You Should Know
• The KL-divergence retrieval formula as a generalization of the query likelihood method
• How the mixture model for feedback works
• How to estimate the simple mixture model using EM
Review of LM for IR:
Part 4. Language models for special retrieval tasks
Cross-Lingual IR
• Use query in language A (e.g., English) to retrieve documents in language B (e.g., Chinese)
• Cross-lingual p(Q|D,R) [Xu et al. 01]
$$p(Q|D,R) = \prod_{i=1}^{m} \Big[\alpha\,p(q_i|REF) + (1-\alpha) \sum_{c \in V_{Chinese}} p(c|D)\,p_{trans}(q_i|c)\Big]$$
  – $q_i$ and REF are English, D is Chinese, and c is a Chinese word
  – $p_{trans}(q_i|c)$ is the translation model: estimate with a bilingual lexicon or parallel corpora
• Cross-lingual p(D|Q,R) [Lavrenko et al. 02]
$$p(c|Q,R) = \frac{p(c, q_1 \ldots q_m)}{p(q_1 \ldots q_m)}$$
  – Method 1 (estimate with parallel corpora):
$$p(c, q_1 \ldots q_m) = \sum_{(M_E, M_C) \in M} p(M_E, M_C)\,p(c|M_C) \prod_{i=1}^{m} p(q_i|M_E)$$
  – Method 2 (with a translation model):
$$p(c, q_1 \ldots q_m) = \sum_{M_C \in M} p(M_C)\,p(c|M_C) \prod_{i=1}^{m} p(q_i|M_C), \qquad p(q_i|M_C) = \sum_{c \in V_{Chinese}} p_{trans}(q_i|c)\,p(c|M_C)$$
Distributed IR
• Retrieve documents from multiple collections
• The task is generally decomposed into two subtasks: collection selection and result fusion
• Using LMs for collection selection [Xu & Croft 99, Si et al. 02]
  – Treat collection selection as “retrieving collections” as opposed to “documents”
  – Estimate each collection model by maximum likelihood estimate [Si et al. 02] or clustering [Xu & Croft 99]
• Using LMs for result fusion [Si et al. 02]
  – Assume query likelihood scoring for all collections, but on each collection, a distinct reference LM is used for smoothing
  – Adjust the biased score p(Q|D,Collection) to recover the fair score p(Q|D)
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 60
Structured Document Retrieval [Ogilvie & Callan 03]
• A document D is divided into k parts D1, D2, D3, …, Dk (e.g., Title, Abstract, Body-Part1, Body-Part2, …)
$$Q = q_1 q_2 \ldots q_m$$
$$p(Q \mid D, R=1) = \prod_{i=1}^{m} p(q_i \mid D, R=1) = \prod_{i=1}^{m} \sum_{j=1}^{k} s(D_j \mid D, R=1)\, p(q_i \mid D_j, R=1)$$
• Generating a query word: select Dj and generate a query word using Dj
• The “part selection” probability s(Dj|D,R=1) serves as the weight for Dj; it can be trained using EM
• Want to combine different parts of a document with appropriate weights
• Anchor text can be treated as a “part” of a document
• Applicable to XML retrieval
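The part-mixture scoring of [Ogilvie & Callan 03] described on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the part weights are fixed by hand here (the slide notes they can be trained with EM), the part LMs are unsmoothed, and the function name and toy probabilities are invented for illustration.

```python
import math


def structured_query_log_likelihood(query, part_lms, part_weights):
    """Return log p(Q|D,R=1) where D is a weighted mixture of its parts.

    part_lms:     list of dicts, one per part D_j, word -> p(word|D_j)
    part_weights: list of "part selection" probabilities s(D_j|D), summing to 1
    """
    if len(part_lms) != len(part_weights):
        raise ValueError("need exactly one weight per document part")
    if abs(sum(part_weights) - 1.0) > 1e-9:
        raise ValueError("part weights must sum to 1")
    log_p = 0.0
    for q in query:
        # Select a part D_j with probability s_j, then generate q from D_j.
        p_q = sum(s * lm.get(q, 0.0) for s, lm in zip(part_weights, part_lms))
        if p_q <= 0.0:
            return float("-inf")  # unsmoothed toy LMs cannot explain q
        log_p += math.log(p_q)
    return log_p


# Toy part LMs (hypothetical values): a title and a body.
title_lm = {"language": 0.5, "models": 0.5}
body_lm = {"language": 0.1, "retrieval": 0.2, "the": 0.7}
score = structured_query_log_likelihood(
    ["language", "retrieval"], [title_lm, body_lm], [0.4, 0.6]
)
```

In practice each part LM would be smoothed (for example with a collection LM) so that a single unseen query word does not zero out the whole score.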
Personalized/Context-Sensitive Search [Shen et al. 05, Tan et al. 06]
• User information and search context can be used to estimate a better query model
– Context-independent query LM:
$$\hat{\theta}_Q = \arg\max_{\theta} p(\theta \mid Query, Collection)$$
– Context-sensitive query LM:
$$\hat{\theta}_Q = \arg\max_{\theta} p(\theta \mid Query, User, SearchContext, Collection)$$
• Refinement of this model leads to specific retrieval formulas
• Simple models often end up interpolating many unigram language models based on different sources of evidence, e.g., short-term search history [Shen et al. 05] or long-term search history [Tan et al. 06]
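The remark that simple models end up interpolating several unigram language models can be made concrete with a small Python sketch. The helper name, the fixed weights, and the toy "search history" model are assumptions for illustration; the cited papers define their own ways of building and weighting the component models.

```python
def interpolate_lms(models, weights):
    """Linearly interpolate unigram LMs: p(w) = sum_k weight_k * p_k(w)."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mixed = {}
    for lm, weight in zip(models, weights):
        for word, prob in lm.items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


# Hypothetical example: an ambiguous query enriched by recent search history.
query_lm = {"jaguar": 1.0}
history_lm = {"jaguar": 0.25, "car": 0.5, "dealer": 0.25}
context_query_lm = interpolate_lms([query_lm, history_lm], [0.6, 0.4])
```

Here the interpolated query model keeps most of its mass on the original query word while shifting some toward the car-related sense suggested by the history.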
Modeling Redundancy
• Given two documents D1 and D2, decide how redundant D1 (or D2) is w.r.t. D2 (or D1)
• Redundancy of D1: “to what extent can D1 be explained by a model estimated based on D2”
• Use a unigram mixture model [Zhai 02]
$$\log p(D_1 \mid \lambda, \theta_{D_2}) = \sum_{w \in V} c(w, D_1) \log[\lambda\, p(w \mid \theta_{D_2}) + (1-\lambda)\, p(w \mid REF)]$$
$$\lambda^* = \arg\max_{\lambda} \log p(D_1 \mid \lambda, \theta_{D_2})$$
– p(w|θ_D2): LM for D2; p(w|REF): reference LM
– λ*: measure of redundancy (maximum likelihood estimator, computed with the EM algorithm)
• See [Zhang et al. 02] for a 3-component redundancy model
• Along a similar line, we could measure document similarity in an asymmetric way [Kurland & Lee 05]
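The maximum likelihood estimate of the mixing weight in this two-component mixture can be computed with a few lines of EM. The Python sketch below is an illustration under stated assumptions: both component LMs are held fixed, only the weight is re-estimated, and the function name, iteration count, and toy distributions are invented for the example.

```python
def estimate_redundancy(doc1_counts, doc2_lm, ref_lm, iterations=50, lam=0.5):
    """EM estimate of lambda, the weight on the D2 model when explaining D1.

    doc1_counts: dict, word -> c(w, D1)
    doc2_lm:     dict, word -> p(w | theta_D2)
    ref_lm:      dict, word -> p(w | REF)
    """
    total = sum(doc1_counts.values())
    for _ in range(iterations):
        expected_from_d2 = 0.0
        for word, count in doc1_counts.items():
            from_d2 = lam * doc2_lm.get(word, 0.0)
            from_ref = (1.0 - lam) * ref_lm.get(word, 0.0)
            if from_d2 + from_ref > 0.0:
                # E-step: posterior probability that this word came from D2's model
                expected_from_d2 += count * from_d2 / (from_d2 + from_ref)
        # M-step: lambda is the expected fraction of D1 generated by D2's model
        lam = expected_from_d2 / total
    return lam


# Toy models (hypothetical values).
doc2_lm = {"a": 0.5, "b": 0.5}
ref_lm = {"a": 0.1, "b": 0.1, "c": 0.8}

redundant = estimate_redundancy({"a": 5, "b": 5}, doc2_lm, ref_lm)
novel = estimate_redundancy({"c": 10}, doc2_lm, ref_lm)
mixed = estimate_redundancy({"a": 5, "c": 5}, doc2_lm, ref_lm)
```

A document made up only of words favored by D2's model drives the weight toward 1, a document made up only of words D2 never uses drives it to 0, and the mixed toy document converges to 0.375, which matches the value obtained by setting the derivative of the log-likelihood to zero.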
Predicting Query Difficulty [Cronen-Townsend et al. 02]
• Observations:
– Discriminative queries tend to be easier
– Comparison of the query model and the collection model can indicate how discriminative a query is
• Method:
– Define “query clarity” as the KL-divergence between an estimated query model or relevance model and the collection LM
$$clarity(Q) = \sum_{w} p(w \mid \theta_Q) \log \frac{p(w \mid \theta_Q)}{p(w \mid Collection)}$$
– An enriched query LM can be estimated by exploiting pseudo feedback (e.g., relevance model)
• Correlation between the clarity scores and retrieval performance is found
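The clarity score is a plain KL-divergence, so it is short to compute once a query model and a collection model are available. The Python sketch below uses the natural logarithm (the slide does not fix the base) and toy distributions invented for illustration; estimating the query model itself, e.g. from pseudo feedback, is outside the sketch.

```python
import math


def clarity(query_lm, collection_lm):
    """KL-divergence between a query LM and the collection LM."""
    score = 0.0
    for word, p_q in query_lm.items():
        if p_q == 0.0:
            continue  # zero-probability terms contribute nothing
        p_c = collection_lm.get(word, 0.0)
        if p_c <= 0.0:
            raise ValueError("collection LM must cover every query-model word")
        score += p_q * math.log(p_q / p_c)
    return score


# Toy models (hypothetical values).
collection_lm = {"the": 0.5, "of": 0.3, "jaguar": 0.1, "engine": 0.1}
focused_query_lm = {"jaguar": 0.5, "engine": 0.5}
vague_query_lm = dict(collection_lm)  # indistinguishable from the collection

focused_clarity = clarity(focused_query_lm, collection_lm)
vague_clarity = clarity(vague_query_lm, collection_lm)
```

A query model identical to the collection model gets clarity 0, while the focused toy model gets a positive score, in line with the observation that discriminative queries tend to be easier.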
Expert Finding [Balog et al. 06, Fang & Zhai 07]
• Task: Given a topic T, a list of candidates {Ci}, and a collection of support documents S={Di}, rank the candidates according to the likelihood that a candidate C is an expert on T.
• Retrieval analogy:
– Query = topic T
– Document = Candidate C
– Rank according to P(R=1|T,C)
– Similar derivations to those on slides 55-56, 64 can be made
• Candidate generation model:
$$O(R=1 \mid T, C) \overset{rank}{=} \sum_{D \in S} p(C \mid D, R=1)\, p(D \mid T, R=1)$$
• Topic generation model:
$$O(R=1 \mid T, C) \overset{rank}{=} \frac{p(C \mid R=1)}{p(C \mid R=0)} \sum_{D \in S} p(T \mid D, R=1)\, \frac{p(C \mid D, R=1)}{\sum_{D' \in S} p(C \mid D', R=1)}$$
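To make the candidate generation model concrete, the Python sketch below ranks candidates by summing, over support documents, the probability of the candidate given the document times the probability of the document given the topic. Both probability tables are assumed to be given; how they are estimated is left to the cited papers, and the function name and toy numbers are invented for illustration.

```python
def rank_candidates(p_doc_given_topic, p_cand_given_doc):
    """Score each candidate C by sum over D of p(C|D,R=1) * p(D|T,R=1).

    p_doc_given_topic: dict, document id -> p(D|T,R=1)
    p_cand_given_doc:  dict, document id -> {candidate: p(C|D,R=1)}
    Returns (candidate, score) pairs sorted by decreasing score.
    """
    scores = {}
    for doc, p_d in p_doc_given_topic.items():
        for cand, p_c in p_cand_given_doc.get(doc, {}).items():
            scores[cand] = scores.get(cand, 0.0) + p_c * p_d
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy support collection (hypothetical values).
p_doc_given_topic = {"d1": 0.7, "d2": 0.3}
p_cand_given_doc = {
    "d1": {"alice": 0.8, "bob": 0.2},
    "d2": {"alice": 0.1, "bob": 0.9},
}
ranking = rank_candidates(p_doc_given_topic, p_cand_given_doc)
```

In this toy setting the candidate who is strongly associated with the document most relevant to the topic is ranked first.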
Summary
SLMs vs. Traditional IR
• Pros:
– Statistical foundations (better parameter setting)
– More principled way of handling term weighting
– More powerful for modeling subtopics, passages, ...
– Leverage LMs developed in related areas
– Empirically as effective as well-tuned traditional models with potential for automatic parameter tuning
• Cons:
– Lack of discrimination (a common problem with generative models)
– Less robust in some cases (e.g., when queries are semi-structured)
– Computationally complex
– Empirically, performance appears to be inferior to well-tuned full-fledged traditional methods (at least, no evidence for beating them)
What We Have Achieved So Far
• Framework and justification for using LMs for IR
• Several effective models are developed
– Basic LM with Dirichlet prior smoothing is a reasonable baseline
– Basic LM with informative priors often improves performance
– Translation model handles polysemy & synonyms
– Relevance model incorporates LMs into the classic probabilistic IR model
– KL-divergence model ties feedback with query model estimation
– Mixture models can model redundancy and subtopics
• Completely automatic tuning of parameters is possible
• LMs can be applied to virtually any retrieval task with great potential for modeling complex IR problems
Challenges and Future Directions
• Challenge 1: Establish a robust and effective LM that
– Optimizes retrieval parameters automatically
– Performs as well as or better than well-tuned traditional retrieval methods with pseudo feedback
– Is as efficient as traditional retrieval methods
• Challenge 2: Demonstrate consistent and substantial improvement by going beyond unigram LMs
– Model limited dependency between terms
– Derive more principled weighting methods for phrases
Can LMs consistently (convincingly) outperform traditional methods without sacrificing efficiency?
Can we do much better by going beyond unigram LMs?
Challenges and Future Directions (cont.)
• Challenge 3: Develop LMs that can support “life-time learning”
– Develop LMs that can improve accuracy for a current query through learning from past relevance judgments
– Support collaborative information retrieval
• Challenge 4: Develop LMs that can model document structures and subtopics
– Recognize query-specific boundaries of relevant passages
– Passage-based/subtopic-based feedback
– Combine different structural components of a document
How can we learn effectively from past relevance judgments?
How can we break the document unit in a principled way?
Challenges and Future Directions (cont.)
• Challenge 5: Develop LMs to support personalized search
– Infer and track a user’s interests with LMs
– Incorporate user’s preferences and search context in retrieval
– Customize/organize search results according to user’s interests
• Challenge 6: Generalize LMs to handle relational data
– Develop LMs for semi-structured data (e.g., XML)
– Develop LMs to handle structured queries
– Develop LMs for keyword search in relational databases
How can we exploit user information and search context to improve search?
What role can LMs play when combining text with relational data?
Challenges and Future Directions (cont.)
• Challenge 7: Develop LMs for hypertext retrieval
– Combine LMs with link information
– Modeling and exploiting anchor text
– Develop a unified LM for hypertext search
• Challenge 8: Develop LMs for retrieval with complex information needs, e.g.,
– Subtopic retrieval
– Readability constrained retrieval
– Entity retrieval (e.g., expert search)
How can we exploit LMs to develop models for complex retrieval tasks?
How can we develop an effective unified retrieval model for Web search?
What You Should Know
• General picture of language models for IR
• The KL-divergence retrieval formula as a generalization of the query likelihood method
• How the mixture model for feedback works
• Know how to estimate the simple mixture model using EM
References
[Agichtein & Cucerzan 05] E. Agichtein and S. Cucerzan, Predicting accuracy of extracting information from unstructured text collections, Proceedings of ACM CIKM 2005, pages 413-420.
[Baeza-Yates & Ribeiro-Neto 99] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
[Bai et al. 05] Jing Bai, Dawei Song, Peter Bruza, Jian-Yun Nie, Guihong Cao, Query expansion using term relationships in language models for information retrieval, Proceedings of ACM CIKM 2005, pages 688-695.
[Balog et al. 06] K. Balog, L. Azzopardi, M. de Rijke, Formal models for expert finding in enterprise corpora, Proceedings of ACM SIGIR 2006, pages 43-50.
[Berger & Lafferty 99] A. Berger and J. Lafferty. Information retrieval as statistical translation. Proceedings of the ACM SIGIR 1999, pages 222-229.
[Berger 01] A. Berger. Statistical machine learning for information retrieval. Ph.D. dissertation, Carnegie Mellon University, 2001.
[Blei et al. 02] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[Cao et al. 05] Guihong Cao, Jian-Yun Nie, Jing Bai, Integrating word relationships into language models, Proceedings of ACM SIGIR 2005, pages 298-305.
[Carbonell & Goldstein 98] J. Carbonell and J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR'98, pages 335-336.
[Chen & Goodman 98] S. F. Chen and J. T. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.
[Collins-Thompson & Callan 05] K. Collins-Thompson and J. Callan, Query expansion using random walk models, Proceedings of ACM CIKM 2005, pages 704-711.
[Cronen-Townsend et al. 02] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. Predicting query performance. In Proceedings of the ACM Conference on Research in Information Retrieval (SIGIR), 2002.
[Croft & Lafferty 03] W. B. Croft and J. Lafferty (eds), Language Modeling and Information Retrieval. Kluwer Academic Publishers. 2003.
[Fang et al. 04] H. Fang, T. Tao and C. Zhai, A formal study of information retrieval heuristics, Proceedings of ACM SIGIR 2004, pages 49-56.
References (cont.)
[Fang & Zhai 07] H. Fang and C. Zhai, Probabilistic models for expert finding, Proceedings of ECIR 2007.
[Fox 83] E. Fox. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD thesis, Cornell University. 1983.
[Fuhr 01] N. Fuhr. Language models and uncertain inference in information retrieval. In Proceedings of the Language Modeling and IR workshop, pages 6-11.
[Gao et al. 04] J. Gao, J. Nie, G. Wu, and G. Cao, Dependence language model for information retrieval, In Proceedings of ACM SIGIR 2004.
[Good 53] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4):237-264, 1953.
[Greiff & Morgan 03] W. Greiff and W. Morgan, Contributions of Language Modeling to the Theory and Practice of IR, In W. B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic Pub. 2003.
[Grossman & Frieder 04] D. Grossman and O. Frieder, Information Retrieval: Algorithms and Heuristics, 2nd Ed, Springer, 2004.
[He & Ounis 05] Ben He and Iadh Ounis, A study of the Dirichlet priors for term frequency normalisation, Proceedings of ACM SIGIR 2005, pages 465-471.
[Hiemstra & Kraaij 99] D. Hiemstra and W. Kraaij, Twenty-One at TREC-7: Ad-hoc and Cross-language track, In Proceedings of the Seventh Text REtrieval Conference (TREC-7), 1999.
[Hiemstra 01] D. Hiemstra. Using Language Models for Information Retrieval. PhD dissertation, University of Twente, Enschede, The Netherlands, January 2001.
[Hiemstra 02] D. Hiemstra. Term-specific smoothing for the language modeling approach to information retrieval: the importance of a query term. In Proceedings of ACM SIGIR 2002, pages 35-41.
[Hiemstra et al. 04] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval, In Proceedings of ACM SIGIR 2004.
[Hofmann 99] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM-SIGIR 1999, pages 50-57.
[Jarvelin & Kekalainen 02] K. Jarvelin and J. Kekalainen, Cumulated gain-based evaluation of IR techniques, ACM TOIS, Vol. 20, No. 4, 422-446, 2002.
[Jelinek 98] F. Jelinek, Statistical Methods for Speech Recognition, Cambridge: MIT Press, 1998.
[Jelinek & Mercer 80] F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. 1980. Amsterdam, North-Holland.
References (cont.)
[Jeon et al. 03] J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval using Cross-media Relevance Models, In Proceedings of ACM SIGIR 2003.
[Jin et al. 02] R. Jin, A. Hauptmann, and C. Zhai, Title language models for information retrieval, In Proceedings of ACM SIGIR 2002.
[Kalt 96] T. Kalt. A new probabilistic model of text classification and retrieval. University of Massachusetts Technical report TR98-18, 1996.
[Katz 87] S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, volume ASSP-35:400-401.
[Kraaij et al. 02] W. Kraaij, T. Westerveld, D. Hiemstra: The Importance of Prior Probabilities for Entry Page Search. Proceedings of SIGIR 2002, pp. 27-34.
[Kraaij 04] W. Kraaij. Variations on Language Modeling for Information Retrieval, Ph.D. thesis, University of Twente, 2004.
[Kurland & Lee 04] O. Kurland and L. Lee. Corpus structure, language models, and ad hoc information retrieval. In Proceedings of ACM SIGIR 2004.
[Kurland et al. 05] Oren Kurland, Lillian Lee, Carmel Domshlak, Better than the real thing?: iterative pseudo-query processing using cluster-based language models, Proceedings of ACM SIGIR 2005, pages 19-26.
[Kurland & Lee 05] Oren Kurland and Lillian Lee, PageRank without hyperlinks: structural re-ranking using links induced by language models, Proceedings of ACM SIGIR 2005, pages 306-313.
[Lafferty & Zhai 01a] J. Lafferty and C. Zhai, Probabilistic IR models based on query and document generation. In Proceedings of the Language Modeling and IR workshop, pages 1-5.
[Lafferty & Zhai 01b] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the ACM SIGIR 2001, pages 111-119.
[Lavrenko & Croft 01] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proceedings of the ACM SIGIR 2001, pages 120-127.
[Lavrenko et al. 02] V. Lavrenko, M. Choquette, and W. Croft. Cross-lingual relevance models. In Proceedings of SIGIR 2002, pages 175-182.
[Lavrenko 04] V. Lavrenko, A generative theory of relevance. Ph.D. thesis, University of Massachusetts. 2004.
[Li & Croft 03] X. Li and W. B. Croft, Time-Based Language Models, In Proceedings of CIKM'03, 2003.
[Liu & Croft 02] X. Liu and W. B. Croft. Passage retrieval based on language models. In Proceedings of CIKM 2002, pages 15-19.
References (cont.)
[Liu & Croft 04] X. Liu and W. B. Croft. Cluster-based retrieval using language models. In Proceedings of ACM SIGIR 2004.
[MacKay & Peto 95] D. MacKay and L. Peto. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):289-307.
[Maron & Kuhns 60] M. E. Maron and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7:216-244.
[McCallum & Nigam 98] A. McCallum and K. Nigam (1998). A comparison of event models for Naïve Bayes text classification. In AAAI-1998 Learning for Text Categorization Workshop, pages 41-48.
[Miller et al. 99] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In Proceedings of ACM-SIGIR 1999, pages 214-221.
[Minka & Lafferty 03] T. Minka and J. Lafferty, Expectation-propagation for the generative aspect model, In Proceedings of the UAI 2002, pages 352-359.
[Nallapati & Allan 02] Ramesh Nallapati and James Allan, Capturing term dependencies using a language model based on sentence trees. In Proceedings of CIKM 2002, pages 383-390.
[Nallapati et al. 03] R. Nallapati, W. B. Croft, and J. Allan, Relevant query feedback in statistical language modeling, In Proceedings of CIKM 2003.
[Ney et al. 94] H. Ney, U. Essen, and R. Kneser. On Structuring Probabilistic Dependencies in Stochastic Language Modeling. Comput. Speech and Lang., 8(1), 1-28.
[Ng 00] K. Ng. A maximum likelihood ratio information retrieval model. In Voorhees, E. and Harman, D., editors, Proceedings of the Eighth Text REtrieval Conference (TREC-8), pages 483-492. 2000.
[Ogilvie & Callan 03] P. Ogilvie and J. Callan. Combining Document Representations for Known Item Search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003), pp. 143-150.
[Ponte & Croft 98] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of ACM-SIGIR 1998, pages 275-281.
[Ponte 98] J. M. Ponte. A language modeling approach to information retrieval. PhD dissertation, University of Massachusetts, Amherst, MA, September 1998.
References (cont.)
[Ponte 01] J. Ponte. Is information retrieval anything more than smoothing? In Proceedings of the Workshop on Language Modeling and Information Retrieval, pages 37-41, 2001.
[Robertson & Sparck Jones 76] S. Robertson and K. Sparck Jones. (1976). Relevance Weighting of Search Terms. JASIS, 27, 129-146.
[Robertson 77] S. E. Robertson. The probability ranking principle in IR. Journal of Documentation, 33:294-304, 1977.
[Robertson & Walker 94] S. E. Robertson and S. Walker, Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. Proceedings of ACM SIGIR 1994, pages 232-241.
[Rosenfeld 00] R. Rosenfeld, Two decades of statistical language modeling: where do we go from here? In Proceedings of IEEE, volume 88.
[Salton et al. 75] G. Salton, A. Wong and C. S. Yang, A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620.
[Salton & Buckley 88] G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management, 24(5), 513-523. 1988.
[Shannon 48] C. E. Shannon (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.
[Shen et al. 05] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval with implicit feedback. In Proceedings of ACM SIGIR 2005.
[Si et al. 02] L. Si, R. Jin, J. Callan and P. Ogilvie. A Language Model Framework for Resource Selection and Results Merging. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM). 2002.
[Singhal et al. 96] A. Singhal, C. Buckley, and M. Mitra, Pivoted document length normalization, Proceedings of ACM SIGIR 1996.
[Singhal 01] A. Singhal, Modern Information Retrieval: A Brief Overview. In IEEE Data Engineering Bulletin 24(4), pages 35-43, 2001.
[Song & Croft 99] F. Song and W. B. Croft. A general language model for information retrieval. In Proceedings of Eighth International Conference on Information and Knowledge Management (CIKM 1999).
[Sparck Jones 72] K. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 11-21, 1972 and 60, 493-502, 2004.
References (cont.)
[Sparck Jones et al. 00] K. Sparck Jones, S. Walker, and S. E. Robertson, A probabilistic model of information retrieval: development and comparative experiments - part 1 and part 2. Information Processing and Management, 36(6):779-808 and 809-840.
[Sparck Jones et al. 03] K. Sparck Jones, S. Robertson, D. Hiemstra, H. Zaragoza, Language Modeling and Relevance, In W. B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic Pub. 2003.
[Srikanth & Srihari 03] M. Srikanth, R. K. Srihari. Exploiting Syntactic Structure of Queries in a Language Modeling Approach to IR. In Proceedings of Conference on Information and Knowledge Management (CIKM'03).
[Srikanth 04] M. Srikanth. Exploiting query features in language modeling approach for information retrieval. Ph.D. dissertation, State University of New York at Buffalo, 2004.
[Tan et al. 06] Bin Tan, Xuehua Shen, and ChengXiang Zhai, Mining long-term search history to improve search accuracy, Proceedings of ACM KDD 2006.
[Tao et al. 06] Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai, Language model information retrieval with document expansion, Proceedings of HLT/NAACL 2006.
[Tao & Zhai 06] Tao Tao and ChengXiang Zhai, Regularized estimation of mixture models for robust pseudo-relevance feedback. Proceedings of ACM SIGIR 2006.
[Turtle & Croft 91] H. Turtle and W. B. Croft, Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187-222.
[van Rijsbergen 86] C. J. van Rijsbergen. A non-classical logic for information retrieval. The Computer Journal, 29(6).
[Witten et al. 99] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes - Compressing and Indexing Documents and Images. Academic Press, San Diego, 2nd edition, 1999.
[Wong & Yao 89] S. K. M. Wong and Y. Y. Yao, A probability distribution model for information retrieval. Information Processing and Management, 25(1):39-53.
[Wong & Yao 95] S. K. M. Wong and Y. Y. Yao. On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1):69-99.
[Xu & Croft 99] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In Proceedings of the ACM SIGIR 1999, pages 15-19.
[Xu et al. 01] J. Xu, R. Weischedel, and C. Nguyen. Evaluating a probabilistic model for cross-lingual information retrieval. In Proceedings of the ACM-SIGIR 2001, pages 105-110.
References (cont.)
[Zaragoza et al. 03] Hugo Zaragoza, D. Hiemstra and M. Tipping, Bayesian extension to the language model for ad hoc information retrieval. In Proceedings of SIGIR 2003: 4-9.
[Zhai & Lafferty 01a] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the ACM-SIGIR 2001, pages 334-342.
[Zhai & Lafferty 01b] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval, In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM 2001).
[Zhai & Lafferty 02] C. Zhai and J. Lafferty. Two-stage language models for information retrieval . In Proceedings of the ACM-SIGIR 2002, pages 49-56.
[Zhai et al. 03] C. Zhai, W. Cohen, and J. Lafferty, Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval, In Proceedings of ACM SIGIR 2003.
[Zhai & Lafferty 06] C. Zhai and J. Lafferty, A risk minimization framework for information retrieval, Information Processing and Management, 42(1), Jan. 2006, pages 31-55.
[Zhai 02] C. Zhai, Language Modeling and Risk Minimization in Text Retrieval, Ph.D. thesis, Carnegie Mellon University, 2002.
[Zhang et al. 02] Y. Zhang, J. Callan, and Thomas P. Minka, Novelty and redundancy detection in adaptive filtering. In Proceedings of SIGIR 2002, pages 81-88.
Roadmap
• This lecture: systematic review of language models for IR
• Next lecture: formal retrieval frameworks