Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

1

USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS

Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Oct. 28, 2010

2

Revisions of “Topology” on Wikipedia

1st revision:

250th revision:

Current revision:

3

Observable Document Generation Process

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms

95th revision (#i-1), 96th revision (#i)

4

How Revision History Analysis Could Help Retrieval

Revision History Analysis

5

Selected Prior Work

• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010.

• M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010.

• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In CIKM, New York, NY, USA, 2009.

6

Revision History Analysis (RHA)

RHA redefines term frequency (TF):
– TF is a key indicator of document relevance
– TF can be naturally integrated into ranking models

BM25:

$$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Language Model:

$$S(Q, D) = D(Q \,\|\, D)$$

7

Model 1: Steady growth

Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example…..basic examples include compactness and connectedness

Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.

First revision

Current version

8

Model 1 (continued)

9

RHA Global Model: definition

Define the term frequency over the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions.

$$TF_{global}(t, d) = \sum_{j=1}^{n} \frac{c(t, v_j)}{j^{\alpha}}$$

where $c(t, v_j)$ is the frequency of term $t$ in revision $v_j$ and $j^{\alpha}$ is the decay factor.
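The global model above can be sketched in a few lines of Python. This is only an illustration: the function name `tf_global`, the token-list representation of revisions, and the default value of `alpha` are assumptions, not taken from the slides.

```python
from typing import Sequence


def tf_global(term: str, revisions: Sequence[Sequence[str]], alpha: float = 1.1) -> float:
    """Decay-weighted term frequency summed over all revisions.

    revisions[0] is the first revision v_1; each revision is a list of tokens.
    alpha is the decay exponent (the default is an arbitrary placeholder).
    """
    total = 0.0
    for j, tokens in enumerate(revisions, start=1):
        count = tokens.count(term)      # c(t, v_j)
        total += count / (j ** alpha)   # divide by the decay factor j^alpha
    return total


# Toy revision history (invented for illustration).
revisions = [
    ["topology", "is", "a", "branch", "of", "mathematics"],
    ["topology", "studies", "topological", "spaces"],
    ["topology", "studies", "spaces", "and", "homeomorphisms"],
]
print(tf_global("topology", revisions, alpha=1.0))        # 1/1 + 1/2 + 1/3
print(tf_global("homeomorphisms", revisions, alpha=1.0))  # 0 + 0 + 1/3
```

A term present from the first revision accumulates far more weight than one introduced late, which is the behaviour the steady-growth assumption calls for.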

10

But… Some pages are different: “Avatar (2009 film)”

1st revision:

500th revision:

Current revision:

11

Model 2: Bursty Growth

Burst of Document (Length) & Change of Term Frequency

Time | Term Frequency: “Pandora” | Term Frequency: “James Cameron” | Document Length
Nov. 2009 | 9 | 23 | 2576
Dec. 2009 | 25 | 50 | 6306

Burst of Edit Activity & Associated Events

Month (2009) | Jul. | Aug. | Sep. | Oct. | Nov. | Dec.
Edit Activity | 89 | 224 | 67 | 154 | 232 | 1892

Associated events: first photo & trailer released; movie released

Global Model might be insufficient

12

RHA Burst Model: Definition

• A burst resets the decay clock for a term.
• The weight will decrease after a burst.

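$$TF_{burst}(t, d) = \sum_{j=1}^{m} \sum_{k=b_j}^{n} \frac{c(t, v_k)}{(k - b_j + 1)^{\beta}}$$

where $c(t, v_k)$ is the frequency of term $t$ in revision $v_k$ and $(k - b_j + 1)^{\beta}$ is the decay factor for the $j$th burst.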
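A minimal Python sketch of the burst model, under the assumption that each $b_j$ is the 1-based index of the revision where burst $j$ begins. The function name `tf_burst`, the token-list representation, and the default `beta` are illustrative choices, not from the slides.

```python
from typing import Sequence


def tf_burst(term: str,
             revisions: Sequence[Sequence[str]],
             burst_starts: Sequence[int],
             beta: float = 1.1) -> float:
    """Burst-aware term frequency.

    revisions[0] is v_1; burst_starts holds the 1-based indices b_1..b_m
    (assumed here to mark the revision at which each burst begins).
    """
    n = len(revisions)
    total = 0.0
    for b in burst_starts:                      # outer sum over bursts j = 1..m
        for k in range(b, n + 1):               # inner sum over k = b_j..n
            count = revisions[k - 1].count(term)    # c(t, v_k)
            total += count / ((k - b + 1) ** beta)  # decay restarts at each burst
    return total


# Toy history (invented): "pandora" first appears shortly before a burst at revision 3.
revisions = [
    ["avatar"],
    ["avatar", "pandora"],
    ["pandora", "pandora"],
    ["pandora"],
]
print(tf_burst("pandora", revisions, burst_starts=[3], beta=1.0))  # 2/1 + 1/2
```

Because the denominator is 1 at the burst revision itself, content added during a burst is weighted as if the document had just been created.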

13

Burst Detection (1): Content-based

Relative content change → potential burst

Content-based Burst for “Avatar”
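A minimal sketch of content-based detection, assuming a burst is flagged when the relative change in document length between consecutive revisions exceeds a threshold. The function name, the use of length as the content measure, and the threshold value are assumptions for illustration, not the authors' definition.

```python
from typing import List, Sequence


def content_bursts(lengths: Sequence[int], threshold: float = 0.5) -> List[int]:
    """Return 1-based indices of revisions flagged as potential bursts.

    lengths[j] is the length of revision j+1. A revision is flagged when
    |len(v_j) - len(v_{j-1})| / len(v_{j-1}) exceeds the (hypothetical) threshold.
    """
    bursts = []
    for j in range(1, len(lengths)):
        prev, cur = lengths[j - 1], lengths[j]
        if prev > 0 and abs(cur - prev) / prev > threshold:
            bursts.append(j + 1)  # convert to 1-based revision index
    return bursts


# Invented revision lengths: only the jump from 120 to 300 exceeds 50%.
print(content_bursts([100, 110, 120, 300, 310]))
```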

14

Burst Detection (2): Activity Based

Intensive edit activity → potential bursts

Activity-based Burst for “Avatar”

Average revision counts

Deviation
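A minimal sketch of activity-based detection, assuming a period is flagged when its edit count exceeds the average count by more than a multiple of the deviation. The exact rule, the function name, and the `num_devs` parameter are assumptions; the monthly counts are the edit-activity figures for “Avatar” quoted earlier in the deck.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def activity_bursts(labels: Sequence[str],
                    edit_counts: Sequence[int],
                    num_devs: float = 1.0) -> List[str]:
    """Return labels of periods whose edit count exceeds mean + num_devs * deviation."""
    avg = mean(edit_counts)       # average revision count per period
    dev = pstdev(edit_counts)     # population standard deviation of the counts
    threshold = avg + num_devs * dev
    return [label for label, count in zip(labels, edit_counts) if count > threshold]


months = ["Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]
edits = [89, 224, 67, 154, 232, 1892]
print(activity_bursts(months, edits))  # only December clears this global threshold
```

With a single global threshold the December release dominates; a windowed or per-period baseline would be needed to also pick up smaller spikes such as August.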

15

Burst Detection (3): Combined Model

16

Putting it All Together: RHA Term Frequency

Combining the global model and the burst model

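RHA Term Frequency:

$$TF_{rha}(t, D) = \lambda_1 \cdot TF_g(t, D) + \lambda_2 \cdot TF_b(t, D) + \lambda_3 \cdot TF(t, D)$$

$\lambda_1$, $\lambda_2$, and $\lambda_3$ indicate the weights of the RHA global model, the burst model, and the original term frequency (probability), with

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$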
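The interpolation can be sketched as a small helper. The function name `tf_rha` and the example weights are assumptions for illustration; the slides do not give the tuned values.

```python
def tf_rha(tf_g: float, tf_b: float, tf: float,
           lam1: float, lam2: float, lam3: float) -> float:
    """Linear combination of global, burst, and original term frequency."""
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return lam1 * tf_g + lam2 * tf_b + lam3 * tf


# Hypothetical component values and weights.
print(tf_rha(tf_g=4.0, tf_b=2.0, tf=10.0, lam1=0.5, lam2=0.25, lam3=0.25))  # 5.0
```

Setting `lam3 = 1` recovers the original term frequency, so the baseline model is a special case of the combined one.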

17

Integrating RHA into Retrieval Models

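BM25 + RHA: replace $TF(t, D)$ with $TF_{rha}(t, D)$

$$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF_{rha}(t, D) \cdot (k_1 + 1)}{TF_{rha}(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Statistical Language Models + RHA: replace the term probability $P(t, D)$ with $P_{rha}(t, D)$ in

$$S(Q, D) = D(Q \,\|\, D)$$

RHA Term Probability:

$$P_{rha}(t, D) = \lambda_1 \cdot P_g(t, D) + \lambda_2 \cdot P_b(t, D) + \lambda_3 \cdot P(t, D)$$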
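A minimal sketch of the BM25 side of this integration: the scoring function takes whatever per-term frequency it is given, so passing RHA term frequencies instead of raw counts is the only change. The function name, the dictionary-based inputs, and the default `k1` and `b` values are assumptions for illustration; IDF values are supplied by the caller.

```python
from typing import Dict, Iterable


def bm25_score(query_terms: Iterable[str],
               tf: Dict[str, float],
               idf: Dict[str, float],
               doc_len: float,
               avgdl: float,
               k1: float = 1.2,
               b: float = 0.75) -> float:
    """BM25 score of one document; `tf` may hold raw or RHA term frequencies."""
    norm = k1 * (1 - b + b * doc_len / avgdl)   # document length normalization
    score = 0.0
    for t in query_terms:
        f = tf.get(t, 0.0)
        score += idf.get(t, 0.0) * f * (k1 + 1) / (f + norm)
    return score


# Hypothetical values: the same document scored with raw TF and with RHA TF.
idf = {"pandora": 2.0}
raw = bm25_score(["pandora"], {"pandora": 1.0}, idf, doc_len=100, avgdl=100, k1=1.0)
rha = bm25_score(["pandora"], {"pandora": 3.0}, idf, doc_len=100, avgdl=100, k1=1.0)
print(raw, rha)
```

Keeping the term-frequency source outside the scorer means the baseline and the RHA variant share one code path, which makes side-by-side comparison straightforward.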

18

Experimental Setup

19

Datasets

INEX: well-established forum for structured retrieval tasks (based on a Wikipedia collection)

TREC: performance comparison on a different set of queries, and general applicability

INEX: 65 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for INEX

TREC: 68 topics → top 1000 retrieved articles → 1000 revisions for each article → corpus for TREC

Source: WikiDump

20

Results

21

INEX Results

Model bpref MAP R-precision

BM25 0.354 0.354 0.314

BM25+RHA 0.375 (+5.93%) 0.360 (+1.69%) 0.337 (+7.32%)

LM 0.357 0.370 0.348

LM+RHA 0.372 (+4.20%) 0.378 (+2.16%) 0.359 (+3.16%)

Parameters tuned on the INEX query set

BM25: , LM: ,

22

TREC Results

Model bpref MAP NDCG

BM25 0.524 0.548 0.634

BM25+RHA 0.547** (+4.39%) 0.568** (+3.65%) 0.656** (+3.47%)

LM 0.527 0.556 0.645

LM+RHA 0.532 (+0.95%) 0.567 (+1.98%) 0.653 (+1.24%)

Parameters tuned on the INEX query set. ** indicates a statistically significant difference at the 0.01 significance level (two-tailed paired t-test).

Lab members manually labeled top 20 results for each topic

BM25: , LM: ,

23

Performance Analysis

Performance improvements on bpref for BM25+RHA over baseline (BM25)

INEX: significant improvement on 40% of queries

TREC: significant improvement on 37% of queries

Ex: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)

Figure panels: INEX, TREC

24

Summary

o RHA captures importance signal from document authoring process.

o Introduced RHA term weighting approach

o Natural integration with state-of-the-art retrieval models.

o Consistent improvement over baseline retrieval models

25

Thank you!

Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis

Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Research partially supported by:

26

Query Sets and Evaluation Metrics

• Queries and Labels:
– INEX: provided
– TREC: subset of ad-hoc track

• Metrics:
– Bpref (robust to missing judgments)
– MAP: mean average precision
– R-prec: precision at position R, where R is the number of relevant documents for the query
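Two of these metrics can be sketched briefly, using their standard definitions. The function names and the toy ranking are illustrative assumptions; MAP is the mean of `average_precision` over all queries, and bpref is omitted here for brevity.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank    # precision at this rank
    return total / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


# Toy ranking (invented): relevant documents at ranks 1 and 3.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2
print(r_precision(ranked, relevant))        # 1 relevant in the top 2
```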

27

RHA in Statistical Language Models

o (Global Model)

o (Burst Model)

28

Cross validation on INEX

Model bpref MAP R-precision

BM25 0.307 0.281 0.324

BM25+RHA 0.312 (+1.63%) 0.291 (+3.56%) 0.320 (-1.23%)

LM 0.311 0.284 0.348

LM+RHA 0.338 (+8.68%) 0.298 (+4.93%) 0.359 (+0.61%)

5-fold cross validation on the INEX 2008 query set

Model bpref MAP R-precision

BM25 0.354 0.354 0.314

BM25+RHA 0.363 (+2.54%) 0.348 (-1.70%) 0.333 (+6.05%)

LM 0.357 0.370 0.348

LM+RHA 0.366 (+2.52%) 0.375 (+1.35%) 0.352 (+1.15%)

5-fold cross validation on the INEX 2009 query set
