USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich
Oct. 28, 2010

Page 1

USING THE PAST TO SCORE THE PRESENT: EXTENDING TERM WEIGHTING MODELS WITH REVISION HISTORY ANALYSIS

Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Oct. 28, 2010

Page 2

Revisions of “Topology” on Wikipedia

1st revision:

250th revision:

Current revision:

Page 3

Observable Document Generation Process

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.

In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms

95th revision (#i-1) and 96th revision (#i)

Page 4

How Revision History Analysis Could Help Retrieval

Revision History Analysis

Page 5

Selected Prior Work

• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010.

• M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010.

• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In CIKM, New York, NY, USA, 2009.

Page 6

Revision History Analysis (RHA)

RHA redefines term frequency (TF):
- TF is a key indicator of document relevance
- TF can be naturally integrated into ranking models

BM25:

$$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Language Model:

$$S(Q, D) = D \ldots$$
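The BM25 formula on this slide can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `bm25_score`, the defaults `k1=1.2` and `b=0.75`, and the toy IDF values are assumptions made for the example. Term frequencies are passed in as a plain mapping, so any definition of TF can be supplied.

```python
from typing import Mapping, Sequence


def bm25_score(
    query: Sequence[str],
    tf: Mapping[str, float],
    idf: Mapping[str, float],
    doc_len: float,
    avgdl: float,
    k1: float = 1.2,  # assumed default, not taken from the slides
    b: float = 0.75,  # assumed default, not taken from the slides
) -> float:
    """Sum the BM25 contribution of every query term for one document."""
    # Length normalization term: k1 * (1 - b + b * |D| / avgdl)
    norm = k1 * (1.0 - b + b * doc_len / avgdl)
    score = 0.0
    for t in query:
        f = tf.get(t, 0.0)
        if f == 0.0:
            continue  # a term absent from the document contributes nothing
        score += idf.get(t, 0.0) * f * (k1 + 1.0) / (f + norm)
    return score


# Toy document of average length that contains "topology" three times.
example = bm25_score(
    query=["topology", "space"],
    tf={"topology": 3.0},
    idf={"topology": 2.0, "space": 1.0},
    doc_len=100.0,
    avgdl=100.0,
)
```

Because the document has average length, the normalization term reduces to `k1`, and only "topology" contributes to `example`.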

Page 7

Model 1: Steady growth

Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example … basic examples include compactness and connectedness

Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.

First revision

Current version

Page 8

Model 1 (continued)

Page 9

RHA Global Model: definition

Define the term frequency over the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions.

$$TF_{global}(t, d) = \sum_{j=1}^{n} \frac{c(t, v_j)}{j^{\alpha}}$$

c(t, v_j): frequency of term t in revision v_j

j^α: decay factor
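A small sketch of the global model's sum, assuming the per-revision counts c(t, v_j) are already available as a list ordered from the oldest revision to the newest. The function name `tf_global` and the sample counts are hypothetical.

```python
from typing import Sequence


def tf_global(term_counts: Sequence[float], alpha: float) -> float:
    """TF_global(t, d) = sum over j = 1..n of c(t, v_j) / j**alpha.

    term_counts[j - 1] holds c(t, v_j), the frequency of the term in the
    j-th revision, with revisions ordered from oldest to newest.
    """
    return sum(c / (j ** alpha) for j, c in enumerate(term_counts, start=1))


# Hypothetical counts of one term across four revisions.
counts = [2, 2, 3, 4]
score = tf_global(counts, alpha=1.0)  # 2/1 + 2/2 + 3/3 + 4/4
```

With a positive `alpha`, occurrences in early revisions keep most of their weight while later ones are discounted. With `alpha=0` the sum reduces to the plain total of the counts.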

Page 10

But… Some pages are different: “Avatar (2009 film)”

1st revision:

500th revision:

Current revision:

Page 11

Model 2: Bursty Growth

Burst of Document (Length) & Change of Term Frequency

Time        TF “Pandora”   TF “James Cameron”   Document Length
Nov. 2009   9              23                   2576
Dec. 2009   25             50                   6306

Burst of Edit Activity & Associated Events

Month (2009)    Jul.   Aug.   Sep.   Oct.   Nov.   Dec.
Edit Activity   89     224    67     154    232    1892

Events: First photo & trailer released; Movie released

Global Model might be insufficient

Page 12

RHA Burst Model: Definition

• A burst resets the decay clock for a term.
• The weight will decrease after a burst.

$$TF_{burst}(t, d) = \sum_{j=1}^{m} \sum_{k=b_j}^{n} \frac{c(t, v_k)}{(k - b_j + 1)^{\beta}}$$

c(t, v_k): frequency of term t in revision v_k

(k − b_j + 1)^β: decay factor for the j-th burst
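The burst model's double sum can be sketched as below. The sketch reads b_j as the 1-based index of the revision at which the j-th burst starts; that reading, the function name `tf_burst`, and the sample data are assumptions rather than details stated on the slide.

```python
from typing import Sequence


def tf_burst(
    term_counts: Sequence[float],
    burst_starts: Sequence[int],
    beta: float,
) -> float:
    """TF_burst(t, d) = sum_j sum_{k=b_j..n} c(t, v_k) / (k - b_j + 1)**beta.

    term_counts[k - 1] holds c(t, v_k); burst_starts holds the 1-based
    revision indices b_1..b_m (assumed to mark where each burst begins).
    """
    n = len(term_counts)
    total = 0.0
    for b_j in burst_starts:
        for k in range(b_j, n + 1):
            # The decay clock restarts at 1 on the first revision of the burst.
            total += term_counts[k - 1] / ((k - b_j + 1) ** beta)
    return total


# Hypothetical counts: the term becomes frequent from revision 3 onward.
counts = [1, 1, 4, 4]
score = tf_burst(counts, burst_starts=[3], beta=1.0)  # 4/1 + 4/2
```

Revisions before the burst do not contribute to that burst's term, and the first revision of a burst is weighted at full strength, which matches the "resets the decay clock" bullet.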

Page 13

Burst Detection (1): Content-based

Relative content change → potential burst

Content-based Burst for “Avatar”
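The slide only states that a large relative content change signals a potential burst. One way to sketch that idea is to compare the document length of consecutive snapshots and flag changes above a threshold. The length-based measure, the function name `content_bursts`, and the threshold of 0.5 are all assumptions for illustration, not the detector used in the talk.

```python
from typing import List, Sequence


def content_bursts(lengths: Sequence[int], threshold: float = 0.5) -> List[int]:
    """Return 0-based indices of snapshots whose length changed by more than
    `threshold` (as a fraction) relative to the previous snapshot."""
    bursts = []
    for i in range(1, len(lengths)):
        prev = lengths[i - 1]
        if prev == 0:
            continue  # relative change is undefined for an empty snapshot
        if abs(lengths[i] - prev) / prev > threshold:
            bursts.append(i)
    return bursts


# The first two lengths are hypothetical; the last two are the Nov. and
# Dec. 2009 document lengths shown for the "Avatar" article.
lengths = [2400, 2500, 2576, 6306]
flagged = content_bursts(lengths)
```

Here only the last snapshot is flagged, since the jump from 2576 to 6306 is a relative change of about 1.45.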

Page 14

Burst Detection (2): Activity Based

Intensive edit activity → potential bursts

Activity-based Burst for “Avatar”

Average revision counts

Deviation
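The labels above suggest comparing each period's edit count with the average count and its deviation. A minimal sketch under that assumption flags periods whose count exceeds the mean by more than a chosen number of standard deviations. The function name `activity_bursts`, the use of the population standard deviation, and the multiplier of 1.0 are assumptions, since the slide's exact rule did not survive.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def activity_bursts(
    edit_counts: Sequence[int],
    num_deviations: float = 1.0,  # assumed multiplier
) -> List[int]:
    """Return 0-based indices of periods whose edit count exceeds
    mean + num_deviations * (population standard deviation)."""
    if not edit_counts:
        return []
    cutoff = mean(edit_counts) + num_deviations * pstdev(edit_counts)
    return [i for i, count in enumerate(edit_counts) if count > cutoff]


# Monthly edit activity for "Avatar (2009 film)", Jul. to Dec. 2009.
monthly_edits = [89, 224, 67, 154, 232, 1892]
flagged = activity_bursts(monthly_edits)
```

On these counts the mean is 443 and the cutoff is roughly 1094, so only December (index 5) is flagged.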

Page 15

Burst Detection (3): Combined Model

Page 16

Putting it All Together: RHA Term Frequency
Combining global model and burst model

RHA Term Frequency:

$$TF_{rha}(t, D) = \lambda_1 \cdot TF_g(t, D) + \lambda_2 \cdot TF_b(t, D) + \lambda_3 \cdot TF(t, D)$$

λ1, λ2, λ3 indicate the weights of the RHA global model, the burst model, and the original term frequency (probability).

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$
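The combination is a plain linear interpolation, which a few lines of Python can make concrete. The function name `tf_rha` and the sample scores and weights are hypothetical. The check on the weights mirrors the constraint that they sum to 1.

```python
def tf_rha(
    tf_g: float,
    tf_b: float,
    tf: float,
    lam1: float,
    lam2: float,
    lam3: float,
) -> float:
    """TF_rha = lam1 * TF_g + lam2 * TF_b + lam3 * TF, with weights summing to 1."""
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("the three weights must sum to 1")
    return lam1 * tf_g + lam2 * tf_b + lam3 * tf


# Hypothetical component scores and weights, chosen only for illustration.
combined = tf_rha(tf_g=5.0, tf_b=6.0, tf=4.0, lam1=0.25, lam2=0.25, lam3=0.5)
```

Setting `lam3=1` and the other weights to 0 recovers the original term frequency, so the baseline is a special case of the combined score.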

Page 17

Integrating RHA into Retrieval Models

BM25:

$$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

+ RHA: TF(t, D) is replaced by TF_rha(t, D) in both places.

Statistical Language Models:

$$S(Q, D) = D \ldots$$

+ RHA: the term probability is replaced by P_rha(t, D).

RHA Term Probability:

$$P_{rha}(t, D) = \lambda_1 \cdot P_g(t, D) + \lambda_2 \cdot P_b(t, D) + \lambda_3 \cdot P(t, D)$$

Page 18

Experimental Setup

Page 19

Datasets

INEX: well established forum for structured retrieval tasks (based on Wikipedia collection)
TREC: performance comparison on different set of queries and general applicability

INEX 65 topics
Top 1000 retrieved articles
1000 revisions for each article
Corpus for INEX

TREC 68 topics
Top 1000 retrieved articles
1000 revisions for each article
Corpus for TREC

WikiDump

Page 20

Results

Page 21

INEX Results

Model      bpref            MAP              R-precision
BM25       0.354            0.354            0.314
BM25+RHA   0.375 (+5.93%)   0.360 (+1.69%)   0.337 (+7.32%)
LM         0.357            0.370            0.348
LM+RHA     0.372 (+4.20%)   0.378 (+2.16%)   0.359 (+3.16%)

Parameters tuned on the INEX query set

BM25: , LM: ,

Page 22

TREC Results

Model      bpref              MAP                NDCG
BM25       0.524              0.548              0.634
BM25+RHA   0.547** (+4.39%)   0.568** (+3.65%)   0.656** (+3.47%)
LM         0.527              0.556              0.645
LM+RHA     0.532 (+0.95%)     0.567 (+1.98%)     0.653 (+1.24%)

Parameters tuned on the INEX query set. ** indicates statistically significant differences at the 0.01 significance level with a two-tailed paired t-test.

Lab members manually labeled top 20 results for each topic

BM25: , LM: ,

Page 23

Performance Analysis

Performance Improvements on bpref for BM25+RHA over baseline (BM25)

INEX: significant improvement on 40% of queries
TREC: significant improvement on 37% of queries
Ex: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)

Panels: INEX, TREC

Page 24

Summary

o RHA captures importance signal from document authoring process.
o Introduced RHA term weighting approach
o Natural integration with state of the art retrieval models.
o Consistent improvement over baseline retrieval models

Page 25

Thank you!

Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis

Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Research partially supported by:

Page 26

Query Sets and Evaluation Metrics

• Queries and Labels:
  – INEX: provided
  – TREC: subset of ad-hoc track

• Metrics:
  – Bpref (robust to missing judgments)
  – MAP: mean average precision
  – R-prec: precision at position R
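Two of the metrics listed above are simple enough to sketch directly. The code below computes average precision for a single query (MAP is its mean over all queries) and R-precision, using their standard definitions. The function names and the toy ranking are hypothetical, the ranking is assumed to contain no duplicate documents, and bpref is left out.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranking, relevant)  # (1/1 + 2/3) / 2
rp = r_precision(ranking, relevant)        # 1 of the top 2 is relevant
```

Relevant documents that never appear in the ranking still count in the denominator of `average_precision`, so missing them lowers the score.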

Page 27

RHA in Statistical Language Models

o (Global Model)

o (Burst Model)

Page 28

Cross validation on INEX

Model      bpref            MAP              R-precision
BM25       0.307            0.281            0.324
BM25+RHA   0.312 (+1.63%)   0.291 (+3.56%)   0.320 (-1.23%)
LM         0.311            0.284            0.348
LM+RHA     0.338 (+8.68%)   0.298 (+4.93%)   0.359 (+0.61%)

5-fold cross validation on INEX 2008 query set

Model      bpref            MAP              R-precision
BM25       0.354            0.354            0.314
BM25+RHA   0.363 (+2.54%)   0.348 (-1.70%)   0.333 (+6.05%)
LM         0.357            0.370            0.348
LM+RHA     0.366 (+2.52%)   0.375 (+1.35%)   0.352 (+1.15%)

5-fold cross validation on INEX 2009 query set