Assessing The Retrieval
A.I Lab · 2007.01.20 · 박동훈
Contents
• 4.1 Personal Assessment of Relevance
• 4.2 Extending the Dialog with RelFbk
• 4.3 Aggregated Assessment: Search Engine Performance
• 4.4 RAVE: A Relevance Assessment Vehicle
• 4.5 Summary
4.1 Personal Assessment of Relevance
• 4.1.1 Cognitive Assumptions
– Users trying to do 'object recognition'
– Comparison with respect to a prototypical document
– Reliability of user opinions?
– Relevance scale
– RelFbk is nonmetric
Relevance Scale
• Users naturally provide only preference information
• Not a (metric) measurement of how relevant a retrieved document is!
RelFbk is nonmetric
4.2 Extending the Dialog with RelFbk
RelFbk Labeling of the Retr Set
Query Session, Linked by RelFbk
4.2.1 Using RelFbk for Query Refinement
4.2.2 Document Modifications due to RelFbk
• Fig 4.7
• Change the documents!?
• Move documents to match more/less strongly the queries that successfully/unsuccessfully match them
4.3 Aggregated Assessment : Search Engine Performance
• 4.3.1 Underlying Assumptions
– RelFbk(q,di) assessments independent
– Users' opinions will all agree with a single 'omniscient' expert's
4.3.2 Consensual relevance
Consensually relevant
4.3.4 Basic Measures
• Relevant versus Retrieved Sets
Contingency table:

                 Relevant    ¬Relevant
Retrieved        RelRetr     ¬RelRetr     NRet
¬Retrieved       Rel¬Retr    ¬Rel¬Retr    NNRet
                 NRel        NNRel        NDoc

• NRel : the number of relevant documents
• NNRel : the number of irrelevant documents
• NDoc : the total number of documents
• NRet : the number of retrieved documents
• NNRet : the number of documents not retrieved
• RelRetr : the number of relevant documents retrieved
4.3.4 Basic Measures (cont)
• Precision = RelRetr / NRet
• Recall = RelRetr / NRel
4.3.4 Basic Measures (cont)
• Fallout = ¬RelRetr / NNRel
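The three basic measures follow directly from the contingency counts above. A minimal Python sketch (the function name and the set-based representation of the relevant and retrieved sets are illustrative, not from the source):

```python
def basic_measures(relevant: set, retrieved: set, ndoc: int) -> dict:
    """Compute precision, recall, and fallout from contingency counts."""
    rel_retr = len(relevant & retrieved)        # RelRetr: relevant and retrieved
    nret, nrel = len(retrieved), len(relevant)  # NRet, NRel
    nnrel = ndoc - nrel                         # NNRel: irrelevant documents
    return {
        "precision": rel_retr / nret if nret else 0.0,  # RelRetr / NRet
        "recall": rel_retr / nrel if nrel else 0.0,     # RelRetr / NRel
        "fallout": (nret - rel_retr) / nnrel if nnrel else 0.0,  # ¬RelRetr / NNRel
    }

m = basic_measures(relevant={1, 2, 3, 4}, retrieved={3, 4, 5}, ndoc=10)
print(m)  # precision 2/3, recall 2/4 = 0.5, fallout 1/6
```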
4.3.5 Ordering the Retr set
• Each document assigned hitlist rank Rank(di)
• Descending Match(q,di)
• Rank(di) < Rank(dj) ⇔ Match(q,di) > Match(q,dj)
– Rank(di) < Rank(dj) ⇔ Pr(Rel(di)) > Pr(Rel(dj))
• Coordination level: document's rank in Retr
– Number of keywords shared by document and query
• Goal: Probability Ranking Principle
• A tale of two retrievals: Query 1 vs. Query 2
[Figure: recall/precision curves for Query 1 and Query 2]
[Figure: retrieval envelope]
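A recall/precision curve is traced by walking down the ranked hitlist and recording (recall, precision) at each rank. A sketch in Python (function name and example document labels are illustrative):

```python
def recall_precision_curve(ranked: list, relevant: set) -> list:
    """Walk down the hitlist; emit a (recall, precision) point at each rank."""
    points, rel_seen = [], 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            rel_seen += 1
        points.append((rel_seen / len(relevant), rel_seen / k))
    return points

# Two retrievals of the same 3 relevant documents, ranked differently:
q1 = recall_precision_curve(["a", "b", "x", "c"], {"a", "b", "c"})
q2 = recall_precision_curve(["x", "a", "b", "c"], {"a", "b", "c"})
```

Plotting both curves on the same axes gives the "tale of two retrievals" comparison: the curve that stays higher and further right dominates.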
4.3.6 Normalized recall
ri : hitlist rank of the i-th relevant document
• Worst and best possible orderings of the relevant documents bound normalized recall between 0 and 1
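Normalized recall compares the actual ranks of the relevant documents against the best case (all relevant documents at the top of the hitlist) and the worst case (all at the bottom). A sketch using the standard formula R_norm = 1 − (Σ ri − Σ i) / (NRel · (NDoc − NRel)), reconstructed here from the slide's worst/best bounds (the function name is illustrative):

```python
def normalized_recall(ranks: list, ndoc: int) -> float:
    """ranks: hitlist ranks r_i of the relevant documents (1-based).
    R_norm = 1 - (sum(r_i) - sum(i)) / (NRel * (NDoc - NRel))."""
    nrel = len(ranks)
    best = sum(range(1, nrel + 1))  # relevant docs occupy the top ranks
    return 1 - (sum(ranks) - best) / (nrel * (ndoc - nrel))

print(normalized_recall([1, 2, 3], ndoc=10))   # best case  -> 1.0
print(normalized_recall([8, 9, 10], ndoc=10))  # worst case -> 0.0
```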
4.3.8 One-Parameter Criteria
• Combining recall and precision• Classification accuracy• Sliding ratio• Point alienation
Combining recall and precision
• F-measure
– [Jardine & van Rijsbergen, 1971]
– [Lewis & Gale, 1994]
• Effectiveness
– [van Rijsbergen, 1979]
• F = ((β² + 1) · Precision · Recall) / (β² · Precision + Recall)
• E = 1 − F, with α = 1/(β² + 1)
• α = 0.5 ⇒ F is the harmonic mean of precision & recall
• E = 1 − 1 / (α/Precision + (1 − α)/Recall)
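The F- and E-measures above can be sketched directly; with β = 1 the F-measure reduces to the harmonic mean of precision and recall (function names are illustrative):

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def e_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """van Rijsbergen's effectiveness measure: E = 1 - F."""
    return 1 - f_measure(precision, recall, beta)

print(f_measure(0.5, 0.5))  # harmonic mean -> 0.5
```

β > 1 weights recall more heavily, β < 1 weights precision more heavily; at β = 1 the two are balanced.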
Classification accuracy
• Accuracy: correct identification of both relevant and irrelevant documents
• Accuracy = (RelRetr + ¬Rel¬Retr) / NDoc
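Accuracy counts both kinds of correct decisions: relevant documents retrieved and irrelevant documents left unretrieved. A minimal sketch using the same set-based representation as before (the function name is illustrative):

```python
def classification_accuracy(relevant: set, retrieved: set, ndoc: int) -> float:
    """Accuracy = (RelRetr + NotRel-NotRetr) / NDoc."""
    rel_retr = len(relevant & retrieved)            # relevant and retrieved
    notrel_notretr = ndoc - len(relevant | retrieved)  # irrelevant and not retrieved
    return (rel_retr + notrel_notretr) / ndoc

print(classification_accuracy({1, 2, 3, 4}, {3, 4, 5}, 10))  # (2 + 5) / 10 = 0.7
```

Note that when relevant documents are rare, a system retrieving nothing scores high accuracy, which is why precision and recall are usually preferred.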
Sliding ratio
• Imagine a nonbinary, metric Rel(di) measure
• Rank1, Rank2 computed by two separate systems
Point alienation
• Developed to measure human preference data
• Captures the fundamental nonmetric nature of RelFbk
4.3.9 Test corpora
• More data required for a "test corpus"
• Standard test corpora
• TREC: Text REtrieval Conference
• TREC's refined queries
• TREC constantly expanding, refining tasks
More data required for “test corpus”
• Documents• Queries• Relevance assessments Rel(q,d)• Perhaps other data too
– Classification data (Reuters)– Hypertext graph structure (EB5)
Standard test corpora
TREC constantly expanding, refining tasks
• Ad hoc queries tasks• Routing/filtering task• Interactive task
Other Measure
• Expected search length (ESL)
– Length of the "path" as the user walks down the hitlist
– ESL = number of irrelevant documents seen before each relevant document
– ESL for random retrieval
– ESL reduction factor
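Expected search length counts the irrelevant documents a user must inspect while walking down the hitlist until the desired number of relevant ones has been found. A sketch under that reading of the slide (function and parameter names are illustrative):

```python
def expected_search_length(ranked: list, relevant: set, want: int) -> int:
    """Number of irrelevant documents inspected while walking down the
    hitlist until `want` relevant documents have been found."""
    irrelevant_seen, rel_found = 0, 0
    for doc in ranked:
        if doc in relevant:
            rel_found += 1
            if rel_found == want:
                return irrelevant_seen
        else:
            irrelevant_seen += 1
    return irrelevant_seen  # hitlist exhausted before `want` were found

print(expected_search_length(["x", "a", "x", "b"], {"a", "b"}, want=2))  # 2
```

The ESL reduction factor then compares this against the ESL of a random ordering of the corpus: the smaller the ratio, the more work the ranking saves the user.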
4.5 Summary
• Discussed both metric and nonmetric relevance feedback
• The difficulties in getting users to provide relevance judgments for documents in the retrieved set
• Quantified several measures of system performance