IR MODELING: 검색 시스템의 모델링 (Modeling of Search Systems)
NAVER 임민섭


PROBABILISTIC RELEVANCE FRAMEWORK: BM25 AND BEYOND

• INTRODUCTION

The classical probabilistic modeling of IR is usually summarized as "estimating the probability of relevance between the query and each document, and ranking the documents for the given query in decreasing order of that probability." The best-known term-weighting and document-scoring function in this framework is BM25.

DEVELOPMENT OF THE BASIC MODEL

• Let's revisit relevance.

- The relatedness or suitability that a document has with respect to an information need, as judged by the user.

• And let's add a few assumptions:

- It is a property that can be judged from the information need alone, without reference to other documents.
- It is a binary property, taking one of two values: rel or nonrel.
- These assumptions may not be universally accepted, but if we take "relevant" to mean "what the user wants to see," they seem a reasonable notion.

PROBABILITY RANKING PRINCIPLE

• The system cannot know in advance the relevance property each document has, so expressing it as a probability is the best it can do.

• From the information the system has about the document and the query, it can estimate the probability of relevance.

• "If retrieved documents are ordered by decreasing probability of relevance on the data available, then the system's effectiveness is the best that can be obtained for the data."

RANKING FUNCTION FOR QUERY TERMS

• First, let's represent a document d as a vector of term frequencies:

$$ d = (tf_1, tf_2, \ldots, tf_{|V|}) $$

• What we want is the probability that query q and document d are relevant:

$$ P(R = \text{rel} \mid q, d) $$

• Since we care about ranking rather than the probability itself, we can use the rank-preserving odds of relevance as the ranking function:

$$ O(R = \text{rel} \mid d, q) = \frac{P(R = \text{rel} \mid d, q)}{P(R = \overline{\text{rel}} \mid d, q)} $$

RANKING FUNCTION FOR QUERY TERMS

$$ O(R = \text{rel} \mid d, q) = \frac{P(R = \text{rel} \mid d, q)}{P(R = \overline{\text{rel}} \mid d, q)} = \frac{P(R = \text{rel} \mid q)\, P(d \mid R = \text{rel}, q)\, /\, P(d \mid q)}{P(R = \overline{\text{rel}} \mid q)\, P(d \mid R = \overline{\text{rel}}, q)\, /\, P(d \mid q)} \quad \text{(by Bayes' rule)} $$

$$ = \frac{P(R = \text{rel} \mid q)}{P(R = \overline{\text{rel}} \mid q)} \cdot \frac{P(d \mid R = \text{rel}, q)}{P(d \mid R = \overline{\text{rel}}, q)} $$

The first factor is a positive constant for a given query, so under rank equivalence it can be dropped:

$$ \propto_q \frac{P(d \mid R = \text{rel}, q)}{P(d \mid R = \overline{\text{rel}}, q)} $$

RANKING FUNCTION FOR QUERY TERMS

Using the vector representation of the document and conditional independence between terms:

$$ \frac{P(d \mid R = \text{rel}, q)}{P(d \mid R = \overline{\text{rel}}, q)} = \prod_{i=1}^{|V|} \frac{P(TF_i = tf_i \mid R = \text{rel}, q)}{P(TF_i = tf_i \mid R = \overline{\text{rel}}, q)} $$

Splitting the product into the terms whose term frequency is 0 and the rest:

$$ = \prod_{tf_i > 0} \frac{P(tf_i \mid R = \text{rel}, q)}{P(tf_i \mid R = \overline{\text{rel}}, q)} \prod_{tf_i = 0} \frac{P(0 \mid R = \text{rel}, q)}{P(0 \mid R = \overline{\text{rel}}, q)} $$

RANKING FUNCTION FOR QUERY TERMS

• Let's add one more assumption: terms not contained in the query appear with the same probability in relevant and in non-relevant documents. That is, for $t_i \notin q$,

$$ P(tf_i \mid R = \text{rel}, q) = P(tf_i \mid R = \overline{\text{rel}}, q). $$

• The non-query terms therefore cancel out, and rewriting the expression over query terms only:

$$ = \prod_{t_i \in q,\, tf_i > 0} \frac{P(tf_i \mid R = \text{rel}, q)}{P(tf_i \mid R = \overline{\text{rel}}, q)} \prod_{t_i \in q,\, tf_i = 0} \frac{P(0 \mid R = \text{rel}, q)}{P(0 \mid R = \overline{\text{rel}}, q)} $$

RANKING FUNCTION FOR QUERY TERMS

• Multiplying the right-hand product by the terms with $tf_i > 0,\ t_i \in q$ as well, and dividing them back out of the left-hand product, gives the same expression:

$$ = \prod_{t_i \in q,\, tf_i > 0} \frac{P(tf_i \mid R = \text{rel}, q)\, P(0 \mid R = \overline{\text{rel}}, q)}{P(tf_i \mid R = \overline{\text{rel}}, q)\, P(0 \mid R = \text{rel}, q)} \cdot \prod_{t_i \in q} \frac{P(0 \mid R = \text{rel}, q)}{P(0 \mid R = \overline{\text{rel}}, q)} $$

The second product runs over all query terms regardless of the document, so it is a positive constant for a given query and can also be dropped under rank equivalence.

RANKING FUNCTION FOR QUERY TERMS

Since the log function preserves ranking, write the remaining per-term ratio as $U_i(tf_i)$; then:

$$ \propto_q \log \Bigl( \prod_{t_i \in q,\, tf_i > 0} U_i(tf_i) \Bigr) = \sum_{t_i \in q,\, tf_i > 0} \log U_i(tf_i) $$

$$ \Rightarrow\ P(\text{rel} \mid q, d) \propto_q \sum_{t_i \in q,\, tf_i > 0} w_i, \qquad w_i = \log \frac{P(tf_i \mid R = \text{rel}, q)\, P(0 \mid R = \overline{\text{rel}}, q)}{P(tf_i \mid R = \overline{\text{rel}}, q)\, P(0 \mid R = \text{rel}, q)} $$
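What the derivation buys us, as a minimal Python sketch (the dict-based structures and names here are illustrative assumptions, not from the slides): a document's score reduces to a sum of precomputed per-term weights over the query terms it contains.

```python
def score(doc_tf: dict, query_terms: set, term_weight: dict) -> float:
    """Sum the per-term weights w_i over query terms with tf_i > 0 in the document."""
    return sum(term_weight.get(t, 0.0) for t in query_terms if doc_tf.get(t, 0) > 0)
```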

RANKING FUNCTION FOR QUERY TERMS

• The drawback? This ranking function focuses only on rank order. In some situations an explicit probability of relevance for each document is preferred over rank order, and unfortunately this ranking function cannot express an explicit probability...

DERIVED MODELS

• The random variable TF we saw earlier can be taken to represent not only term frequency but any property a document may have.

• First, let's look at the model in which TF is a binary property (the term is present/absent in the document).

THE BINARY INDEPENDENCE MODEL

• Since TF is a binary random variable, the weight function we saw earlier,

$$ w_i = \log \frac{P(tf_i \mid \text{rel}, q)\, P(0 \mid \overline{\text{rel}}, q)}{P(tf_i \mid \overline{\text{rel}}, q)\, P(0 \mid \text{rel}, q)}, $$

becomes

$$ w_i^{BIM} = \log \frac{P(t_i \mid \text{rel}, q)\,\bigl(1 - P(t_i \mid \overline{\text{rel}}, q)\bigr)}{P(t_i \mid \overline{\text{rel}}, q)\,\bigl(1 - P(t_i \mid \text{rel}, q)\bigr)}. $$

THE BINARY INDEPENDENCE MODEL

• Since this expression is conditioned on relevance, let's assume for now that relevance judgements exist for the whole collection, and define the following notation:

N = size of the whole collection
n_i = number of docs in the collection containing t_i
R = relevant set size
r_i = number of judged relevant docs containing t_i

THE BINARY INDEPENDENCE MODEL

• With this notation, the intuitive estimates are

$$ P(t_i \mid \text{rel}, q) = \frac{r_i}{R}, \qquad P(t_i \mid \overline{\text{rel}}, q) = \frac{n_i - r_i}{N - R}, $$

but once the log is applied there are cases where the weight shoots off to infinity (e.g., R = 1, r_i = 0), so this won't do as-is.

• To get around the problem, let's add a small constant (a pseudo-count) to numerator and denominator.

THE BINARY INDEPENDENCE MODEL

With the pseudo-counts,

$$ P(t_i \mid \text{rel}, q) = \frac{r_i}{R} \;\to\; \frac{r_i + 0.5}{R + 1}, \qquad P(t_i \mid \overline{\text{rel}}, q) = \frac{n_i - r_i}{N - R} \;\to\; \frac{n_i - r_i + 0.5}{N - R + 1}, $$

and substituting these into the weight function above:

$$ w_i^{RSJ} = \log \frac{(r_i + 0.5)(N - R - n_i + r_i + 0.5)}{(n_i - r_i + 0.5)(R - r_i + 0.5)} $$
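A minimal Python sketch of the RSJ weight as just derived (the function and argument names are illustrative assumptions):

```python
from math import log

def w_rsj(N: int, n_i: int, R: int, r_i: int) -> float:
    """Robertson/Sparck Jones weight with 0.5 pseudo-counts."""
    return log(((r_i + 0.5) * (N - R - n_i + r_i + 0.5)) /
               ((n_i - r_i + 0.5) * (R - r_i + 0.5)))

# e.g. w_rsj(N=1000, n_i=50, R=1, r_i=0) stays finite, where the raw
# estimates r_i/R and (n_i - r_i)/(N - R) would send the log to -infinity.
```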

THE BINARY INDEPENDENCE MODEL

• Next, let's consider a more realistic situation: suppose only a small portion of the collection has relevance judgements, not the whole thing.

• For the judged set we can compute weights with the w^RSJ obtained above. For the unjudged set, add one more assumption: any unjudged document is simply nonrel. This is the complement method.

• Updating the notation under the new assumption:

N = size of the judged sample
n_i = number of docs in the sample containing t_i

THE BINARY INDEPENDENCE MODEL

• Empirically, the complement method gives better results than estimating from the judged sample.

• Finally, suppose there is no relevance information at all. Using the complement method, every document is non-relevant to the query, i.e., R = r_i = 0. The weight function then becomes the following (a weight function similar to classical idf):

$$ w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} \;\sim\; w_i^{idf} = \log \frac{N}{n_i} \quad \text{(classical idf)} $$
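A quick check, reusing the w_rsj sketch above: setting R = r_i = 0 collapses it to the idf-like weight on this slide.

```python
from math import log

def w_idf(N: int, n_i: int) -> float:
    return log((N - n_i + 0.5) / (n_i + 0.5))

# For any N and n_i: w_idf(N, n_i) == w_rsj(N, n_i, R=0, r_i=0),
# since the two 0.5 pseudo-counts cancel when R = r_i = 0.
```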

RELEVANCE FEEDBACK AND QUERY EXPANSION

• Oh, so relevance feedback (query modification) is now possible! With no prior information about relevance, do term weighting with w_IDF; once some relevance judgements come in, re-weight with w_RSJ and then include terms in the query in decreasing order of term weight.

• But term re-weighting alone is not very effective at improving retrieval, and the terms included this way carry a lot of noise. Is there a more conservative method?

• With w_RSJ, rare terms would get excessively high weights. A rare term probably does correlate strongly with relevance, but since few documents contain such terms, they won't help retrieval results much.

• So let's look at a weight that measures how much including a term affects the overall score: the offer weight.

RELEVANCE FEEDBACK AND QUERY EXPANSION

$$ OW_i = \bigl( P(tf_i \mid \text{rel}) - P(tf_i \mid \overline{\text{rel}}) \bigr) \cdot w_i $$

$$ \approx P(tf_i \mid \text{rel}) \cdot w_i \quad \text{because } P(tf_i \mid \text{rel}) \gg P(tf_i \mid \overline{\text{rel}}) $$

$$ \approx \frac{r_i}{R}\, w_i \;\propto_q\; r_i \cdot w_i^{RSJ} = OW_i^{RSJ} \quad \text{(R is a positive constant)} $$
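In code, the rank-equivalent form is a one-liner (a sketch; names are assumptions):

```python
def offer_weight(r_i: int, w_rsj_i: float) -> float:
    """OW_i^RSJ = r_i * w_i^RSJ; the 1/R factor is dropped since R is
    constant within a feedback round and only ranking matters."""
    return r_i * w_rsj_i
```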

RELEVANCE FEEDBACK AND QUERY EXPANSION

• The query expansion procedure using OW then goes like this (see the sketch below):

(1) Extract all terms from the relevant documents and rank them by OW_RSJ.
(2) Include the first k terms of the ranked list in the query.

• This query expansion method targets models using BIM and RSJ weighting, but it is said to have been fairly successful with BM25 as well. It has drawbacks, of course: it can bring synonyms into the query, but in doing so it naturally weakens the query term independence assumption (and the weight functions considered so far rest on that independence assumption).
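A minimal sketch of the two-step procedure, reusing the w_rsj and offer_weight sketches above (the document and collection representations are illustrative assumptions):

```python
def expand_query(query: set, relevant_docs: list, k: int,
                 N: int, doc_freq: dict) -> set:
    """relevant_docs: each judged-relevant document as a set of its terms."""
    R = len(relevant_docs)
    r = {}                                   # r_i per candidate term
    for terms in relevant_docs:              # (1) extract all terms
        for t in terms:
            r[t] = r.get(t, 0) + 1
    ranked = sorted(r, reverse=True,
                    key=lambda t: offer_weight(r[t], w_rsj(N, doc_freq[t], R, r[t])))
    return query | set(ranked[:k])           # (2) include the top k
```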

BLIND FEEDBACK

• First, assume there are no relevance judgements at all, and follow this procedure (sketched below):

(1) Run a search with the initial query.
(2) Assume the top k documents are relevant.
(3) Apply the query expansion method from before: extract terms from the relevant set assumed in (2), rank them with OW_RSJ, and include the top m in the query!
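One round of the loop, as a sketch on top of expand_query above; the search callable (returning ranked documents as term sets) is an assumed stand-in for a real retrieval system:

```python
def blind_feedback(query: set, search, k: int, m: int,
                   N: int, doc_freq: dict) -> set:
    top_docs = search(query)[:k]                           # (1)+(2): pseudo-relevant set
    return expand_query(query, top_docs, m, N, doc_freq)   # (3): expand as before
```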

BLIND FEEDBACK

• Let's wrap up blind feedback with the following points:

(1) Blind feedback helps retrieval quality on average, but it fails when the initial results are poor... (In the extreme case, if none of the initial top k documents are relevant, the query gets expanded with entirely wrong terms...)

(2) Since the documents are, as the name says, selected blindly, it would be better to consider an explicit probability of relevance rather than flatly declaring them "relevant". But as noted earlier, the current model (which considers only rank) has no way of providing one...

THE ELITENESS MODEL AND BM25

• Let's introduce a new concept: eliteness. This is another binary property, which can be thought of as a map

$$ T \times D \to \{\text{elite}, \overline{\text{elite}}\} $$

with the definition

$$ E(t, d) = \begin{cases} \text{elite}, & \text{if } d \text{ is about the term } t \\ \overline{\text{elite}}, & \text{otherwise} \end{cases} $$

THE ELITENESS MODEL AND BM25

• Assumptions carried by the eliteness property:

(1) $tf_i$ depends on the eliteness.
- That is, how frequently a term t occurs within a document d is influenced by "whether or not d is about t."

(2) There "may be" an association between eliteness and relevance.

(3) The two assumptions above fully explain the relationship between tf and relevance (that is, tf is independent of relevance given eliteness).

THE ELITENESS MODEL AND BM25

• Now the notation for the eliteness property:

(1) $p_{i1} = P(E_i = \text{elite} \mid \text{rel})$ : the probability that d is about $t_i$, given that d is relevant.

(2) $p_{i0} = P(E_i = \text{elite} \mid \overline{\text{rel}})$

(3) $E_{i1}(tf) = P(TF_i = tf \mid E_i = \text{elite})$ : the probability that $t_i$ occurs tf times in d, given that d is about $t_i$.

(4) $E_{i0}(tf) = P(TF_i = tf \mid E_i = \overline{\text{elite}})$

(5) $\Rightarrow\ P(TF_i = tf_i \mid \text{rel}) = p_{i1} \cdot E_{i1}(tf_i) + (1 - p_{i1}) \cdot E_{i0}(tf_i)$

THE ELITENESS MODEL AND BM25

• Substituting (5) into the previous weight function:

$$ w_i^{elite} = \log \frac{\bigl(p_1 E_1(tf) + (1 - p_1) E_0(tf)\bigr)\bigl(p_0 E_1(0) + (1 - p_0) E_0(0)\bigr)}{\bigl(p_1 E_1(0) + (1 - p_1) E_0(0)\bigr)\bigl(p_0 E_1(tf) + (1 - p_0) E_0(tf)\bigr)} $$

THE ELITENESS MODEL AND BM25

• Just one more assumption: term frequency follows a Poisson distribution. That is, $E_{i0}(tf) \sim \mathcal{P}(\lambda_{i0})$, and likewise $E_{i1}(tf) \sim \mathcal{P}(\lambda_{i1})$. In general we expect $\lambda_{i1} > \lambda_{i0}$ (if d is about $t_i$, we expect $t_i$ to occur more often; makes sense). And we call this the 2-Poisson model.

• The Poisson distribution:

$$ P(k \text{ events in interval}) = \frac{\lambda^k e^{-\lambda}}{k!} $$
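A sketch of the eliteness weight with the two Poissons plugged in for E_1 and E_0 (the parameter values in the comment are made up for illustration):

```python
from math import exp, factorial, log

def poisson_pmf(k: int, lam: float) -> float:
    return lam ** k * exp(-lam) / factorial(k)

def w_elite(tf: int, p1: float, p0: float, lam1: float, lam0: float) -> float:
    """2-Poisson eliteness weight: substitute Poisson E_1, E_0 into the formula above."""
    E1 = lambda t: poisson_pmf(t, lam1)
    E0 = lambda t: poisson_pmf(t, lam0)
    num = (p1 * E1(tf) + (1 - p1) * E0(tf)) * (p0 * E1(0) + (1 - p0) * E0(0))
    den = (p1 * E1(0) + (1 - p1) * E0(0)) * (p0 * E1(tf) + (1 - p0) * E0(tf))
    return log(num / den)

# e.g. [round(w_elite(tf, p1=0.5, p0=0.1, lam1=8.0, lam0=0.5), 3) for tf in range(6)]
# starts at 0 for tf = 0 and flattens out: the saturation discussed below.
```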

THE ELITENESS MODEL AND BM25

• But why should tf follow a Poisson distribution? Where does this Poisson come from?

• Looking at Harter's modeling:

(1) Each document is generated by "filling word positions," and

(2) the probability distribution of which word fills each position is multinomial.
- That is, the probability of each word filling each position is fixed (not position dependent) and independent of the other words.

(3) Therefore, for a given term, tf follows a binomial distribution, which is close to a Poisson! Good; now it's clear why tf was assumed to follow a Poisson.
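A quick numerical check of step (3), with made-up toy numbers: for a long document and a small per-position probability, the binomial pmf and the Poisson pmf with matching mean are nearly indistinguishable.

```python
from math import comb, exp, factorial

n, p = 10_000, 0.0002      # word positions, per-position probability (assumed)
lam = n * p                # matching Poisson mean
for tf in range(5):
    binomial = comb(n, tf) * p ** tf * (1 - p) ** (n - tf)
    poisson = lam ** tf * exp(-lam) / factorial(tf)
    print(tf, f"{binomial:.6f}", f"{poisson:.6f}")
```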

THE ELITENESS MODEL AND BM25

• So we could substitute the Poisson distributions for $E_{i1}$ and $E_{i0}$ and look at the resulting expression... but the result is quite messy. Can it be made simpler somehow?

• It would be good to approximate with a simpler function that has a similar "shape" to $w_i^{elite}$. So let's look at the general characteristics of $w_i^{elite}$:

(1) $w_i^{elite}(0) = 0$ (if the term frequency is 0, the weight is 0).

(2) $w_i^{elite}(tf)$ increases as $tf$ increases.

(3) But as $tf$ goes to infinity, $w_i^{elite}(tf)$ converges to $w_i^{BIM}$ : saturation (there is a limit to the influence any single term can have on document scoring).

THE ELITENESS MODEL AND BM25

• Hmm... what kind of function satisfies those characteristics? Not obvious, so let's start with a simple one:

$$ \frac{tf}{k + tf} \quad \text{for some } k > 0 $$

• This function satisfies all three properties mentioned above.

• When k is small, it reaches its asymptote quickly; the larger k is, the longer increases in tf keep showing an effect.
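A few values make the role of k concrete (a throwaway sketch):

```python
def saturation(tf: float, k: float) -> float:
    return tf / (k + tf)

for k in (0.5, 1.2, 2.0):
    print(k, [round(saturation(tf, k), 2) for tf in (0, 1, 2, 5, 10, 100)])
# small k: close to the asymptote 1 almost immediately;
# larger k: additional occurrences keep raising the value for longer.
```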

THE ELITENESS MODEL AND BM25

• Combining saturation with the asymptotic maximum ($w^{BIM}$), approximated by $w^{RSJ}$, the early form of BM25 is born:

$$ w_i(tf) = \frac{tf}{k + tf}\, w_i^{RSJ} $$

• One last point remains to be considered: document length!

THE ELITENESS MODEL AND BM25

• Document length can grow through the following two factors:

(1) Verbosity: an author who says the same thing on and on at length (extreme case: copy-pasting a document to n times its size).

(2) Scope: an author who crams several topics into one document (extreme case: concatenating documents on completely different topics).

• Where verbosity is observed, normalizing by document length is the right move; where scope is at work, the opposite applies.

THE ELITENESS MODEL AND BM25

• Real documents exhibit both verbosity and scope, and generally the two come in combination. With that in mind, consider the following notions:

Document length:
$$ dl = \sum_{i=1}^{|V|} tf_i $$

Average document length: $avdl$

Length normalisation component:
$$ B = \Bigl( (1 - b) + b\, \frac{dl}{avdl} \Bigr), \quad 0 \le b \le 1 $$

(b = 1 gives full normalization; b = 0 gives no normalization.)

THE ELITENESS MODEL AND BM25

• Now, let's take these terms and apply them to the saturation function (see the sketch below):

$$ tf' = \frac{tf}{B} $$

$$ w_i^{BM25}(tf) = \frac{tf'}{k_1 + tf'} \cdot w_i^{RSJ} = \frac{tf}{k_1\bigl((1 - b) + b\,\frac{dl}{avdl}\bigr) + tf} \cdot w_i^{RSJ} $$

• Many experiments have been run, and 0.5 < b < 0.8 together with 1.2 < k < 2 works reasonably well. It is also held, though, that the optimal values depend on other factors such as the kinds of documents and queries.
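Putting the pieces together as a sketch, reusing the w_rsj sketch from earlier (the default k1 and b sit inside the quoted ranges):

```python
def w_bm25(tf: float, dl: float, avdl: float, w_rsj_i: float,
           k1: float = 1.2, b: float = 0.75) -> float:
    B = (1 - b) + b * dl / avdl          # length normalisation component
    tf_prime = tf / B                    # length-normalized term frequency
    return tf_prime / (k1 + tf_prime) * w_rsj_i

# A document's BM25 score is the sum of w_bm25 over the query terms it contains.
```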

MULTIPLE STREAMS AND BM25F

• So far we have treated a document as text in a single body, with no particular structure. But documents usually have at least a minimal structure. Here, let's assume a document has a structure composed of fields, or streams.

Example) a paper: title / abstract / body / references

• Title, abstract, body, and references are this document's streams.

• The reason for dividing into streams is that a particular stream may possibly have a greater influence on relevance! (e.g., one can take a query term match in the title stream to indicate higher relevance than a query term match in the references stream.)

MULTIPLE STREAMS AND BM25F

• Now, suppose we have a scoring function f applicable to any chunk of text (not divided into streams). Intuitively, we could apply f to each stream and take a linear combination.

• But with that approach, taking the eliteness model as an example, each (stream, term) pair would get its own distinct eliteness property, treating a term independently across the different streams of a single document. That doesn't really make sense...

• A setup where the (term, document) property is shared across the streams of a document is better.

MULTIPLE STREAMS AND BM25F

• Notations!

streams: $s = 1, \ldots, S$
stream length: $sl_s$
stream weights: $v_s$
document: $(\mathbf{tf}_1, \ldots, \mathbf{tf}_{|V|})$
$\mathbf{tf}_i$ vector: $(tf_{1i}, \ldots, tf_{Si})$

MULTIPLE STREAMS AND BM25F

• A simple extension of BM25 with stream weights added:

$$ \widetilde{tf}_i = \sum_{s=1}^{S} v_s\, tf_{si} \qquad \widetilde{dl} = \sum_{s=1}^{S} v_s\, sl_s \qquad \widetilde{avdl} = \text{average of } \widetilde{dl} \text{ across documents} $$

$$ w_i^{simpleBM25F} = \frac{\widetilde{tf}_i}{k_1\bigl((1 - b) + b\, \frac{\widetilde{dl}}{\widetilde{avdl}}\bigr) + \widetilde{tf}_i} \cdot w_i^{RSJ} $$

MULTIPLE STREAMS AND BM25F

• Another extension, accounting for a different normalisation factor per stream (see the sketch below):

$$ B_s = \Bigl( (1 - b_s) + b_s\, \frac{sl_s}{avsl_s} \Bigr), \quad 0 \le b_s \le 1 $$

$$ \widetilde{tf}_i = \sum_{s=1}^{S} v_s\, \frac{tf_{si}}{B_s} \qquad w_i^{BM25F} = \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \cdot w_i^{RSJ} $$
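A sketch covering the per-stream variant (argument names are illustrative; w_rsj_i as before):

```python
def w_bm25f(tf_si: list, sl: list, avsl: list, v: list, b_s: list,
            w_rsj_i: float, k1: float = 1.2) -> float:
    """Per-stream normalisation: tf~_i = sum_s v_s * tf_si / B_s."""
    tf_tilde = sum(v[s] * tf_si[s] / ((1 - b_s[s]) + b_s[s] * sl[s] / avsl[s])
                   for s in range(len(tf_si)))
    return tf_tilde / (k1 + tf_tilde) * w_rsj_i

# The simple variant instead combines first (tf~_i = sum_s v_s * tf_si) and
# normalizes once, using the combined document length dl~ and its average.
```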

MULTIPLE STREAMS AND BM25F

• Both weight functions perform decently:

$$ w_i^{simpleBM25F} = \frac{\widetilde{tf}_i}{k_1 \widetilde{B} + \widetilde{tf}_i} \cdot w_i^{RSJ} \qquad w_i^{BM25F} = \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \cdot w_i^{RSJ} $$

• There is also a degenerate case: a stream so verbose that it contains nearly every term.

• As before, when there is no relevance information, proceed with IDF in place of RSJ.

REFERENCES

• The Probabilistic Relevance Framework: BM25 and Beyond, S. Robertson and H. Zaragoza.
• Evaluation in information retrieval, Introduction to Information Retrieval: http://npl.stanford.edu/IR-book/pdf/11prob.pdf
• IRBasic_Modeling_조근희.pdf
• 정보 검색론 (Information Retrieval), 이준호.