
3. Weighting and Matching against Indices


Page 1: 3. Weighting and Matching against Indices

3. Weighting and Matching against Indices

2007.1.20. Artificial Intelligence Lab, Seungmi Song

Text: Finding Out About, pages 60-104

Page 2: 3. Weighting and Matching against Indices

Microscopic Semantics and the Statistics of Communication

• Table 3.1 English Letter Frequency

• Character frequencies are good for breaking simple ciphers and solving crosswords. For example:

UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF
PESXUDBMETSXAIZVUEPHZHMDZSHZOWS
FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD
ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ

Frequency count of the ciphertext: P occurs 16 times and Z occurs 14 times, so they most likely correspond to E and T.
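A minimal sketch of that frequency count in Python (the ciphertext is the one on the slide):

```python
from collections import Counter

# Ciphertext from the slide (simple substitution cipher example).
ciphertext = (
    "UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF"
    "PESXUDBMETSXAIZVUEPHZHMDZSHZOWS"
    "FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD"
    "ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ"
)

# Count each cipher letter and list the most common ones.
for letter, n in Counter(ciphertext).most_common(5):
    print(letter, n)
# P (16) and Z (14) come out on top, so by the English letter
# frequencies of Table 3.1 they likely stand for E and T.
```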

Page 3: 3. Weighting and Matching against Indices

In this Chapter…

• What are we counting?
• What does the distribution of frequency occurrences across this level of features tell us about the pattern of their use?
• What can we tell about the meaning of these features, based on such statistics?
• How can we find meaning in text?
• How are such attempts to be distinguished?

Page 4: 3. Weighting and Matching against Indices

Remember Zipf

• The linguist George Kingsley Zipf
• Counted every word occurring in English-language books and tabulated the frequencies
• The words Americans use most often: the (1000) → of (500) → and (250) → to (125)
• Only a handful of words are used very frequently; most other words are used a similarly small number of times

Page 5: 3. Weighting and Matching against Indices

• F(w): the number of times word w occurs anywhere in the corpus

• Sort the vocabulary according to frequency

Ex) r = 1 → the most frequently occurring word; r = 2 → the next most frequently used word

Page 6: 3. Weighting and Matching against Indices

• Zipf’s law
– An empirical observation
– F(r): the frequency of the rank-r word
– F(r) = C / r^α, with α ≈ 1 and C ≈ 0.1
– (A mathematical derivation of Zipf’s law appears in Chapter 5)

Page 7: 3. Weighting and Matching against Indices

Zipfian Distribution of AIT Words

• Word frequency as a function of its frequency rank
• Log/log plot
• Nearly linear
• Negative slope
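A small sketch of how such a plot is produced: rank words by frequency and fit a line to the (log rank, log frequency) points; under Zipf’s law the slope comes out near -α ≈ -1. The word list here is a toy stand-in for a real corpus:

```python
import math
from collections import Counter

def zipf_slope(words):
    """Fit a line to the (log rank, log frequency) points.

    Under Zipf's law F(r) = C / r**alpha the slope is about -alpha.
    """
    freqs = sorted(Counter(words).values(), reverse=True)
    pts = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1)]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    return (sum((x - mx) * (y - my) for x, y in pts)
            / sum((x - mx) ** 2 for x, _ in pts))

# Toy input only; a clean estimate of alpha needs a large corpus.
words = "the of the and the of to the and of the to the of and the".split()
print(zipf_slope(words))  # negative slope, roughly -1 on real text
```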

Page 8: 3. Weighting and Matching against Indices

Principle of Least Effort

• Words as tools
• Unification
– Authors would like to use a single word, always
• Diversification
– Readers would like a unique word for each purpose
• Vocabulary balance
– Use existing words, and avoid coining new ones

Page 9: 3. Weighting and Matching against Indices

WWW Surfing Behavior

• A recent example of Zipf-like distributions

Page 10: 3. Weighting and Matching against Indices

A Statistical Basis for Keyword Meaning

• Non-noise words
• Noise words occur very frequently
• External keywords
• Internal keywords

Page 11: 3. Weighting and Matching against Indices

Word Occurrence as a Poisson Process

• Function words: of, the, but
– Occur randomly throughout arbitrary text
• Content words
– Cluster in the documents that are about them, so they do not fit the Poisson model
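To make the contrast testable: if a word with collection frequency cf were scattered at random over N documents (a Poisson process with mean λ = cf/N per document), the expected number of documents containing it at least once would be N·(1 − e^(−λ)). A sketch of that check, on an invented stand-in corpus:

```python
import math

def poisson_check(docs, word):
    """Observed document frequency vs. the Poisson (random-scatter) prediction."""
    n_docs = len(docs)
    cf = sum(doc.count(word) for doc in docs)     # total occurrences in the corpus
    df = sum(1 for doc in docs if word in doc)    # documents containing the word
    lam = cf / n_docs                             # Poisson mean occurrences per doc
    predicted_df = n_docs * (1 - math.exp(-lam))  # expected docs with >= 1 occurrence
    return df, predicted_df

# Stand-in corpus; on real text, function words track the prediction
# while content words fall below it (their occurrences cluster).
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "cat", "cat"]]
print("the:", poisson_check(docs, "the"))
print("cat:", poisson_check(docs, "cat"))
```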

Page 12: 3. Weighting and Matching against Indices

Resolving Power (1/2)

• Repetition as an indication of emphasis
• Resolving power = the ability of words to discriminate content
• Maximal at middle ranks
• Thresholds to filter out the others:
– High-frequency noise words
– Low-frequency, rare words
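A minimal sketch of that threshold filtering, with assumed (arbitrary) cutoff fractions rather than values from the text:

```python
from collections import Counter

def middle_band(words, noise_frac=0.2, rare_frac=0.2):
    """Keep middle-rank words, where resolving power is maximal.

    noise_frac and rare_frac are assumed cutoffs: the fractions of the
    ranked vocabulary dropped from the frequent and rare ends.
    """
    ranked = [w for w, _ in Counter(words).most_common()]
    lo = int(len(ranked) * noise_frac)               # drop high-frequency noise words
    hi = len(ranked) - int(len(ranked) * rare_frac)  # drop low-frequency, rare words
    return ranked[lo:hi]

words = "the the the the cat cat retrieval retrieval dog information".split()
print(middle_band(words))  # drops "the" (noise) and one rare word
```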

Page 13: 3. Weighting and Matching against Indices

Resolving Power (2/2)

High-frequency words appear in so many documents that they are of little use for distinguishing or characterizing them.

Very rare words, used only a handful of times, are likewise of little help for distinguishing documents in general.

Page 14: 3. Weighting and Matching against Indices

Language Distribution

• Exhaustivity: the number of topics indexed
• Specificity: the ability to describe the FOA information need precisely
• Index: a balance between user and corpus
• Not too exhaustive, not too specific

Page 15: 3. Weighting and Matching against Indices

• Exhaustivity ≈ the number of terms assigned to a document
• Exhaustive ▷ high recall, low precision
• Document-oriented “representation” bias

• Specificity ≈ 1 / the number of documents assigned the same term
• Specific ▷ low recall, high precision
• Query-oriented “discrimination” bias
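Both measures can be read straight off an index. A small sketch, assuming each document is represented by the set of index terms assigned to it (the toy index is invented for illustration):

```python
# Toy index: each document maps to the set of terms assigned to it.
index = {
    "d1": {"information", "retrieval", "zipf"},
    "d2": {"information", "retrieval"},
    "d3": {"information", "poisson"},
}

# Exhaustivity ~ number of terms assigned to a document.
exhaustivity = {doc: len(terms) for doc, terms in index.items()}

# Specificity ~ 1 / number of documents a term is assigned to.
doc_count = {}
for terms in index.values():
    for t in terms:
        doc_count[t] = doc_count.get(t, 0) + 1
specificity = {t: 1 / n for t, n in doc_count.items()}

print(exhaustivity)  # d1 is indexed most exhaustively
print(specificity)   # "zipf" and "poisson" are the most specific terms
```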

Page 16: 3. Weighting and Matching against Indices


Specificity/Exhaustivity Trade-Offs

Page 17: 3. Weighting and Matching against Indices


Indexing Graph

Page 18: 3. Weighting and Matching against Indices

Weighting the Index Relation

• Weight: the strength of association, expressed as a single real number
• The strength of the relationship between a keyword and a document

Page 19: 3. Weighting and Matching against Indices

Informative Signals vs. Noise Words

• The least informative words (noise words) occur uniformly across the corpus
– Ex) the
• Informative signals
– A measure used to weight the keyword-document association

Page 20: 3. Weighting and Matching against Indices

Hypothetical Word Distributions

(Figure: hypothetical word distributions across documents; the labels read “uniform distribution” and “rarely happens”)

Page 21: 3. Weighting and Matching against Indices

Inverse Document Frequency

• Up to this point we have counted raw word occurrences; what we would really like to know is the number of documents containing a keyword ▷ IDF
• IDF (Inverse Document Frequency)
– The inverse of the number of documents in the corpus in which keyword k appears
– A comparison in terms of documents, not just word occurrences
– High IDF: few documents contain keyword k; low IDF: many do

Page 22: 3. Weighting and Matching against Indices

Vector Space

• With vectors we can measure how far, and in which direction, a point lies from some reference position
• Introduced to put the question of how to compute the similarity between documents on a more mathematical footing

Page 23: 3. Weighting and Matching against Indices

(Figure: Document 1, Document 2, and the query plotted on the (information, retrieval) axes)

For simplicity, consider a two-dimensional plane based on just two index terms:

- the (information, retrieval) coordinate system
- Document 1: D1(0.8, 0.3)
- Document 2: D2(0.2, 0.7)
- Query “information retrieval”: Q(0.4, 0.8)
- Which document is closer to the query, Document 1 or Document 2?
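Working the slide’s numbers with cosine similarity (the usual closeness measure in the vector-space model; the slide itself only poses the question), Document 2 comes out closer to the query:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Coordinates from the slide, on the (information, retrieval) axes.
d1 = (0.8, 0.3)
d2 = (0.2, 0.7)
q = (0.4, 0.8)

print(cosine(d1, q))  # ~0.73
print(cosine(d2, q))  # ~0.98 -> Document 2 is closer to the query
```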

Page 24: 3. Weighting and Matching against Indices

• Three keywords
• The document closest to the query is D1

Page 25: 3. Weighting and Matching against Indices

Calculating TF-IDF Weighting

• TF: term frequency; IDF: inverse document frequency

• idf_k = log(N_doc / D_k)
• w_kd = f_kd × idf_k

• f_kd: the frequency with which keyword k occurs in document d
• N_doc: the total number of documents in the corpus
• D_k: the number of documents containing keyword k
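A direct transcription of those formulas into Python (natural log assumed, since the slide does not fix the logarithm’s base; the toy documents are invented):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute w_kd = f_kd * idf_k with idf_k = log(N_doc / D_k).

    docs: a list of token lists; returns one {keyword: weight} dict per doc.
    """
    n_doc = len(docs)
    d_k = Counter()            # D_k: number of documents containing keyword k
    for doc in docs:
        d_k.update(set(doc))
    idf = {k: math.log(n_doc / d) for k, d in d_k.items()}
    return [{k: f * idf[k] for k, f in Counter(doc).items()} for doc in docs]

docs = [["information", "retrieval", "retrieval"],
        ["information", "zipf"],
        ["zipf", "law"]]
for weights in tf_idf(docs):
    print(weights)
# In doc 1, "retrieval" scores 2 * log(3/1); "information" only log(3/2),
# since it appears in two of the three documents.
```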

Page 26: 3. Weighting and Matching against Indices


SMART Weighting Specification

Page 27: 3. Weighting and Matching against Indices

(Table residue: weighting variants labeled “inverse”, “squared”, “probabilistic”, and “frequency”)