3. Weighting and Matching against Indices
2007.1.20. Artificial Intelligence Lab, Seungmi Song
Text: Finding Out About, pp. 60-104
Microscopic Semantics and the Statistics of Communication
• Table 3.1 English Letter Frequency
• Character frequencies are good for simple ciphers and crosswords:

UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF
PESXUDBMETSXAIZVUEPHZHMDZSHZOWS
FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD
ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ

• Frequency count results: P occurs 16 times and Z occurs 14 times, so P and Z are the most likely candidates for E and T.
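A minimal sketch of the frequency count behind this example, in standard-library Python (the ciphertext is copied from the slide):

```python
from collections import Counter

# Ciphertext from the slide (a simple substitution cipher).
cipher = (
    "UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF"
    "PESXUDBMETSXAIZVUEPHZHMDZSHZOWS"
    "FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD"
    "ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ"
)

# Count every character and list the most common ones.
for ch, n in Counter(cipher).most_common(5):
    print(ch, n)

# P (16) and Z (14) top the list, so they are the best candidates
# for the most frequent English letters, E and T.
```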
In this Chapter…
• What are we counting?
• What does the distribution of frequency occurrences across this level of features tell us about the pattern of their use?
• What can we tell about the meaning of these features, based on such statistics?
• How can we find meaning in text?
• How are such attempts to be distinguished?
Remember Zipf
• Linguist George Kingsley Zipf
• Counted every word appearing in English-language books to study word frequencies
• The words Americans use most often: the (1000) → of (500) → and (250) → to (125)
• Only a small number of words are used frequently; most other words are used a similarly small number of times
• F(w): the number of times word w occurs anywhere in the corpus
• Sort the vocabulary according to frequency
Ex) r = 1 → the most frequently occurring word
    r = 2 → the next most frequently used word
• Zipf’s law
– Empirical observation
– F(r): frequency of the rank-r word
– F(r) = C / r^α, with α ≈ 1 and C ≈ 0.1 (a mathematical derivation of Zipf’s law appears in chapter 5)
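A quick empirical check, as a sketch: with α ≈ 1, rank × relative frequency should hover near the constant C. The corpus path corpus.txt is a placeholder for any plain-text file.

```python
from collections import Counter

# Tokenize a plain-text corpus very crudely (whitespace split).
words = open("corpus.txt").read().lower().split()
freq = Counter(words)
total = sum(freq.values())

# Rank 1 is the most frequent word; Zipf's law F(r) = C / r predicts
# that r * F(r) / total stays roughly constant (around 0.1 in English).
for rank, (word, count) in enumerate(freq.most_common(10), start=1):
    print(rank, word, rank * count / total)
```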
Zipfian Distribution of AIT Words
• Word frequency as a function of its frequency rank
• Log/log plot
• Nearly linear
• Negative slope
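The same rank/frequency data can be plotted to reproduce this picture; a sketch using matplotlib (assuming the same hypothetical corpus.txt):

```python
import matplotlib.pyplot as plt
from collections import Counter

words = open("corpus.txt").read().lower().split()
freqs = sorted(Counter(words).values(), reverse=True)

# On log/log axes, F(r) = C / r^alpha shows up as a nearly straight
# line with slope about -alpha (close to -1 for Zipfian data).
plt.loglog(range(1, len(freqs) + 1), freqs)
plt.xlabel("rank r")
plt.ylabel("frequency F(r)")
plt.title("Rank/frequency plot")
plt.show()
```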
Principle of Least Effort
• Words as tools
• Unification
– Authors would like to always use a single word
• Diversification
– Readers would like a unique word for each purpose
• Vocabulary balance
– Use existing words, and avoid coining new ones
WWW surfing behavior
• A recent example of Zipf-like distributions
A Statistical Basis for Keyword Meaning
• Noise words occur very frequently
• Non-noise words
• External keywords
• Internal keywords
Word Occurrence as a Poisson Process
• Function words: of, the, but
– Occur randomly throughout arbitrary text
• Content words
– Do not occur at random: their occurrences cluster in the documents that are about them
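A sketch of how the Poisson model separates the two cases. The counts below are made up for illustration: if a word's cf occurrences were scattered at random over N documents, a Poisson model predicts that a fraction 1 − e^(−cf/N) of the documents contain it.

```python
import math

def poisson_doc_fraction(cf: int, n_docs: int) -> float:
    """Fraction of documents expected to contain a word whose cf
    occurrences are scattered at random (Poisson with rate cf/N)."""
    return 1.0 - math.exp(-cf / n_docs)

# Hypothetical counts for a 1000-document corpus.
n_docs = 1000
for word, cf, observed_df in [("the", 5000, 1000), ("poisson", 50, 8)]:
    expected_df = poisson_doc_fraction(cf, n_docs) * n_docs
    print(word, "expected:", round(expected_df), "observed:", observed_df)

# Function words land near the Poisson prediction; content words show
# up in far fewer documents than predicted because they cluster.
```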
Resolving Power (1/2)
• Repetition as an indication of emphasis
• Resolving power = the ability of words to discriminate content
• Maximal at middle rank
• Thresholds to filter out the others (see the sketch below)
– High-frequency noise words
– Low-frequency, rare words
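A sketch of the two-threshold filter; the cutoff values here are illustrative, not canonical:

```python
from collections import Counter

def mid_rank_terms(tokens, min_count=5, max_fraction=0.01):
    """Keep the middle-rank words with the highest resolving power:
    drop words rarer than min_count occurrences and words that make
    up more than max_fraction of all tokens."""
    freq = Counter(tokens)
    total = sum(freq.values())
    return {
        word: count
        for word, count in freq.items()
        if count >= min_count and count / total <= max_fraction
    }
```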
Resolving Power (2/2)
• High-frequency words appear in so many documents that they are of little use for distinguishing or representing them
• Rare words, used only a handful of times, do not help with general document discrimination either
Language Distribution
• Exhaustivity: the number of topics indexed
• Specificity: the ability to describe the FOA information need precisely
• Index: a balance between user and corpus
• Not too exhaustive, not too specific
• Exhaustivity ≈ N(terms assigned to a document)
• Exhaustive ▷ high recall, low precision
• Document-oriented “representation” bias
• Specificity ≈ N(documents assigned the same term)^-1
• Specific ▷ low recall, high precision
• Query-oriented “discrimination” bias
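Both quantities fall out of a document-to-terms index directly; a minimal sketch over a made-up two-document index:

```python
from collections import Counter

# Toy index mapping each document id to its set of assigned terms.
index = {
    "d1": {"zipf", "frequency", "rank"},
    "d2": {"frequency", "poisson"},
}

# Exhaustivity of a document: how many terms it is indexed under.
exhaustivity = {doc: len(terms) for doc, terms in index.items()}

# Specificity of a term: the inverse of the number of documents
# assigned that term (shared by fewer documents -> more specific).
doc_freq = Counter(t for terms in index.values() for t in terms)
specificity = {term: 1 / df for term, df in doc_freq.items()}

print(exhaustivity)  # {'d1': 3, 'd2': 2}
print(specificity)   # 'frequency' shared by both docs -> 0.5; others 1.0
```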
Specificity/Exhaustivity Trade-Offs
Indexing Graph
Weighting the Index Relation
• Weight: the strength of the association between a keyword and a document, expressed as a single real number
Informative Signals vs. Noise Words
• The least informative words (noise words) occur uniformly across the corpus
– Ex) the
• Informative signals
– A measure used to weight the keyword-document relation
Hypothetical Word Distributions
[Figure: hypothetical distributions of a word's occurrences across documents; a perfectly uniform distribution rarely happens]
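One way to quantify this picture, though not the book's exact formula, is the entropy of a word's occurrence distribution over documents: a uniform spread has maximal entropy (pure noise), while a concentrated one has low entropy (a signal).

```python
import math

def distribution_entropy(doc_counts):
    """Entropy (in bits) of a word's occurrence counts across documents.
    Higher entropy = spread more uniformly = less informative."""
    total = sum(doc_counts)
    probs = [c / total for c in doc_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(distribution_entropy([5, 5, 5, 5]))   # uniform: 2.0 bits (maximal)
print(distribution_entropy([17, 1, 1, 1]))  # concentrated: about 0.85 bits
```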
Inverse Document Frequency
• Up to this point we have counted word occurrences; what we would really like to know is the number of documents containing a keyword ▷ IDF
• IDF (Inverse Document Frequency)
– The inverse of the number of documents, out of the whole corpus, in which keyword k appears
– A comparison in terms of documents, not just word occurrences
– High IDF: few documents contain keyword k; low IDF: many documents contain it
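A sketch using the log form introduced on the TF-IDF slide below (natural log here; the base only rescales the weights):

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: large when few documents
    contain the keyword, small when many do."""
    return math.log(n_docs / doc_freq)

print(idf(1000, 10))   # rare keyword   -> large IDF (about 4.6)
print(idf(1000, 900))  # common keyword -> small IDF (about 0.1)
```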
Vector Space
• Using vectors, we can measure how far, and in which direction, something lies from a reference point
• Introduced as a more mathematical approach to the question of how to compute the similarity between documents
[Figure: document 1, document 2, and the query plotted on the (information, retrieval) plane]
Consider, for simplicity, a two-dimensional plane based on two index terms, the (information, retrieval) coordinate system:
– Document 1: D1(0.8, 0.3)
– Document 2: D2(0.2, 0.7)
– Query “information retrieval”: Q(0.4, 0.8)
– Which document, 1 or 2, is closer to the query? (See the sketch below.)
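The slide does not fix a distance measure; taking cosine similarity, the usual vector-space choice, as an assumption, a sketch that answers the question for these coordinates:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

d1, d2, q = (0.8, 0.3), (0.2, 0.7), (0.4, 0.8)
print("sim(Q, D1) =", round(cosine(q, d1), 3))  # about 0.733
print("sim(Q, D2) =", round(cosine(q, d2), 3))  # about 0.983 -> D2 is closer
```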
• With 3 keywords
• The document closest to the query is D1
Calculating TF-IDF Weighting
• TF: term frequency
• IDF: inverse document frequency
• idf_k = log(N_doc / D_k)
• W_kd = F_kd × idf_k
• F_kd: the frequency with which keyword k occurs in document d
• N_doc: the total number of documents in the corpus
• D_k: the number of documents containing keyword k
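Putting the two formulas together, with hypothetical counts for a 1000-document corpus:

```python
import math

def tf_idf(f_kd: int, n_doc: int, d_k: int) -> float:
    """W_kd = F_kd * idf_k, where idf_k = log(N_doc / D_k)."""
    return f_kd * math.log(n_doc / d_k)

# A keyword occurring 5 times in a document is weighted heavily when
# only 20 documents contain it, and lightly when 800 do.
print(tf_idf(f_kd=5, n_doc=1000, d_k=20))   # 5 * log(50)   ≈ 19.56
print(tf_idf(f_kd=5, n_doc=1000, d_k=800))  # 5 * log(1.25) ≈ 1.12
```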
SMART Weighting Specification
[Table: SMART weighting options, including frequency, inverse, squared, and probabilistic collection-frequency components]
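The table itself did not survive extraction; as a hedged sketch only, here are common textbook definitions matching those four labels (SMART's exact specification may differ):

```python
import math

# Collection-frequency (document-frequency) weighting variants,
# written as functions of corpus size n and document frequency df.
def idf_inverse(n, df):        # inverse:       log(n / df)
    return math.log(n / df)

def idf_squared(n, df):        # squared:       log(n / df) ** 2
    return math.log(n / df) ** 2

def idf_probabilistic(n, df):  # probabilistic: log((n - df) / df)
    return math.log((n - df) / df)

def idf_frequency(n, df):      # frequency:     raw term counts, no
    return 1.0                 # collection-level reweighting
```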