A Survey on Text Categorization
Department of Industrial Engineering, Jin Jae-hoon (진재훈)
Contents
1. Text categorization
2. Machine learning approach for TC
   1. Document indexing: preprocessing, dimensionality reduction, weighting
   2. Inductive learning
   3. Evaluation
3. Summary of results
4. Appendix: an algorithm for suffix stripping
5. References
Text Categorization (1)
Automated assignment of natural-language texts to predefined categories (or classes) based on their content
- The categories are just symbolic labels
- No exogenous knowledge is used
Assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \dots, c_{|C|}\}$ is a set of predefined categories.

Approximate the unknown target function $\Phi : D \times C \to \{T, F\}$.
Text Categorization (2)
- Single-label vs. multilabel
- Category-pivoted vs. document-pivoted
- Hard (fully automated) vs. ranking (semiautomated)

The machine-learning pipeline:
1. Document indexing — different ways to understand what a term is; different ways to compute term weights (preprocessing, dimensionality reduction, weighting)
2. Inductive construction of text classifiers
3. Evaluation of text classifiers
Applications
- Automatic indexing for Boolean information retrieval systems
- Document organization
- Text filtering
- Word sense disambiguation
- Hierarchical categorization of web pages
- Automated survey coding
- Automated authorship attribution and genre classification
- Spam filtering
- Help-desk support
- Knowledge management
- Focused crawling
Topics
- Machine learning methods for text categorization
- Theoretical models of text categorization
- Hierarchical text categorization
- Text analysis and indexing methods for text categorization
- Dimensionality reduction for text categorization
- Evaluation issues in text categorization
- Applications of text categorization
- Automated categorization of Web pages and Web sites
- Text filtering and routing
- Topic detection and tracking
- Spoken text categorization
Database
- Reuters collection
- OHSUMED collection
- 20 Newsgroups collection
- AP collection
Problems & Issues
- Large number of attributes
- Large number of training samples
- Attribute dependency
- Multi-modality of categories
- Natural language properties: synonymy, ambiguity, skewed distributions
Data Representation (1)
Attribute-value representation of text
- By terms, e.g., term frequency TF(w_i, d): the number of times word w_i occurs in document d
- Word order is ignored ("bag of words"); cf. string kernels
- A small sketch follows
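A minimal sketch of the bag-of-words term-frequency representation described above; the whitespace tokenizer and the example sentence are illustrative assumptions, not something prescribed by the survey.

```python
from collections import Counter

def term_frequencies(document):
    """Bag-of-words representation: TF(w_i, d) = number of times
    word w_i occurs in document d; word order is discarded."""
    tokens = document.lower().split()   # naive whitespace tokenizer (assumption)
    return Counter(tokens)

doc = "the cat sat on the mat"
print(term_frequencies(doc))            # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```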
Data Representation (2)
- By phrases, derived syntactically or statistically; results so far are discouraging
- Superior semantic quality but inferior statistical quality
- More terms, more synonymous or nearly synonymous terms, lower consistency of assignment, and lower document frequency
Preprocessing
Need to extract content words:
- Case folding
- Stop-word removal: a, the, that, it, is, are, etc.
- Stemming: CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS ⇒ similar meanings
- Needs an algorithm for suffix stripping (see the appendix; a rough sketch follows)
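A rough preprocessing sketch covering the three steps above (case folding, stop-word removal, suffix stripping). The stop-word list is a tiny illustrative subset, and `crude_stem` is only a stand-in for a real suffix-stripping algorithm such as Porter's, described in the appendix.

```python
STOP_WORDS = {"a", "the", "that", "it", "is", "are"}   # tiny illustrative subset

def crude_stem(word):
    # Placeholder for a real suffix stripper (e.g., Porter):
    # chop a few common suffixes so CONNECTED/CONNECTING/CONNECTION collapse.
    for suffix in ("ions", "ing", "ion", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                      # case folding + naive tokenization
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The connection is connecting connected devices"))
# ['connect', 'connect', 'connect', 'device']
```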
Dimensionality reduction
Why? To reduce overfitting and to reduce the computational load.

By method:
- Feature selection, e.g., document frequency, mutual information, information gain, χ² statistic, ...
- Feature extraction, e.g., term clustering, LSI (Latent Semantic Indexing), ...

By scope:
- Local: a different feature set for each category
- Global: the same feature set for all categories

(Goal: to find the features whose distribution differs most between the positive and negative examples of a category.)
Document frequency
The number of documents in which each index term occurs (DF(t) = the number of documents in which term t appears); select the terms whose document frequency is ≥ a threshold.

Characteristics:
- Assumption: a term with lower frequency has less effect on performance
- Simple, with a small computational load
- But this goes against the widely accepted assumption that less frequent terms carry more information

(A small sketch follows.)
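A small sketch of document-frequency thresholding as just described; the toy corpus and the threshold value are assumptions.

```python
from collections import Counter

def select_by_document_frequency(docs, threshold):
    """docs: list of token lists. Keep the terms occurring in at least `threshold` documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # count each term once per document
    return {term for term, count in df.items() if count >= threshold}

corpus = [["text", "categorization", "survey"],
          ["text", "filtering"],
          ["spam", "filtering", "text"]]
print(select_by_document_frequency(corpus, threshold=2))   # {'text', 'filtering'}
```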
Mutual information
Definition:

$$I(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)} \approx \log \frac{A \cdot N}{(A + C)(A + B)}$$

using the 2×2 contingency table

        c     ~c
  t     A     B
  ~t    C     D

where A is the number of documents that contain t and belong to c, B the number that contain t but do not belong to c, C the number that belong to c but do not contain t, D the number that do neither, and N the total number of documents.

Category-specific scores are combined as

$$I_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, I(t, c_i), \qquad I_{max}(t) = \max_{i=1}^{m} I(t, c_i)$$

Characteristics:
- Assumption: a high occurrence of t in c gives high information about c
- Terms with lower P(t) receive higher scores

(I(t, c) is the amount of information that the occurrence of term t in a document provides for predicting whether the document's category is c. A small sketch follows.)
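A minimal sketch of the mutual-information score computed from the contingency-table counts above; the counts in the example are made up.

```python
from math import log

def mutual_information(A, B, C, N):
    """I(t, c) ~= log( A*N / ((A + C) * (A + B)) ), with A = docs containing t in c,
    B = docs containing t not in c, C = docs in c without t, N = total documents.
    Assumes A > 0 (otherwise the log is undefined)."""
    return log((A * N) / ((A + C) * (A + B)))

# Hypothetical counts: 40 of 1000 documents contain t and belong to c.
print(mutual_information(A=40, B=60, C=160, N=1000))   # ~0.69 -> t and c are positively associated
```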
Information gain
Characteristics:
- Considers not only the occurrence but also the non-occurrence of the term in each category
- Generally better performance than MI

$$\mathrm{Gain}(t) = \mathrm{Entropy}(S) - \mathrm{Expected\ Entropy}(S_t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$

(Gain(t) is the amount by which entropy is reduced when term t is selected as a feature. A small sketch follows.)
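A minimal sketch of the information-gain formula above; the probabilities in the example are illustrative and would normally be estimated from training-set counts.

```python
from math import log

def information_gain(p_c, p_t, p_c_given_t, p_c_given_not_t):
    """Gain(t) = entropy of the categories minus the expected entropy
    after observing whether term t occurs (all probabilities supplied)."""
    entropy = -sum(p * log(p) for p in p_c if p > 0)
    cond_t = -sum(p * log(p) for p in p_c_given_t if p > 0)
    cond_not_t = -sum(p * log(p) for p in p_c_given_not_t if p > 0)
    return entropy - (p_t * cond_t + (1 - p_t) * cond_not_t)

# Hypothetical two-category example: t occurs in 30% of documents.
print(information_gain(p_c=[0.5, 0.5], p_t=0.3,
                       p_c_given_t=[0.9, 0.1], p_c_given_not_t=[0.33, 0.67]))   # ~0.15
```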
Χ2 statistics
Characteristic: if t and c are independent, the value is 0.

$$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

using the same contingency table as for mutual information (N = A + B + C + D):

        c     ~c
  t     A     B
  ~t    C     D

$$\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, \chi^2(t, c_i), \qquad \chi^2_{max}(t) = \max_{i=1}^{m} \chi^2(t, c_i)$$

(χ²(t, c) is a test statistic that measures the difference between the expected and the observed distribution of a term across categories. A small sketch follows.)
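A minimal sketch of the χ² score computed from the same contingency table; the counts are again illustrative.

```python
def chi_square(A, B, C, D):
    """chi^2(t, c) = N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D)).
    Zero when t and c are independent."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

print(chi_square(A=40, B=60, C=160, D=740))   # ~27.8 -> t and c are clearly dependent
```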
Others: NGL, OR, GSS.
Reported ranking by effectiveness: 1st: NGL, OR, GSS; 2nd: χ², IG; 3rd: MI.
Term Clustering
Term clustering tries to group words with a high degree of pairwise semantic relatedness, so that the groups (or their centroids, or a representative of them) may be used instead of the terms as dimensions of the vector space.
- Little effectiveness loss even at high aggressivity (strong reduction)
- Even showed some effectiveness improvement at less aggressive levels of reduction

(In short: group words with high semantic similarity and use their centroids, or another representative, as features.)
Latent Semantic Indexing
Compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions, by looking at their patterns of co-occurrence.
Infers the dependence among the original terms from a corpus and "wires" this dependence into the newly obtained, independent dimensions.

(In short: combine the original features according to their co-occurrence patterns so that the newly created features are independent. A rough sketch follows.)
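A minimal LSI sketch using a truncated SVD of a toy term-document matrix (NumPy assumed); the matrix values and the number of latent dimensions k are illustrative.

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents); values are term weights.
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])

k = 2                                          # number of latent dimensions kept (assumption)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T      # each document compressed to k latent dimensions

print(doc_vectors.round(2))                    # co-occurring terms collapse onto shared dimensions
```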
Weighting
- Boolean weighting
- Word frequency weighting
- TF-IDF weighting
- Entropy weighting
- Inverse category frequency weighting
Boolean weighting
- Low computational load
- α_ik: the weight of the k-th feature in the i-th document
- α_ik = 1 if the term appears in the document, 0 otherwise
Word frequency weighting
- The weight is the number of times the term appears in the document
- Used by probabilistic models, e.g., the naïve Bayes classifier
TF-IDF weighting
- Term frequency × inverse document frequency
- Widely used
- But document length affects word frequency → normalize by document length

(Measures how frequently the term appears in a document and how well it separates documents. A hedged sketch follows.)
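A hedged sketch of one common TF-IDF variant with simple length normalization; the survey does not fix a specific formula, so the exact form here is an assumption.

```python
from math import log

def tfidf(tf, df, n_docs, doc_len):
    """One common TF-IDF form with length normalization:
    (tf / doc_len) * log(N / df). Many variants exist."""
    return (tf / doc_len) * log(n_docs / df)

# A term occurring 3 times in a 100-word document and in 10 of 1000 documents.
print(tfidf(tf=3, df=10, n_docs=1000, doc_len=100))   # ~0.138
```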
Entropy weighting
- Log-based frequency × (1 − feature entropy)
- Feature entropy term:
  - Same distribution over all documents → 1
  - Occurs in only one document → 0

(Measures how frequently the term appears in a document and how much it reduces entropy.)
ICF weighting
- Inverse category frequency
- Focuses on the separation between categories

(Measures how frequently the term appears in a document and how well it separates categories.)
Classifier Building Approaches
Building a classifier:
1. Define a CSV (Categorization Status Value) function that returns a score in [0, 1], the degree to which d belongs to c
2. Define a threshold, either analytically or experimentally

Methods: naïve Bayes classifier, decision tree, regression, Rocchio, neural network, k-nearest neighbor, SVM
Naïve Bayes’ Classifier
Naïve?
- Assumptions & limits: independence between words; unimodal density
- Improvements: relax the binary-value restriction; introduce document length normalization; relax the independence assumption

(P(c_i | d_j): the probability that a document represented by d_j belongs to c_i. A hedged sketch follows.)
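A minimal multinomial naïve Bayes sketch with Laplace smoothing; the toy training set and token lists are illustrative, and this is only one of several naïve Bayes variants.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (tokens, label). Returns log priors and smoothed
    log likelihoods for a multinomial naive Bayes model."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + len(vocab)          # Laplace smoothing
        log_like[c] = {w: log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab

def classify_nb(tokens, log_prior, log_like, vocab):
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

train = [(["cheap", "pills", "buy"], "spam"), (["meeting", "agenda"], "ham"),
         (["buy", "now"], "spam"), (["project", "meeting", "notes"], "ham")]
prior, like, vocab = train_nb(train)
print(classify_nb(["buy", "cheap", "now"], prior, like, vocab))    # spam
```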
Decision Tree
Limit: if the number of samples is relatively small with respect to the number of distinguishing words, overfitting occurs.

(Which feature should be selected at each node? ID3 and C4.5 select the feature that gives the largest information gain.)
Rocchio
- β and γ are parameters that adjust the impact of positive and negative training examples
- To classify a new document d, the cosine between the category profile w and d is computed; an appropriate threshold on the cosine leads to a binary classification rule
- Generally β > γ (e.g., β = 16, γ = 4)

(Classification is based on the similarity (distance) to the centroid of the positive or negative examples. A hedged sketch follows.)
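A minimal Rocchio sketch: build the category profile from the means of the positive and negative example vectors and classify by cosine. The toy vectors, β = 16, γ = 4, and the 0.5 threshold are assumptions, not values from the survey.

```python
import numpy as np

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """Category profile w = beta * mean(positive vectors) - gamma * mean(negative vectors)."""
    return beta * np.mean(pos, axis=0) - gamma * np.mean(neg, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pos = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 1.0]])   # toy weighted vectors of positive examples
neg = np.array([[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
w = rocchio_profile(pos, neg)

d_new = np.array([1.0, 0.0, 1.0])
print(cosine(w, d_new) >= 0.5)    # True -> assign to the category (threshold is an assumption)
```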
K-Nearest Neighbor
- The cosine is used as the similarity metric
- kNN(d) denotes the indexes of the k training documents that have the highest cosine with the document d to be classified
- Lazy learning: most of the computational load falls on classification time

$$CSV_i(d) = \sum_{d_j \in kNN(d)} sim(d, d_j)\, y(d_j, c_i)$$

where y(d_j, c_i) is 1 if training document d_j belongs to c_i and 0 otherwise.

(Classification is based on the k most similar positive examples. A small sketch follows.)
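A minimal sketch of the kNN categorization status value above; the toy vectors and k are assumptions.

```python
import numpy as np

def knn_csv(d, docs, labels, k=3):
    """CSV_i(d) = sum over the k nearest training documents of
    cosine(d, d_j) * y(d_j, c_i), where y is 1 if d_j belongs to c_i else 0."""
    sims = docs @ d / (np.linalg.norm(docs, axis=1) * np.linalg.norm(d))
    nearest = np.argsort(sims)[-k:]                       # indexes of the k highest cosines
    return sum(sims[j] for j in nearest if labels[j] == 1)

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = [1, 1, 0, 0]                                     # 1 = positive example of the category
print(knn_csv(np.array([0.8, 0.2]), docs, labels, k=3))   # high score -> classify as positive
```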
Using negative evidence
No outstanding result: negative examples are naturally far away, so they have little influence.
WAKNN Classification
Weight-adjusted k-nearest neighbor
- WAKNN-F: reduces the number of words used
- WAKNN-C: reduces the cost of evaluation
Rocchio & K-NN
Rocchio: fast classification but low expressive power.
k-NN: slow classification but high expressive power.
SVM
Advantages for text categorization:
- High-dimensional input space
- Few irrelevant features
- Document vectors are sparse
- Most text categorization problems are linearly separable

(A hedged example follows.)
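A hedged example of a linear SVM text classifier, assuming scikit-learn is available (TfidfVectorizer and LinearSVC); the toy corpus and labels are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["cheap pills buy now", "meeting agenda attached",
              "buy cheap meds", "project meeting notes"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = TfidfVectorizer()                 # high-dimensional, sparse document vectors
X = vectorizer.fit_transform(train_docs)
clf = LinearSVC().fit(X, train_labels)         # a linear separator is usually enough for TC

print(clf.predict(vectorizer.transform(["buy now cheap"])))   # expected: ['spam'] on this toy data
```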
Transductive Boosting
The margin in boosting is the sum of the distances between each training example and the optimal classification boundary; boosting tries to maximize the average of all margins.
The transductive boosting method may be particularly appropriate when we do not know the ratio of positive and negative examples in the test data.
Evaluation & Measure
Evaluation of TC systems is experimental rather than analytical:
- Analytical evaluation is difficult due to the subjective nature of the task
- Experimental evaluation aims at measuring classifier effectiveness, i.e., its ability to make correct classification decisions for the largest possible number of documents
Precision & Recall
Precision (π) with respect to category c_i may be defined as the conditional probability $\pi_i = P(\Phi(d_x, c_i) = T \mid \hat{\Phi}(d_x, c_i) = T)$: the probability that if a document d_x has been classified under c_i, this decision is correct.
Analogously, recall (ρ) may be defined as $\rho_i = P(\hat{\Phi}(d_x, c_i) = T \mid \Phi(d_x, c_i) = T)$: the probability that if a random document ought to be filed under c_i, it will be classified as such.
Here Φ denotes the true assignment and Φ̂ the classifier's decision.

(Precision: the proportion of selected documents that are actually correct. Recall: the proportion of correct documents that were selected.)
Estimating Precision & Recall
Effectiveness averaging
Two different methods may be used to compute global values for π and ρ.

Macroaveraging: compute the local scores π_i and ρ_i for each category and then average them:

$$\pi^M = \frac{\sum_{i=1}^{|C|} \pi_i}{|C|}, \qquad \rho^M = \frac{\sum_{i=1}^{|C|} \rho_i}{|C|}$$

(Computed per category and then averaged; this gives every category equal weight.)
Microaveraging
Microaveraging sums TP_i, FP_i, and FN_i over the categories and computes precision and recall from the totals:

$$\pi^\mu = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad \rho^\mu = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$$

(This gives every document equal weight. A small sketch comparing the two averages follows.)
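A small sketch contrasting macro- and micro-averaged precision and recall from per-category TP/FP/FN counts; the counts are illustrative.

```python
def macro_micro(per_category):
    """per_category: list of (TP, FP, FN) tuples, one per category.
    Returns ((macro precision, macro recall), (micro precision, micro recall))."""
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in per_category]
    recalls    = [tp / (tp + fn) if tp + fn else 0.0 for tp, _, fn in per_category]
    macro = (sum(precisions) / len(per_category), sum(recalls) / len(per_category))
    TP = sum(tp for tp, _, _ in per_category)
    FP = sum(fp for _, fp, _ in per_category)
    FN = sum(fn for _, _, fn in per_category)
    micro = (TP / (TP + FP), TP / (TP + FN))
    return macro, micro

# Two categories, one large and one small: the macro and micro averages diverge.
print(macro_micro([(90, 10, 10), (1, 4, 9)]))
```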
Combining precision and recall
- Neither π nor ρ makes much sense in isolation
- Classifiers can be tuned to maximize one at the expense of the other
- TC evaluation is therefore done with measures that combine π and ρ
- We will examine two such measures: the breakeven point and the F functions
Single Value Measure
Trade-off between precision and recall:
- Breakeven point: the value at which π equals ρ (or, when they are never exactly equal, the average of the two values where they are closest)
- F-measure: the harmonic mean of precision and recall,

$$F_1 = \frac{2\,\pi\,\rho}{\pi + \rho}$$
Other measures
Since one knows (by experimentation) which documents fall into TP, FP, TN, and FN, one may also estimate accuracy (A) and error (E):

$$A = \frac{TP + TN}{TP + TN + FP + FN}, \qquad E = \frac{FP + FN}{TP + TN + FP + FN} = 1 - A$$

These measures, however, are not widely used in TC, because they are less sensitive to variations in the number of correct decisions than π and ρ.
Fallout
A less frequently used measure is fallout: it measures the proportion of non-targeted items that were mistakenly selected, i.e., fallout = FP / (FP + TN).
In certain fields, recall-fallout trade-offs are more common than precision-recall ones.

(Among the items that should have been rejected, the proportion judged correct; depending on the importance of the items, this can sometimes be the meaningful measure.)
Alternatives to effectiveness
Efficiency is often used as an additional criterion in TC evaluation; it may be measured with respect to training or to classification.
The utility criterion, from decision theory, is sometimes used. An obvious example is e-mail spam filtering, where failing to discard a piece of spam (a FN) is less serious than discarding a legitimate message (a FP).
Summary of results (1)
Summary of results (2)
With respect to the TC collection on which those classifiers were tested [Sebastiani, 2002]:
- SVMs, boosting-based ensembles, example-based, and regression methods appear to deliver the best performance
- Neural nets and on-line classifiers perform slightly worse
- Naïve Bayes and Rocchio performed poorly
- Although decision trees were not tested on a sufficient number of corpora, results seem encouraging
An algorithm for suffix stripping (1)
Two points:
1. The suffixes are being removed simply to improve IR performance, not as a linguistic exercise (CONNECTION and CONNECTIONS; RELATE and RELATIVITY)
2. The success rate of the suffix stripping will be significantly less than 100% (SAND and SANDER; WAND and WANDER)
An algorithm for suffix stripping (2)
Consonants and vowels:
- A sequence of consonants ccc... is denoted C; a sequence of vowels vvv... is denoted V
- Any word then has one of the four forms CVCV...C, CVCV...V, VCVC...C, VCVC...V, which can be written [C]VCVC...[V], i.e., [C](VC)^m[V]
- m = 0: TR, EE, TREE
- m = 1: TROUBLE, OATS
- m = 2: TROUBLES, OATEN

(A short sketch of computing m follows.)
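A minimal sketch of computing Porter's measure m for a word, following the [C](VC)^m[V] description above; it implements only the consonant/vowel classification and the VC count, not the full stemmer.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel
    when it follows a consonant, otherwise a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """m in the form [C](VC)^m[V]: the number of V-to-C transitions."""
    pattern = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    m, prev = 0, None
    for ch in pattern:
        if prev == "V" and ch == "C":
            m += 1
        prev = ch
    return m

for w in ("tree", "trouble", "troubles"):
    print(w, measure(w))   # tree 0, trouble 1, troubles 2
```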
An algorithm for suffix stripping (3)
Rules have the form (condition) S1 → S2.

Conditions:
- *S: the stem ends with S (similarly for other letters)
- *v*: the stem contains a vowel
- *d: the stem ends with a double consonant
- *o: the stem ends cvc, where the second c is not W, X, or Y

Conditions may be combined, e.g., (m > 1 and (*S or *T)), or (*d and not (*L or *S or *Z)).
An algorithm for suffix stripping (4)
Example rules:
- Step 1a: SSES → SS (caresses → caress); IES → I (ponies → poni)
- Step 1b: (m > 0) EED → EE (feed → feed, agreed → agree)
- Step 4: (m > 1) AL → "" (revival → reviv); (m > 1) ANCE → "" (allowance → allow)
- Step 5b: (m > 1 and *d and *L) → single letter (controll → control, roll → roll)
Reference
- Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, 2002
- Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998
- Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar, Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification, 2001
- David D. Lewis and Marc Ringuette, A Comparison of Two Learning Algorithms for Text Categorization, 1994
- Luigi Galavotti, Fabrizio Sebastiani, and Maria Simi, Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization
- Hirotoshi Taira and Masahiko Haruno, Text Categorization Using Transductive Boosting, 2001
- Sally Jo Cunningham, James Littin, and Ian H. Witten, Applications of Machine Learning in Information Retrieval, 2000
- M. F. Porter, An Algorithm for Suffix Stripping, 1980
Reference (WWW)
http://www.cs.cornell.edu/People/tj/
http://faure.iei.pi.cnr.it/~fabrizio/