A Survey on Text Categorization
Department of Industrial Engineering, Jin Jae-hoon (진재훈)
Contents
1. Text categorization
2. Machine learning approach for TC
   1. Document indexing: preprocessing, dimensionality reduction, weighting
   2. Inductive learning
   3. Evaluation
3. Summary of results
4. Appendix: an algorithm for suffix stripping
5. References
Text Categorization (1)
Automated assignment of natural-language texts to predefined categories (or classes) based on their content
- The categories are just symbolic labels
- No exogenous knowledge is used
Assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \dots, c_{|C|}\}$ is a set of predefined categories.

Approximate the unknown target function $\Phi : D \times C \to \{T, F\}$.
Text Categorization (2)
- Single-label vs. multilabel
- Category-pivoted vs. document-pivoted
- Hard (fully automated) vs. ranking (semiautomated)

The machine-learning pipeline:
1. Document indexing — different ways to understand what a term is; different ways to compute term weights (preprocessing, dimensionality reduction, weighting)
2. Inductive construction of text classifiers
3. Evaluation of text classifiers
Applications
- Automatic indexing for Boolean information retrieval systems
- Document organization
- Text filtering
- Word sense disambiguation
- Hierarchical categorization of web pages
- Automated survey coding
- Automated authorship attribution and genre classification
- Spam filtering
- Help-desk support
- Knowledge management
- Focused crawling
Topics
- Machine learning methods for text categorization
- Theoretical models of text categorization
- Hierarchical text categorization
- Text analysis and indexing methods for text categorization
- Dimensionality reduction for text categorization
- Evaluation issues in text categorization
- Applications of text categorization
- Automated categorization of Web pages and Web sites
- Text filtering and routing
- Topic detection and tracking
- Spoken text categorization
Database
- Reuters collection
- OHSUMED collection
- 20 Newsgroups collection
- AP collection
Problems & Issues
- Large number of attributes
- Large number of training samples
- Attribute dependency
- Multi-modality of categories
- Natural language properties: synonymy, ambiguity, skewed distributions
Data Representation (1)
Attribute-value representation of text
- By terms, e.g., term frequency TF(w_i, d): the number of times word w_i occurs in document d
- Word order is ignored ("bag of words"); cf. string kernels
- A small sketch follows
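A minimal sketch of the bag-of-words term-frequency representation described above; the whitespace tokenizer and the example sentence are illustrative assumptions, not something prescribed by the survey.

```python
from collections import Counter

def term_frequencies(document):
    """Bag-of-words representation: TF(w_i, d) = number of times
    word w_i occurs in document d; word order is discarded."""
    tokens = document.lower().split()   # naive whitespace tokenizer (assumption)
    return Counter(tokens)

doc = "the cat sat on the mat"
print(term_frequencies(doc))            # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```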
Data Representation (2)
- By phrases, derived syntactically or statistically; results so far are discouraging
- Superior semantic quality but inferior statistical quality
- More terms, more synonymous or nearly synonymous terms, lower consistency of assignment, and lower document frequency
Preprocessing
Need to extract content words:
- Case folding
- Stop-word removal: a, the, that, it, is, are, etc.
- Stemming: CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS ⇒ similar meanings
- Needs an algorithm for suffix stripping (see the appendix; a rough sketch follows)
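A rough preprocessing sketch covering the three steps above (case folding, stop-word removal, suffix stripping). The stop-word list is a tiny illustrative subset, and `crude_stem` is only a stand-in for a real suffix-stripping algorithm such as Porter's, described in the appendix.

```python
STOP_WORDS = {"a", "the", "that", "it", "is", "are"}   # tiny illustrative subset

def crude_stem(word):
    # Placeholder for a real suffix stripper (e.g., Porter):
    # chop a few common suffixes so CONNECTED/CONNECTING/CONNECTION collapse.
    for suffix in ("ions", "ing", "ion", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                      # case folding + naive tokenization
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The connection is connecting connected devices"))
# ['connect', 'connect', 'connect', 'device']
```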
Dimensionality reduction
Why? To reduce overfitting and to reduce the computational load.

By method:
- Feature selection, e.g., document frequency, mutual information, information gain, χ² statistic, ...
- Feature extraction, e.g., term clustering, LSI (Latent Semantic Indexing), ...

By scope:
- Local: a different feature set for each category
- Global: the same feature set for all categories

(Goal: to find the features whose distribution differs most between the positive and negative examples of a category.)
Document frequency
The number of documents in which each index term occurs (DF(t) = the number of documents in which term t appears); select the terms whose document frequency is ≥ a threshold.

Characteristics:
- Assumption: a term with lower frequency has less effect on performance
- Simple, with a small computational load
- But this goes against the widely accepted assumption that less frequent terms carry more information

(A small sketch follows.)
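A small sketch of document-frequency thresholding as just described; the toy corpus and the threshold value are assumptions.

```python
from collections import Counter

def select_by_document_frequency(docs, threshold):
    """docs: list of token lists. Keep the terms occurring in at least `threshold` documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # count each term once per document
    return {term for term, count in df.items() if count >= threshold}

corpus = [["text", "categorization", "survey"],
          ["text", "filtering"],
          ["spam", "filtering", "text"]]
print(select_by_document_frequency(corpus, threshold=2))   # {'text', 'filtering'}
```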
Mutual information
Definition:

$$I(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)} \approx \log \frac{A \cdot N}{(A + C)(A + B)}$$

using the 2×2 contingency table

        c     ~c
  t     A     B
  ~t    C     D

where A is the number of documents that contain t and belong to c, B the number that contain t but do not belong to c, C the number that belong to c but do not contain t, D the number that do neither, and N the total number of documents.

Category-specific scores are combined as

$$I_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, I(t, c_i), \qquad I_{max}(t) = \max_{i=1}^{m} I(t, c_i)$$

Characteristics:
- Assumption: a high occurrence of t in c gives high information about c
- Terms with lower P(t) receive higher scores

(I(t, c) is the amount of information that the occurrence of term t in a document provides for predicting whether the document's category is c. A small sketch follows.)
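A minimal sketch of the mutual-information score computed from the contingency-table counts above; the counts in the example are made up.

```python
from math import log

def mutual_information(A, B, C, N):
    """I(t, c) ~= log( A*N / ((A + C) * (A + B)) ), with A = docs containing t in c,
    B = docs containing t not in c, C = docs in c without t, N = total documents.
    Assumes A > 0 (otherwise the log is undefined)."""
    return log((A * N) / ((A + C) * (A + B)))

# Hypothetical counts: 40 of 1000 documents contain t and belong to c.
print(mutual_information(A=40, B=60, C=160, N=1000))   # ~0.69 -> t and c are positively associated
```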
Information gain
Characteristics:
- Considers not only the occurrence but also the non-occurrence of the term in each category
- Generally better performance than MI

$$\mathrm{Gain}(t) = \mathrm{Entropy}(S) - \mathrm{Expected\ Entropy}(S_t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$

(Gain(t) is the amount by which entropy is reduced when term t is selected as a feature. A small sketch follows.)
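A minimal sketch of the information-gain formula above; the probabilities in the example are illustrative and would normally be estimated from training-set counts.

```python
from math import log

def information_gain(p_c, p_t, p_c_given_t, p_c_given_not_t):
    """Gain(t) = entropy of the categories minus the expected entropy
    after observing whether term t occurs (all probabilities supplied)."""
    entropy = -sum(p * log(p) for p in p_c if p > 0)
    cond_t = -sum(p * log(p) for p in p_c_given_t if p > 0)
    cond_not_t = -sum(p * log(p) for p in p_c_given_not_t if p > 0)
    return entropy - (p_t * cond_t + (1 - p_t) * cond_not_t)

# Hypothetical two-category example: t occurs in 30% of documents.
print(information_gain(p_c=[0.5, 0.5], p_t=0.3,
                       p_c_given_t=[0.9, 0.1], p_c_given_not_t=[0.33, 0.67]))   # ~0.15
```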
Χ2 statistics
Characteristic: if t and c are independent, the value is 0.

$$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

using the same contingency table as for mutual information (N = A + B + C + D):

        c     ~c
  t     A     B
  ~t    C     D

$$\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, \chi^2(t, c_i), \qquad \chi^2_{max}(t) = \max_{i=1}^{m} \chi^2(t, c_i)$$

(χ²(t, c) is a test statistic that measures the difference between the expected and the observed distribution of a term across categories. A small sketch follows.)
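A minimal sketch of the χ² score computed from the same contingency table; the counts are again illustrative.

```python
def chi_square(A, B, C, D):
    """chi^2(t, c) = N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D)).
    Zero when t and c are independent."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

print(chi_square(A=40, B=60, C=160, D=740))   # ~27.8 -> t and c are clearly dependent
```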
Others: NGL, OR, GSS.
Reported ranking by effectiveness: 1st: NGL, OR, GSS; 2nd: χ², IG; 3rd: MI.
Term Clustering
Term clustering tries to group words with a high degree of pairwise semantic relatedness, so that the groups (or their centroids, or a representative of them) may be used instead of the terms as dimensions of the vector space.
- Little effectiveness loss even at high aggressivity (strong reduction)
- Even showed some effectiveness improvement at less aggressive levels of reduction

(In short: group words with high semantic similarity and use their centroids, or another representative, as features.)
Latent Semantic Indexing
Compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions, by looking at their patterns of co-occurrence.
Infers the dependence among the original terms from a corpus and "wires" this dependence into the newly obtained, independent dimensions.

(In short: combine the original features according to their co-occurrence patterns so that the newly created features are independent. A rough sketch follows.)
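A minimal LSI sketch using a truncated SVD of a toy term-document matrix (NumPy assumed); the matrix values and the number of latent dimensions k are illustrative.

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents); values are term weights.
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])

k = 2                                          # number of latent dimensions kept (assumption)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T      # each document compressed to k latent dimensions

print(doc_vectors.round(2))                    # co-occurring terms collapse onto shared dimensions
```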
Weighting
- Boolean weighting
- Word frequency weighting
- TF-IDF weighting
- Entropy weighting
- Inverse category frequency weighting
Boolean weighting
- Low computational load
- α_ik: the weight of the k-th feature in the i-th document
- α_ik = 1 if the term appears in the document, 0 otherwise
Word frequency weighting
- The weight is the number of times the term appears in the document
- Used by probabilistic models, e.g., the naïve Bayes classifier
TF-IDF weighting
- Term frequency × inverse document frequency
- Widely used
- But document length affects word frequency → normalize by document length

(Measures how frequently the term appears in a document and how well it separates documents. A hedged sketch follows.)
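A hedged sketch of one common TF-IDF variant with simple length normalization; the survey does not fix a specific formula, so the exact form here is an assumption.

```python
from math import log

def tfidf(tf, df, n_docs, doc_len):
    """One common TF-IDF form with length normalization:
    (tf / doc_len) * log(N / df). Many variants exist."""
    return (tf / doc_len) * log(n_docs / df)

# A term occurring 3 times in a 100-word document and in 10 of 1000 documents.
print(tfidf(tf=3, df=10, n_docs=1000, doc_len=100))   # ~0.138
```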
Entropy weighting
- Log-based frequency × (1 − feature entropy)
- Feature entropy term:
  - Same distribution over all documents → 1
  - Occurs in only one document → 0

(Measures how frequently the term appears in a document and how much it reduces entropy.)
ICF weighting
- Inverse category frequency
- Focuses on the separation between categories

(Measures how frequently the term appears in a document and how well it separates categories.)
Classifier Building Approaches
Building a classifier:
1. Define a CSV (Categorization Status Value) function that returns a score in [0, 1], the degree to which d belongs to c
2. Define a threshold, either analytically or experimentally

Methods: naïve Bayes classifier, decision tree, regression, Rocchio, neural network, k-nearest neighbor, SVM
Naïve Bayes’ Classifier
Naïve?
- Assumptions & limits: independence between words; unimodal density
- Improvements: relax the binary-value restriction; introduce document length normalization; relax the independence assumption

(P(c_i | d_j): the probability that a document represented by d_j belongs to c_i. A hedged sketch follows.)
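A minimal multinomial naïve Bayes sketch with Laplace smoothing; the toy training set and token lists are illustrative, and this is only one of several naïve Bayes variants.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (tokens, label). Returns log priors and smoothed
    log likelihoods for a multinomial naive Bayes model."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + len(vocab)          # Laplace smoothing
        log_like[c] = {w: log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab

def classify_nb(tokens, log_prior, log_like, vocab):
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

train = [(["cheap", "pills", "buy"], "spam"), (["meeting", "agenda"], "ham"),
         (["buy", "now"], "spam"), (["project", "meeting", "notes"], "ham")]
prior, like, vocab = train_nb(train)
print(classify_nb(["buy", "cheap", "now"], prior, like, vocab))    # spam
```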
Decision Tree
Limit: if the number of samples is relatively small with respect to the number of distinguishing words, overfitting occurs.

(Which feature should be selected at each node? ID3 and C4.5 select the feature that gives the largest information gain.)
Rocchio
- β and γ are parameters that adjust the impact of positive and negative training examples
- To classify a new document d, the cosine between the category profile w and d is computed; an appropriate threshold on the cosine leads to a binary classification rule
- Generally β > γ (e.g., β = 16, γ = 4)

(Classification is based on the similarity (distance) to the centroid of the positive or negative examples. A hedged sketch follows.)
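A minimal Rocchio sketch: build the category profile from the means of the positive and negative example vectors and classify by cosine. The toy vectors, β = 16, γ = 4, and the 0.5 threshold are assumptions, not values from the survey.

```python
import numpy as np

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """Category profile w = beta * mean(positive vectors) - gamma * mean(negative vectors)."""
    return beta * np.mean(pos, axis=0) - gamma * np.mean(neg, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pos = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 1.0]])   # toy weighted vectors of positive examples
neg = np.array([[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
w = rocchio_profile(pos, neg)

d_new = np.array([1.0, 0.0, 1.0])
print(cosine(w, d_new) >= 0.5)    # True -> assign to the category (threshold is an assumption)
```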
K-Nearest Neighbor
- The cosine is used as the similarity metric
- kNN(d) denotes the indexes of the k training documents that have the highest cosine with the document d to be classified
- Lazy learning: most of the computational load falls on classification time

$$CSV_i(d) = \sum_{d_j \in kNN(d)} sim(d, d_j)\, y(d_j, c_i)$$

where y(d_j, c_i) is 1 if training document d_j belongs to c_i and 0 otherwise.

(Classification is based on the k most similar positive examples. A small sketch follows.)
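A minimal sketch of the kNN categorization status value above; the toy vectors and k are assumptions.

```python
import numpy as np

def knn_csv(d, docs, labels, k=3):
    """CSV_i(d) = sum over the k nearest training documents of
    cosine(d, d_j) * y(d_j, c_i), where y is 1 if d_j belongs to c_i else 0."""
    sims = docs @ d / (np.linalg.norm(docs, axis=1) * np.linalg.norm(d))
    nearest = np.argsort(sims)[-k:]                       # indexes of the k highest cosines
    return sum(sims[j] for j in nearest if labels[j] == 1)

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = [1, 1, 0, 0]                                     # 1 = positive example of the category
print(knn_csv(np.array([0.8, 0.2]), docs, labels, k=3))   # high score -> classify as positive
```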
Using negative evidence
No outstanding result: negative examples are naturally far away, so they have little influence.
WAKNN Classification
Weight-adjusted k-nearest neighbor
- WAKNN-F: reduces the number of words used
- WAKNN-C: reduces the cost of evaluation
Rocchio & K-NN
Rocchio: fast classification but low expressive power.
k-NN: slow classification but high expressive power.
SVM
Advantages for text categorization:
- High-dimensional input space
- Few irrelevant features
- Document vectors are sparse
- Most text categorization problems are linearly separable

(A hedged example follows.)
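A hedged example of a linear SVM text classifier, assuming scikit-learn is available (TfidfVectorizer and LinearSVC); the toy corpus and labels are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["cheap pills buy now", "meeting agenda attached",
              "buy cheap meds", "project meeting notes"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = TfidfVectorizer()                 # high-dimensional, sparse document vectors
X = vectorizer.fit_transform(train_docs)
clf = LinearSVC().fit(X, train_labels)         # a linear separator is usually enough for TC

print(clf.predict(vectorizer.transform(["buy now cheap"])))   # expected: ['spam'] on this toy data
```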
Transductive Boosting
The margin in boosting is the sum of the distances between each training example and the optimal classification boundary; boosting tries to maximize the average of all margins.
The transductive boosting method may be particularly appropriate when we do not know the ratio of positive and negative examples in the test data.
Evaluation & Measure
Evaluation of TC systems is experimental rather than analytical:
- Analytical evaluation is difficult due to the subjective nature of the task
- Experimental evaluation aims at measuring classifier effectiveness, i.e., its ability to make correct classification decisions for the largest possible number of documents
Precision & Recall
Precision (π) with respect to category c_i may be defined as the conditional probability $\pi_i = P(\Phi(d_x, c_i) = T \mid \hat{\Phi}(d_x, c_i) = T)$: the probability that if a document d_x has been classified under c_i, this decision is correct.
Analogously, recall (ρ) may be defined as $\rho_i = P(\hat{\Phi}(d_x, c_i) = T \mid \Phi(d_x, c_i) = T)$: the probability that if a random document ought to be filed under c_i, it will be classified as such.
Here Φ denotes the true assignment and Φ̂ the classifier's decision.

(Precision: the proportion of selected documents that are actually correct. Recall: the proportion of correct documents that were selected.)
Estimating Precision & Recall
Effectiveness averaging
Two different methods may be used to compute global values for π and ρ.

Macroaveraging: compute the local scores π_i and ρ_i for each category and then average them:

$$\pi^M = \frac{\sum_{i=1}^{|C|} \pi_i}{|C|}, \qquad \rho^M = \frac{\sum_{i=1}^{|C|} \rho_i}{|C|}$$

(Computed per category and then averaged; this gives every category equal weight.)
Microaveraging
Microaveraging sums TP_i, FP_i, and FN_i over the categories and computes precision and recall from the totals:

$$\pi^\mu = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad \rho^\mu = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$$

(This gives every document equal weight. A small sketch comparing the two averages follows.)
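A small sketch contrasting macro- and micro-averaged precision and recall from per-category TP/FP/FN counts; the counts are illustrative.

```python
def macro_micro(per_category):
    """per_category: list of (TP, FP, FN) tuples, one per category.
    Returns ((macro precision, macro recall), (micro precision, micro recall))."""
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in per_category]
    recalls    = [tp / (tp + fn) if tp + fn else 0.0 for tp, _, fn in per_category]
    macro = (sum(precisions) / len(per_category), sum(recalls) / len(per_category))
    TP = sum(tp for tp, _, _ in per_category)
    FP = sum(fp for _, fp, _ in per_category)
    FN = sum(fn for _, _, fn in per_category)
    micro = (TP / (TP + FP), TP / (TP + FN))
    return macro, micro

# Two categories, one large and one small: the macro and micro averages diverge.
print(macro_micro([(90, 10, 10), (1, 4, 9)]))
```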
Combining precision and recall
- Neither π nor ρ makes much sense in isolation
- Classifiers can be tuned to maximize one at the expense of the other
- TC evaluation is therefore done with measures that combine π and ρ
- We will examine two such measures: the breakeven point and the F functions
Single Value Measure
Trade-off between precision and recall:
- Breakeven point: the value at which π equals ρ (or, when they are never exactly equal, the average of the two values where they are closest)
- F-measure: the harmonic mean of precision and recall,

$$F_1 = \frac{2\,\pi\,\rho}{\pi + \rho}$$
Other measures
Since one knows (by experimentation) which documents fall into TP, FP, TN, and FN, one may also estimate accuracy (A) and error (E):

$$A = \frac{TP + TN}{TP + TN + FP + FN}, \qquad E = \frac{FP + FN}{TP + TN + FP + FN} = 1 - A$$

These measures, however, are not widely used in TC, because they are less sensitive to variations in the number of correct decisions than π and ρ.
Fallout
A less frequently used measure is fallout: it measures the proportion of non-targeted items that were mistakenly selected, i.e., fallout = FP / (FP + TN).
In certain fields, recall-fallout trade-offs are more common than precision-recall ones.

(Among the items that should have been rejected, the proportion judged correct; depending on the importance of the items, this can sometimes be the meaningful measure.)
Alternatives to effectiveness
Efficiency is often used as an additional criterion in TC evaluation; it may be measured with respect to training or to classification.
The utility criterion, from decision theory, is sometimes used. An obvious example is e-mail spam filtering, where failing to discard a piece of spam (a FN) is less serious than discarding a legitimate message (a FP).
Summary of results (1)
Summary of results (2)
With respect to the TC collection on which those classifiers were tested [Sebastiani, 2002]:
- SVMs, boosting-based ensembles, example-based, and regression methods appear to deliver the best performance
- Neural nets and on-line classifiers perform slightly worse
- Naïve Bayes and Rocchio performed poorly
- Although decision trees were not tested on a sufficient number of corpora, results seem encouraging
An algorithm for suffix stripping (1)
Two points:
1. The suffixes are being removed simply to improve IR performance, not as a linguistic exercise (CONNECTION and CONNECTIONS; RELATE and RELATIVITY)
2. The success rate of the suffix stripping will be significantly less than 100% (SAND and SANDER; WAND and WANDER)
An algorithm for suffix stripping (2)
Consonants and vowels:
- A sequence of consonants ccc... is denoted C; a sequence of vowels vvv... is denoted V
- Any word then has one of the four forms CVCV...C, CVCV...V, VCVC...C, VCVC...V, which can be written [C]VCVC...[V], i.e., [C](VC)^m[V]
- m = 0: TR, EE, TREE
- m = 1: TROUBLE, OATS
- m = 2: TROUBLES, OATEN

(A short sketch of computing m follows.)
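A minimal sketch of computing Porter's measure m for a word, following the [C](VC)^m[V] description above; it implements only the consonant/vowel classification and the VC count, not the full stemmer.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel
    when it follows a consonant, otherwise a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """m in the form [C](VC)^m[V]: the number of V-to-C transitions."""
    pattern = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    m, prev = 0, None
    for ch in pattern:
        if prev == "V" and ch == "C":
            m += 1
        prev = ch
    return m

for w in ("tree", "trouble", "troubles"):
    print(w, measure(w))   # tree 0, trouble 1, troubles 2
```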
An algorithm for suffix stripping (3)
Rules have the form (condition) S1 → S2.

Conditions:
- *S: the stem ends with S (similarly for other letters)
- *v*: the stem contains a vowel
- *d: the stem ends with a double consonant
- *o: the stem ends cvc, where the second c is not W, X, or Y

Conditions may be combined, e.g., (m > 1 and (*S or *T)), or (*d and not (*L or *S or *Z)).
An algorithm for suffix stripping (4)
Example rules:
- Step 1a: SSES → SS (caresses → caress); IES → I (ponies → poni)
- Step 1b: (m > 0) EED → EE (feed → feed, agreed → agree)
- Step 4: (m > 1) AL → "" (revival → reviv); (m > 1) ANCE → "" (allowance → allow)
- Step 5b: (m > 1 and *d and *L) → single letter (controll → control, roll → roll)
Reference
- Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, 2002
- Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998
- Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar, Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification, 2001
- David D. Lewis and Marc Ringuette, A Comparison of Two Learning Algorithms for Text Categorization, 1994
- Luigi Galavotti, Fabrizio Sebastiani, and Maria Simi, Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization
- Hirotoshi Taira and Masahiko Haruno, Text Categorization Using Transductive Boosting, 2001
- Sally Jo Cunningham, James Littin, and Ian H. Witten, Applications of Machine Learning in Information Retrieval, 2000
- M. F. Porter, An Algorithm for Suffix Stripping, 1980
Reference (WWW)
http://www.cs.cornell.edu/People/tj/
http://faure.iei.pi.cnr.it/~fabrizio/