
Text Categorization

PengBo, 10/31/2010

Outline

• Text Categorization
  • Problem definition
  • Build a Classifier
    • Naïve Bayes Classifier
    • K-Nearest Neighbor Classifier
  • Evaluation

Definition

Given:
• An instance x ∈ X, where X is the instance language or instance space.
  • Issue: how to represent text documents.
• A fixed set of categories C = {c1, c2, …, cn}.

Determine:
• The category of x: c(x) ∈ C, where c(x) is a categorization function.
• We want to know how to build categorization functions ("classifiers").

Text Categorization Examples

Assign labels to each document or web page:
• Labels are most often topics, such as Yahoo categories
  • e.g., "finance", "sports", "news>world>asia>business"
• Labels may be genres
  • e.g., "editorials", "movie-reviews", "news"
• Labels may be opinion
  • e.g., "like", "hate", "neutral"
• Labels may be domain-specific binary
  • e.g., "interesting-to-me" : "not-interesting-to-me"
  • e.g., "spam" : "not-spam"
  • e.g., "contains adult language" : "doesn't"

Classification Methods

• Manual classification
  • Used by Yahoo!, Looksmart, about.com, ODP, Medline
  • Accurate but expensive to scale
• Automatic document classification
  • Rule-based: hand-coded rule-based systems
    • Spam mail filters, …
  • Supervised learning of a document-label assignment function
    • No free lunch: requires hand-classified training data
• Note that many commercial systems use a mixture of methods

Think about it…

• How to represent text documents and categories?
  • Vectors & regions
  • Strings & language (models)
• How to build categorization functions?
  • Closeness/similarity to regions
  • Probability to generate the string / language model

K-Nearest Neighbors

Classes in a Vector Space

[Figure: training documents plotted in a vector space, grouped into regions labeled Government, Science, and Arts.]

Classification Using Vector Spaces

• Each training doc is a point (vector), labeled by its topic (= class)
• Hypothesis: docs of the same class form a contiguous region of space
• We define surfaces to delineate classes in the space

Test Document = Government

[Figure: a test document falls inside the Government region of the vector space, next to the Science and Arts regions.]

Is the similarity hypothesis true in general?

k Nearest Neighbor Classification

To classify document d into class c:
• Define the k-neighborhood N as the k nearest neighbors of d
• Count the number i of documents in N that belong to c
• Estimate P(c|d) as i/k
• Choose as class argmax_c P(c|d) [= majority class]
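A minimal sketch of this decision rule, assuming documents are already represented as unit-length tf.idf vectors so that the dot product equals cosine similarity; the function name and array layout are illustrative, not from the slides:

    from collections import Counter
    import numpy as np

    def knn_classify(test_vec, train_vecs, train_labels, k=3):
        # Similarity of the test document to every training document
        sims = train_vecs @ test_vec
        # Indices of the k most similar training documents (the k-neighborhood N)
        neighbors = np.argsort(-sims)[:k]
        # Majority vote: argmax_c P(c|d), with P(c|d) estimated as i/k
        votes = Counter(train_labels[i] for i in neighbors)
        return votes.most_common(1)[0][0]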

Example: k=6 (6NN)

[Figure: a test document and its 6 nearest neighbors, drawn from the Government, Science, and Arts regions of the vector space.]

P(science | test document) = ?

Nearest-Neighbor Learning Algorithm

• Learning is just storing the representations of the training examples in D.
• Testing instance x:
  • Compute the similarity between x and all examples in D.
  • Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  • Case-based learning
  • Memory-based learning
  • Lazy learning

Why K?

• Using only the closest example to determine the categorization is subject to errors due to:
  • A single atypical example.
  • Noise (i.e., error) in the category label of a single training example.
• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
• The value of k is typically odd to avoid ties; 3 and 5 are most common.

kNN decision boundaries

[Figure: kNN decision boundaries between the Government, Science, and Arts regions.]

Boundaries are in principle arbitrary surfaces – but usually polyhedra

Similarity Metrics

• The nearest neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• Simplest for an m-dimensional binary instance space is Hamming distance (number of feature values that differ).
• For text, cosine similarity of tf.idf weighted vectors is typically most effective.
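A small sketch of the text case, computing cosine similarity between unit-normalized tf.idf vectors; the sparse dictionary layout, the term_counts argument, and the precomputed idf table are assumptions for illustration:

    import math

    def tfidf_vector(term_counts, idf):
        # tf.idf weight per term, then normalize the vector to unit length
        vec = {t: tf * idf.get(t, 0.0) for t, tf in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def cosine(u, v):
        # Dot product of unit-length sparse vectors = cosine similarity
        return sum(w * v.get(t, 0.0) for t, w in u.items())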

Illustration of 3 Nearest Neighbor for Text Vector Space

Nearest Neighbor with Inverted Index

• Naively finding the nearest neighbors requires a linear search through the |D| documents in the collection.
• But determining the k nearest neighbors is the same as determining the top-k best retrievals, using the test document as a query to a database of training documents.
• Use standard vector space inverted index methods to find the k nearest neighbors.
• Testing time: O(B·|Vt|), where B is the average number of training documents in which a test-document word appears. Typically B << |D|.
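A rough sketch of this idea, scoring training documents through postings lists with the test document as the query; the postings layout (term → list of (doc id, weight) pairs) and the function name are assumptions, not the slides' notation:

    from collections import defaultdict

    def knn_via_index(test_weights, postings, k=3):
        # Accumulate cosine contributions only for terms that occur in the test document
        scores = defaultdict(float)
        for term, q_weight in test_weights.items():
            for doc_id, d_weight in postings.get(term, []):
                scores[doc_id] += q_weight * d_weight
        # The top-k best retrievals are the k nearest neighbors
        return sorted(scores, key=scores.get, reverse=True)[:k]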

kNN: Discussion

• No training necessary
• No feature selection necessary
• Scales well with a large number of classes
  • Don't need to train n classifiers for n classes
• Classes can influence each other
  • Small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities

Naïve Bayes

Bayes Classifiers

Task: classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, …, xn⟩, into one of the classes cj ∈ C.

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \dots, x_n)$

$= \arg\max_{c_j \in C} \dfrac{P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \dots, x_n)}$

$= \arg\max_{c_j \in C} P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)$

Naïve Bayes Assumption

• P(cj)
  • Can be estimated from the frequency of classes in the training examples.
• P(x1, x2, …, xn | cj)
  • $O(|X|^n \cdot |C|)$ parameters
  • Could only be estimated if a very, very large number of training examples was available.

[Figure: a naïve Bayes network with class node Flu and feature nodes X1 … X5 = fever, sinus, cough, runny nose, muscle-ache.]

Conditional Independence Assumption

• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
• Features detect term presence and are independent of each other given the class:

$P(X_1, \dots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$

Learning the Model

First attempt: maximum likelihood estimates; simply use the frequencies in the data:

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$

$\hat{P}(c_j) = \dfrac{N(C = c_j)}{N}$

[Figure: a naïve Bayes network with class node C and feature nodes X1 … X6.]

What if we have seen no training cases where patient had no flu and muscle aches?

Zero probabilities cannot be conditioned away, no matter the other evidence!

Problem with Max Likelihood

$\hat{P}(X_5 = t \mid C = nf) = \dfrac{N(X_5 = t, C = nf)}{N(C = nf)} = 0$

$\Rightarrow$ the score $\hat{P}(c) \prod_i \hat{P}(x_i \mid c)$ used in $\arg\max_c$ becomes zero for that class whenever $x_5 = t$.

[Figure: the naïve Bayes flu network again (fever, sinus, cough, runny nose, muscle-ache), with the factorization $P(X_1, \dots, X_5 \mid C) = P(X_1 \mid C) \cdots P(X_5 \mid C)$.]

Smoothing to Eliminate Zeros

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$,  where k = the number of values of $X_i$

• Add-one smoothing (Laplace smoothing)
• Acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.
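A minimal sketch of these two estimates; the count arguments and function names are illustrative placeholders, not an established API:

    def estimate_prior(class_counts, n_docs):
        # MLE of the class prior: P_hat(c_j) = N(C = c_j) / N
        return {c: n / n_docs for c, n in class_counts.items()}

    def estimate_likelihood(count_xi_and_c, count_c, num_values):
        # Laplace-smoothed P_hat(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k)
        return (count_xi_and_c + 1) / (count_c + num_values)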

Document Generative Model

"Love is patient, love is kind."
• Basic representation: bag of words
• A binary independence model (multivariate binomial generation)
  • Feature Xi is a term
  • Value Xi = 1 or 0, indicating whether term Xi is present in the doc or not
• A multinomial unigram language model (multinomial generation)
  • Feature Xi is a term position
  • Value of Xi = the term at position i
  • Position independent

[Figure: the example sentence drawn as a bag of words: love, is, patient, kind.]

Bernoulli Naive Bayes Classifiers

• Multivariate binomial model
  • One feature Xw for each word in the dictionary
  • Value Xw = true in document d if w appears in d
• Naive Bayes assumption:
  • Given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears

[Figure: bag-of-words illustration: love, is, patient, kind.]

Multinomial Naive Bayes Classifiers II

• Multinomial = class-conditional unigram
  • One feature Xi for each word position in the document
  • The feature's values are all the words in the dictionary
  • Value of Xi is the word in position i
• Naïve Bayes assumption:
  • Given the document's topic, the word in one position in the document tells us nothing about the words in other positions

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$

• Still too many possibilities
• Assume that classification is independent of the positions of the words
  • Second assumption: word appearance does not depend on position
  • Just have one multinomial feature predicting all words
  • Use the same parameters for each position

Multinomial Naive Bayes Classifiers

$P(X_i = w \mid c) = P(X_j = w \mid c)$  for all positions i, j, word w, and class c

Parameter estimation

• Binomial model: $\hat{P}(X_w = t \mid c_j)$ = the fraction of documents of topic $c_j$ in which word w appears
• Multinomial model: $\hat{P}(X_i = w \mid c_j)$ = the fraction of times in which word w appears across all documents of topic $c_j$

Naive Bayes algorithm (Multinomial model)
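The pseudocode for this slide did not survive extraction; below is a minimal Python sketch of the multinomial model (training with add-one smoothing, classification in log space). It follows the standard textbook procedure, but the function names and data layout (token lists plus parallel labels) are assumptions for illustration:

    import math
    from collections import Counter

    def train_multinomial_nb(docs, labels):
        # docs: list of token lists; labels: class label for each doc
        vocab = {t for doc in docs for t in doc}
        prior, cond = {}, {}
        for c in set(labels):
            class_docs = [d for d, l in zip(docs, labels) if l == c]
            prior[c] = len(class_docs) / len(docs)
            tf = Counter(t for d in class_docs for t in d)        # term counts within class c
            total = sum(tf.values())
            # Add-one smoothed P(w | c)
            cond[c] = {t: (tf[t] + 1) / (total + len(vocab)) for t in vocab}
        return vocab, prior, cond

    def apply_multinomial_nb(vocab, prior, cond, doc):
        scores = {}
        for c in prior:
            score = math.log(prior[c])
            for t in doc:
                if t in vocab:                                    # ignore out-of-vocabulary terms
                    score += math.log(cond[c][t])
            scores[c] = score
        return max(scores, key=scores.get)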

Naive Bayes algorithm (Bernoulli model)
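Similarly, a sketch of the Bernoulli model, where only the presence or absence of each vocabulary term matters and absent terms also contribute to the score; again, the names and data layout are illustrative:

    import math
    from collections import Counter

    def train_bernoulli_nb(docs, labels):
        vocab = {t for doc in docs for t in doc}
        prior, cond = {}, {}
        for c in set(labels):
            class_docs = [set(d) for d, l in zip(docs, labels) if l == c]
            n_c = len(class_docs)
            prior[c] = n_c / len(docs)
            df = Counter(t for d in class_docs for t in d)        # in how many class-c docs t occurs
            # Add-one smoothed P(X_t = 1 | c): fraction of class-c docs containing t
            cond[c] = {t: (df[t] + 1) / (n_c + 2) for t in vocab}
        return vocab, prior, cond

    def apply_bernoulli_nb(vocab, prior, cond, doc):
        present = set(doc)
        scores = {}
        for c in prior:
            score = math.log(prior[c])
            for t in vocab:                                       # every vocabulary term contributes
                p = cond[c][t]
                score += math.log(p if t in present else 1 - p)
            scores[c] = score
        return max(scores, key=scores.get)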

NB Example

c(5)=?

NB Example

c(5)=?

Multinomial NB Classifier

Feature likelihood estimate

Posterior

Result: c(5) = China

NB Example

c(5)=?

Bernoulli NB Classifier

Feature likelihood estimate

Posterior

Result: c(5) ≠ China

Classification

Multinomial vs Multivariate binomial?

Multinomial is in general better

Classification Evaluation

Let’s think about it…

How to do evaluation experiments for classifiers?
• Dataset: training set / test set
• Measures: recall, precision, F1, accuracy
• Generalization performance

Classic Reuters Data Set

• Most (over)used data set: 21578 documents
• 9603 training, 3299 test articles (ModApte split)
• 118 categories
  • An article can be in more than one category
  • Learn 118 binary category distinctions
• Average document: about 90 types, 200 tokens
• Average number of classes assigned: 1.24 for docs with at least one category
• Only about 10 of the 118 categories are large
• Common categories (#train, #test):
  • Earn (2877, 1087)
  • Acquisitions (1650, 179)
  • Money-fx (538, 179)
  • Grain (433, 149)
  • Crude (389, 189)
  • Trade (369, 119)
  • Interest (347, 131)
  • Ship (197, 89)
  • Wheat (212, 71)
  • Corn (182, 56)

Measuring Classification: Figures of Merit

• Accuracy of classification
  • Main evaluation criterion in academia
• Speed of training a statistical classifier
  • Some methods are very cheap; some very costly
• Speed of classification (docs/hour)
  • No big differences for most algorithms
  • Exceptions: kNN, complex preprocessing requirements
• Effort in creating the training set / hand-built classifier
  • Human hours per topic

Measuring Classification: Figures of Merit

In the real world, economic measures. Your choices are:
• Do no classification
  • That has a cost (hard to compute)
• Do it all manually
  • Has an easy-to-compute cost if doing it like that now
• Do it all with an automatic classifier
  • Mistakes have a cost
• Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
• Commonly the last method is most cost-efficient and is adopted

Per class evaluation measures

Let $c_{ij}$ denote the number of documents whose actual class is i and whose predicted class is j (the entries of the contingency table below).

• Recall: fraction of docs in class i classified correctly:

$R_i = \dfrac{c_{ii}}{\sum_j c_{ij}}$

• Precision: fraction of docs assigned class i that are actually about class i:

$P_i = \dfrac{c_{ii}}{\sum_j c_{ji}}$

• "Correct rate" (1 − error rate): fraction of docs classified correctly:

$\dfrac{\sum_i c_{ii}}{\sum_{i,j} c_{ij}}$

[Figure: a 3×3 contingency table with actual classes A, B, C on one axis and predicted classes A, B, C on the other.]
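A small sketch of these measures computed from a contingency matrix, assuming rows index the actual class and columns the predicted class (a layout consistent with the formulas above; the function name is illustrative):

    import numpy as np

    def per_class_measures(conf):
        # conf[i, j] = number of docs with actual class i predicted as class j
        tp = np.diag(conf).astype(float)
        recall = tp / conf.sum(axis=1)          # c_ii / sum_j c_ij
        precision = tp / conf.sum(axis=0)       # c_ii / sum_j c_ji
        correct_rate = tp.sum() / conf.sum()    # sum_i c_ii / sum_ij c_ij
        return precision, recall, correct_rate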

Measuring Classification

• Overall error rate
  • Not a good measure for small classes. Why?
• Precision/recall for classification decisions
  • F1 measure: 1/F1 = ½ (1/P + 1/R)
• Correct estimate of the size of a category
  • Why is this different?
• Stability over time / category drift
• Utility
  • Costs of false positives / false negatives may be different
  • For example, cost = tp − 0.5·fp

Generalization Performance

• Results can vary based on sampling error due to different training and test sets.
• Average results over multiple training and test sets (splits of the overall data) for the best results.
• Ideally, test and training sets are independent on each trial.
  • But this would require too much labeled data.

Good practice department

N-Fold Cross-Validation
• Partition the data into N equal-sized disjoint segments.
• Run N trials, each time using a different segment of the data for testing and training on the remaining N−1 segments.
• This way, at least the test sets are independent.
• Report the average classification accuracy over the N trials.
• Typically, N = 10.
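A minimal sketch of this procedure; evaluate is a placeholder for a function that trains a classifier on the training split and returns its accuracy on the test split (the callback and the shuffling are assumptions, not part of the slides):

    import random

    def n_fold_cross_validation(data, evaluate, n=10, seed=0):
        # Partition the data into n roughly equal-sized disjoint segments
        items = list(data)
        random.Random(seed).shuffle(items)
        folds = [items[i::n] for i in range(n)]
        accuracies = []
        for i in range(n):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            accuracies.append(evaluate(train, test))    # train on n-1 folds, test on the held-out fold
        return sum(accuracies) / n                      # average accuracy over the n trials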

Good practice department II

Learning Curves
• We would like to know how performance varies with the number of training instances.
• Learning curves plot classification accuracy on independent test data (Y axis) versus the number of training examples (X axis).
• One can do both of the above and produce learning curves averaged over multiple trials from cross-validation.

How to combine multiple measures?

• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: compute performance for each class, then average.
• Microaveraging: collect decisions for all classes, compute the contingency table, evaluate.

Micro- vs. Macro-Averaging: Example

Class 1:
                     Truth: yes   Truth: no
  Classifier: yes        10           10
  Classifier: no         10          970

Class 2:
                     Truth: yes   Truth: no
  Classifier: yes        90           10
  Classifier: no         10          890

Micro-averaged table (pooled over both classes):
                     Truth: yes   Truth: no
  Classifier: yes       100           20
  Classifier: no         20         1860

• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 ≈ 0.83
• Why this difference?
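A short sketch reproducing this example; each class is given as (tp, fp, fn, tn) counts read off its 2×2 table (this tuple layout is an assumption for illustration):

    def macro_precision(tables):
        # Average of the per-class precisions
        return sum(tp / (tp + fp) for tp, fp, fn, tn in tables) / len(tables)

    def micro_precision(tables):
        # Pool all decisions into one contingency table, then compute precision
        tp = sum(t[0] for t in tables)
        fp = sum(t[1] for t in tables)
        return tp / (tp + fp)

    tables = [(10, 10, 10, 970), (90, 10, 10, 890)]
    print(macro_precision(tables))   # (0.5 + 0.9) / 2 = 0.7
    print(micro_precision(tables))   # 100 / 120 ≈ 0.83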

Exercise

• The Federalist Papers: 77 short essays published under a pseudonym by Hamilton, Jay, and Madison in 1787-1788 to persuade New York to support the US Constitution.
• The authorship of 12 of the papers is disputed.
• Who is the author?

Author identification

• In 1964, Mosteller and Wallace* solved the problem.
  • *Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.
• It's a text categorization problem.
• They identified 70 function words as good candidates for authorship analysis.
• Using statistical inference, they concluded the author was Madison.

[Figure: the authorship-identification pipeline: feature selection feeding a classifier.]

Function Words for Author Identification
[Figure: table of the function words used for author identification.]

Summary

• Definition
  • The category of x: c(x) ∈ C
• K-Nearest Neighbor
• Naïve Bayes
  • Bayesian methods
  • Bernoulli NB classifier
  • Multinomial NB classifier

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$

• Categorization evaluation
  • Training data / test data
  • Over-fitting & generalization

Thank You!

Q&A

Readings

[1] IIR, Ch. 13 and Ch. 14.2.
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999.

Bernoulli trial

A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".

Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
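For reference (a standard formula, not on the original slide), the probability of exactly x successes in n such trials is

$P(X = x) = \binom{n}{x}\, p^x (1-p)^{n-x}, \qquad x = 0, 1, \dots, n.$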

Multinomial Distribution

• The multinomial distribution is a generalization of the binomial distribution.
• Each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk.
• There are n independent trials.
• We can use a random variable Xi to indicate the number of times outcome number i was observed over the n trials.
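For reference (again a standard formula added for completeness), the corresponding probability mass function is

$P(X_1 = x_1, \dots, X_k = x_k) = \dfrac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, \qquad \textstyle\sum_i x_i = n.$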

Bayes’ Rule

posterior = likelihood × prior / evidence:

$P(h \mid D) = \dfrac{P(D \mid h)\, P(h)}{P(D)}$

Use Bayes Rule to Gamble

• Someone draws an envelope at random and offers to sell it to you. How much should you pay?
• Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's red: how much should you pay?

Prosecutor's fallacy

You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials.

At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely.

Maximum a posteriori Hypothesis

$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \dfrac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$

(the last step holds because P(D) is constant across hypotheses)

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

$h_{ML} = \arg\max_{h \in H} P(D \mid h)$

Likelihood

• A likelihood function is a conditional probability function considered as a function of its second argument, with its first argument held fixed.
• Given a parameterized family of probability density functions

$x \mapsto f(x \mid \theta),$

where θ is the parameter, the likelihood function is

$L(\theta \mid x) = f(x \mid \theta),$

where x is the observed outcome of an experiment.
• When f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function; when viewed as a function of θ with x fixed, it is a likelihood function.

Reuters Text Categorization data set (Reuters-21578) document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

New Reuters: RCV1: 810,000 docs

Top topics in Reuters RCV1

PKU Tianwang (北大天网): Chinese Web Page Classification
• By recruiting dozens of students from different majors, a large-scale Chinese web page sample set based on a hierarchical model was built by manual selection.
• It contains 12,336 training web page instances and 3,269 test web page instances, distributed over 12 top-level classes and 733 categories in total; on average each category has 17 training instances and 4.5 test instances.
• Tianwang provides the sample set free of charge to interested peers (Yanqiong product number: YQ-WEBENCH-V0.8).
• Classification evaluations are carried out at the Chinese Information Retrieval Forum (www.cwirf.org) and the national SEWM workshop (Search Engines and Web Mining).

PKU Tianwang: Chinese Web Page Classification (category statistics)

No.  Category name              # categories   # training samples   # test samples
 1   Humanities & Arts                24               419                110
 2   News & Media                      7               125                 19
 3   Business & Economy               48               839                214
 4   Entertainment & Leisure          88              1510                374
 5   Computers & Internet             58               925                238
 6   Education                        18               286                 85
 7   Countries & Regions              53               891                235
 8   Natural Sciences                113              1892                514
 9   Government & Politics            18               288                 84
10   Social Sciences                 104              1765                479
11   Health & Medicine               136              2295                616
12   Society & Culture                66              1101                301
     Total                           733             12336               3269

Concept Drift

• Categories change over time.
• Example: "president of the united states"
  • 1999: "clinton" is a great feature
  • 2002: "clinton" is a bad feature
• One measure of a text classification system is how well it protects against concept drift.
• Feature selection can be bad at protecting against concept drift.

Think about it…

• Describe the process: can you tell what it is?
• How do you do it?
• Why do we talk about it here, and what does it mean for information overload?

Eagle (鹰) vs. broiler chicken (肉鸡)

Recall: Vector Space Representation

• Each document is a vector, with one component for each term (= word).
• Normalize to unit length.
• High-dimensional vector space:
  • Terms are axes
  • 10,000+ dimensions, or even 100,000+
  • Docs are vectors in this space
