
Text Categorization

PengBo, 10/31/2010

Outline

• Text Categorization
  • Problem definition
  • Build a Classifier
    • Naïve Bayes Classifier
    • K-Nearest Neighbor Classifier
  • Evaluation

Definition

Given:
• An instance x ∈ X, where X is the instance language or instance space.
  • Issue: how to represent text documents.
• A fixed set of categories C = {c1, c2, …, cn}.

Determine:
• The category of x: c(x) ∈ C, where c(x) is a categorization function.
• We want to know how to build categorization functions ("classifiers").

Text Categorization Examples

Assign labels to each document or web page:
• Labels are most often topics, such as Yahoo categories
  • e.g., "finance", "sports", "news>world>asia>business"
• Labels may be genres
  • e.g., "editorials", "movie-reviews", "news"
• Labels may be opinion
  • e.g., "like", "hate", "neutral"
• Labels may be domain-specific binary
  • e.g., "interesting-to-me" : "not-interesting-to-me"
  • e.g., "spam" : "not-spam"
  • e.g., "contains adult language" : "doesn't"

Classification Methods

• Manual classification
  • Used by Yahoo!, Looksmart, about.com, ODP, Medline
  • Accurate but expensive to scale
• Automatic document classification
  • Rule-based: hand-coded rule-based systems
    • Spam mail filters, …
  • Supervised learning of a document-label assignment function
    • No free lunch: requires hand-classified training data
• Note that many commercial systems use a mixture of methods

Think about it…

• How to represent text documents and categories?
  • Vectors & regions
  • Strings & language (models)
• How to build categorization functions?
  • Closeness/similarity to regions
  • Probability to generate the string / language model

K-Nearest Neighbors

Classes in a Vector Space

[Figure: training documents plotted in a vector space, grouped into regions labeled Government, Science, and Arts.]

Classification Using Vector Spaces

• Each training doc is a point (vector), labeled by its topic (= class)
• Hypothesis: docs of the same class form a contiguous region of space
• We define surfaces to delineate classes in the space

Test Document = Government

[Figure: a test document falls inside the Government region of the vector space, next to the Science and Arts regions.]

Is the similarity hypothesis true in general?

k Nearest Neighbor Classification

To classify document d into class c:
• Define the k-neighborhood N as the k nearest neighbors of d
• Count the number i of documents in N that belong to c
• Estimate P(c|d) as i/k
• Choose as class argmax_c P(c|d) [= majority class]
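A minimal sketch of this decision rule, assuming documents are already represented as unit-length tf.idf vectors so that the dot product equals cosine similarity; the function name and array layout are illustrative, not from the slides:

    from collections import Counter
    import numpy as np

    def knn_classify(test_vec, train_vecs, train_labels, k=3):
        # Similarity of the test document to every training document
        sims = train_vecs @ test_vec
        # Indices of the k most similar training documents (the k-neighborhood N)
        neighbors = np.argsort(-sims)[:k]
        # Majority vote: argmax_c P(c|d), with P(c|d) estimated as i/k
        votes = Counter(train_labels[i] for i in neighbors)
        return votes.most_common(1)[0][0]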

Example: k=6 (6NN)

[Figure: a test document and its 6 nearest neighbors, drawn from the Government, Science, and Arts regions of the vector space.]

P(science | test document) = ?

Nearest-Neighbor Learning Algorithm

• Learning is just storing the representations of the training examples in D.
• Testing instance x:
  • Compute the similarity between x and all examples in D.
  • Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  • Case-based learning
  • Memory-based learning
  • Lazy learning

Why K?

• Using only the closest example to determine the categorization is subject to errors due to:
  • A single atypical example.
  • Noise (i.e., error) in the category label of a single training example.
• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
• The value of k is typically odd to avoid ties; 3 and 5 are most common.

kNN decision boundaries

[Figure: kNN decision boundaries between the Government, Science, and Arts regions.]

Boundaries are in principle arbitrary surfaces – but usually polyhedra

Similarity Metrics

• The nearest neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• Simplest for an m-dimensional binary instance space is Hamming distance (number of feature values that differ).
• For text, cosine similarity of tf.idf weighted vectors is typically most effective.
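A small sketch of the text case, computing cosine similarity between unit-normalized tf.idf vectors; the sparse dictionary layout, the term_counts argument, and the precomputed idf table are assumptions for illustration:

    import math

    def tfidf_vector(term_counts, idf):
        # tf.idf weight per term, then normalize the vector to unit length
        vec = {t: tf * idf.get(t, 0.0) for t, tf in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def cosine(u, v):
        # Dot product of unit-length sparse vectors = cosine similarity
        return sum(w * v.get(t, 0.0) for t, w in u.items())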

Illustration of 3 Nearest Neighbor for Text Vector Space

Nearest Neighbor with Inverted Index

• Naively finding the nearest neighbors requires a linear search through the |D| documents in the collection.
• But determining the k nearest neighbors is the same as determining the top-k best retrievals, using the test document as a query to a database of training documents.
• Use standard vector space inverted index methods to find the k nearest neighbors.
• Testing time: O(B·|Vt|), where B is the average number of training documents in which a test-document word appears. Typically B << |D|.
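A rough sketch of this idea, scoring training documents through postings lists with the test document as the query; the postings layout (term → list of (doc id, weight) pairs) and the function name are assumptions, not the slides' notation:

    from collections import defaultdict

    def knn_via_index(test_weights, postings, k=3):
        # Accumulate cosine contributions only for terms that occur in the test document
        scores = defaultdict(float)
        for term, q_weight in test_weights.items():
            for doc_id, d_weight in postings.get(term, []):
                scores[doc_id] += q_weight * d_weight
        # The top-k best retrievals are the k nearest neighbors
        return sorted(scores, key=scores.get, reverse=True)[:k]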

kNN: Discussion

• No training necessary
• No feature selection necessary
• Scales well with a large number of classes
  • Don't need to train n classifiers for n classes
• Classes can influence each other
  • Small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities

Naïve Bayes

Bayes Classifiers

Task: classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, …, xn⟩, into one of the classes cj ∈ C.

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \dots, x_n)$

$= \arg\max_{c_j \in C} \dfrac{P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \dots, x_n)}$

$= \arg\max_{c_j \in C} P(x_1, x_2, \dots, x_n \mid c_j)\, P(c_j)$

Naïve Bayes Assumption

• P(cj)
  • Can be estimated from the frequency of classes in the training examples.
• P(x1, x2, …, xn | cj)
  • $O(|X|^n \cdot |C|)$ parameters
  • Could only be estimated if a very, very large number of training examples was available.

[Figure: a naïve Bayes network with class node Flu and feature nodes X1 … X5 = fever, sinus, cough, runny nose, muscle-ache.]

Conditional Independence Assumption

• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
• Features detect term presence and are independent of each other given the class:

$P(X_1, \dots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$

Learning the Model

First attempt: maximum likelihood estimates; simply use the frequencies in the data:

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$

$\hat{P}(c_j) = \dfrac{N(C = c_j)}{N}$

[Figure: a naïve Bayes network with class node C and feature nodes X1 … X6.]

What if we have seen no training cases where patient had no flu and muscle aches?

Zero probabilities cannot be conditioned away, no matter the other evidence!

Problem with Max Likelihood

$\hat{P}(X_5 = t \mid C = nf) = \dfrac{N(X_5 = t, C = nf)}{N(C = nf)} = 0$

$\Rightarrow$ the score $\hat{P}(c) \prod_i \hat{P}(x_i \mid c)$ used in $\arg\max_c$ becomes zero for that class whenever $x_5 = t$.

[Figure: the naïve Bayes flu network again (fever, sinus, cough, runny nose, muscle-ache), with the factorization $P(X_1, \dots, X_5 \mid C) = P(X_1 \mid C) \cdots P(X_5 \mid C)$.]

Smoothing to Eliminate Zeros

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$,  where k = the number of values of $X_i$

• Add-one smoothing (Laplace smoothing)
• Acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.
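A minimal sketch of these two estimates; the count arguments and function names are illustrative placeholders, not an established API:

    def estimate_prior(class_counts, n_docs):
        # MLE of the class prior: P_hat(c_j) = N(C = c_j) / N
        return {c: n / n_docs for c, n in class_counts.items()}

    def estimate_likelihood(count_xi_and_c, count_c, num_values):
        # Laplace-smoothed P_hat(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k)
        return (count_xi_and_c + 1) / (count_c + num_values)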

Document Generative Model

"Love is patient, love is kind."
• Basic representation: bag of words
• A binary independence model (multivariate binomial generation)
  • Feature Xi is a term
  • Value Xi = 1 or 0, indicating whether term Xi is present in the doc or not
• A multinomial unigram language model (multinomial generation)
  • Feature Xi is a term position
  • Value of Xi = the term at position i
  • Position independent

[Figure: the example sentence drawn as a bag of words: love, is, patient, kind.]

Bernoulli Naive Bayes Classifiers

• Multivariate binomial model
  • One feature Xw for each word in the dictionary
  • Value Xw = true in document d if w appears in d
• Naive Bayes assumption:
  • Given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears

[Figure: bag-of-words illustration: love, is, patient, kind.]

Multinomial Naive Bayes Classifiers II

• Multinomial = class-conditional unigram
  • One feature Xi for each word position in the document
  • The feature's values are all the words in the dictionary
  • Value of Xi is the word in position i
• Naïve Bayes assumption:
  • Given the document's topic, the word in one position in the document tells us nothing about the words in other positions

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$

• Still too many possibilities
• Assume that classification is independent of the positions of the words
  • Second assumption: word appearance does not depend on position
  • Just have one multinomial feature predicting all words
  • Use the same parameters for each position

Multinomial Naive Bayes Classifiers

$P(X_i = w \mid c) = P(X_j = w \mid c)$  for all positions i, j, word w, and class c

Parameter estimation

• Binomial model: $\hat{P}(X_w = t \mid c_j)$ = the fraction of documents of topic $c_j$ in which word w appears
• Multinomial model: $\hat{P}(X_i = w \mid c_j)$ = the fraction of times in which word w appears across all documents of topic $c_j$

Naive Bayes algorithm (Multinomial model)
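The pseudocode for this slide did not survive extraction; below is a minimal Python sketch of the multinomial model (training with add-one smoothing, classification in log space). It follows the standard textbook procedure, but the function names and data layout (token lists plus parallel labels) are assumptions for illustration:

    import math
    from collections import Counter

    def train_multinomial_nb(docs, labels):
        # docs: list of token lists; labels: class label for each doc
        vocab = {t for doc in docs for t in doc}
        prior, cond = {}, {}
        for c in set(labels):
            class_docs = [d for d, l in zip(docs, labels) if l == c]
            prior[c] = len(class_docs) / len(docs)
            tf = Counter(t for d in class_docs for t in d)        # term counts within class c
            total = sum(tf.values())
            # Add-one smoothed P(w | c)
            cond[c] = {t: (tf[t] + 1) / (total + len(vocab)) for t in vocab}
        return vocab, prior, cond

    def apply_multinomial_nb(vocab, prior, cond, doc):
        scores = {}
        for c in prior:
            score = math.log(prior[c])
            for t in doc:
                if t in vocab:                                    # ignore out-of-vocabulary terms
                    score += math.log(cond[c][t])
            scores[c] = score
        return max(scores, key=scores.get)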

Naive Bayes algorithm (Bernoulli model)
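Similarly, a sketch of the Bernoulli model, where only the presence or absence of each vocabulary term matters and absent terms also contribute to the score; again, the names and data layout are illustrative:

    import math
    from collections import Counter

    def train_bernoulli_nb(docs, labels):
        vocab = {t for doc in docs for t in doc}
        prior, cond = {}, {}
        for c in set(labels):
            class_docs = [set(d) for d, l in zip(docs, labels) if l == c]
            n_c = len(class_docs)
            prior[c] = n_c / len(docs)
            df = Counter(t for d in class_docs for t in d)        # in how many class-c docs t occurs
            # Add-one smoothed P(X_t = 1 | c): fraction of class-c docs containing t
            cond[c] = {t: (df[t] + 1) / (n_c + 2) for t in vocab}
        return vocab, prior, cond

    def apply_bernoulli_nb(vocab, prior, cond, doc):
        present = set(doc)
        scores = {}
        for c in prior:
            score = math.log(prior[c])
            for t in vocab:                                       # every vocabulary term contributes
                p = cond[c][t]
                score += math.log(p if t in present else 1 - p)
            scores[c] = score
        return max(scores, key=scores.get)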

NB Example

c(5)=?

NB Example

c(5)=?

Multinomial NB Classifier

Feature likelihood estimate

Posterior

Result: c(5) = China

NB Example

c(5)=?

Bernoulli NB Classifier

Feature likelihood estimate

Posterior

Result: c(5) ≠ China

Classification

Multinomial vs Multivariate binomial?

Multinomial is in general better

Classification Evaluation

Let’s think about it…

How to do evaluation experiments for classifiers?
• Dataset: training set / test set
• Measures: recall, precision, F1, accuracy
• Generalization performance

Classic Reuters Data Set

• Most (over)used data set: 21578 documents
• 9603 training, 3299 test articles (ModApte split)
• 118 categories
  • An article can be in more than one category
  • Learn 118 binary category distinctions
• Average document: about 90 types, 200 tokens
• Average number of classes assigned: 1.24 for docs with at least one category
• Only about 10 of the 118 categories are large
• Common categories (#train, #test):
  • Earn (2877, 1087)
  • Acquisitions (1650, 179)
  • Money-fx (538, 179)
  • Grain (433, 149)
  • Crude (389, 189)
  • Trade (369, 119)
  • Interest (347, 131)
  • Ship (197, 89)
  • Wheat (212, 71)
  • Corn (182, 56)

Measuring Classification: Figures of Merit

• Accuracy of classification
  • Main evaluation criterion in academia
• Speed of training a statistical classifier
  • Some methods are very cheap; some very costly
• Speed of classification (docs/hour)
  • No big differences for most algorithms
  • Exceptions: kNN, complex preprocessing requirements
• Effort in creating the training set / hand-built classifier
  • Human hours per topic

Measuring Classification: Figures of Merit

In the real world, economic measures. Your choices are:
• Do no classification
  • That has a cost (hard to compute)
• Do it all manually
  • Has an easy-to-compute cost if doing it like that now
• Do it all with an automatic classifier
  • Mistakes have a cost
• Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases
• Commonly the last method is most cost-efficient and is adopted

Per class evaluation measures

Let $c_{ij}$ denote the number of documents whose actual class is i and whose predicted class is j (the entries of the contingency table below).

• Recall: fraction of docs in class i classified correctly:

$R_i = \dfrac{c_{ii}}{\sum_j c_{ij}}$

• Precision: fraction of docs assigned class i that are actually about class i:

$P_i = \dfrac{c_{ii}}{\sum_j c_{ji}}$

• "Correct rate" (1 − error rate): fraction of docs classified correctly:

$\dfrac{\sum_i c_{ii}}{\sum_{i,j} c_{ij}}$

[Figure: a 3×3 contingency table with actual classes A, B, C on one axis and predicted classes A, B, C on the other.]
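A small sketch of these measures computed from a contingency matrix, assuming rows index the actual class and columns the predicted class (a layout consistent with the formulas above; the function name is illustrative):

    import numpy as np

    def per_class_measures(conf):
        # conf[i, j] = number of docs with actual class i predicted as class j
        tp = np.diag(conf).astype(float)
        recall = tp / conf.sum(axis=1)          # c_ii / sum_j c_ij
        precision = tp / conf.sum(axis=0)       # c_ii / sum_j c_ji
        correct_rate = tp.sum() / conf.sum()    # sum_i c_ii / sum_ij c_ij
        return precision, recall, correct_rate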

Measuring Classification

• Overall error rate
  • Not a good measure for small classes. Why?
• Precision/recall for classification decisions
  • F1 measure: 1/F1 = ½ (1/P + 1/R)
• Correct estimate of the size of a category
  • Why is this different?
• Stability over time / category drift
• Utility
  • Costs of false positives / false negatives may be different
  • For example, cost = tp − 0.5·fp

Generalization Performance

• Results can vary based on sampling error due to different training and test sets.
• Average results over multiple training and test sets (splits of the overall data) for the best results.
• Ideally, test and training sets are independent on each trial.
  • But this would require too much labeled data.

Good practice department

N-Fold Cross-Validation
• Partition the data into N equal-sized disjoint segments.
• Run N trials, each time using a different segment of the data for testing and training on the remaining N−1 segments.
• This way, at least the test sets are independent.
• Report the average classification accuracy over the N trials.
• Typically, N = 10.
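A minimal sketch of this procedure; evaluate is a placeholder for a function that trains a classifier on the training split and returns its accuracy on the test split (the callback and the shuffling are assumptions, not part of the slides):

    import random

    def n_fold_cross_validation(data, evaluate, n=10, seed=0):
        # Partition the data into n roughly equal-sized disjoint segments
        items = list(data)
        random.Random(seed).shuffle(items)
        folds = [items[i::n] for i in range(n)]
        accuracies = []
        for i in range(n):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            accuracies.append(evaluate(train, test))    # train on n-1 folds, test on the held-out fold
        return sum(accuracies) / n                      # average accuracy over the n trials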

Good practice department II

Learning Curves
• We would like to know how performance varies with the number of training instances.
• Learning curves plot classification accuracy on independent test data (Y axis) versus the number of training examples (X axis).
• One can do both of the above and produce learning curves averaged over multiple trials from cross-validation.

How to combine multiple measures?

• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: compute performance for each class, then average.
• Microaveraging: collect decisions for all classes, compute the contingency table, evaluate.

Micro- vs. Macro-Averaging: Example

Class 1:
                     Truth: yes   Truth: no
  Classifier: yes        10           10
  Classifier: no         10          970

Class 2:
                     Truth: yes   Truth: no
  Classifier: yes        90           10
  Classifier: no         10          890

Micro-averaged table (pooled over both classes):
                     Truth: yes   Truth: no
  Classifier: yes       100           20
  Classifier: no         20         1860

• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 ≈ 0.83
• Why this difference?
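A short sketch reproducing this example; each class is given as (tp, fp, fn, tn) counts read off its 2×2 table (this tuple layout is an assumption for illustration):

    def macro_precision(tables):
        # Average of the per-class precisions
        return sum(tp / (tp + fp) for tp, fp, fn, tn in tables) / len(tables)

    def micro_precision(tables):
        # Pool all decisions into one contingency table, then compute precision
        tp = sum(t[0] for t in tables)
        fp = sum(t[1] for t in tables)
        return tp / (tp + fp)

    tables = [(10, 10, 10, 970), (90, 10, 10, 890)]
    print(macro_precision(tables))   # (0.5 + 0.9) / 2 = 0.7
    print(micro_precision(tables))   # 100 / 120 ≈ 0.83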

Exercise

• The Federalist Papers: 77 short essays published under a pseudonym by Hamilton, Jay, and Madison in 1787-1788 to persuade New York to support the US Constitution.
• The authorship of 12 of the papers is disputed.
• Who is the author?

Author identification

• In 1964, Mosteller and Wallace* solved the problem.
  • *Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.
• It's a text categorization problem.
• They identified 70 function words as good candidates for authorship analysis.
• Using statistical inference, they concluded the author was Madison.

[Figure: the authorship-identification pipeline: feature selection feeding a classifier.]

Function Words for Author Identification
[Figure: table of the function words used for author identification.]

Summary

• Definition
  • The category of x: c(x) ∈ C
• K-Nearest Neighbor
• Naïve Bayes
  • Bayesian methods
  • Bernoulli NB classifier
  • Multinomial NB classifier

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$

• Categorization evaluation
  • Training data / test data
  • Over-fitting & generalization

Thank You!

Q&A

Readings

[1] IIR, Ch. 13 and Ch. 14.2.
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999.

Bernoulli trial

A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".

Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
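For reference (a standard formula, not on the original slide), the probability of exactly x successes in n such trials is

$P(X = x) = \binom{n}{x}\, p^x (1-p)^{n-x}, \qquad x = 0, 1, \dots, n.$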

Multinomial Distribution

• The multinomial distribution is a generalization of the binomial distribution.
• Each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk.
• There are n independent trials.
• We can use a random variable Xi to indicate the number of times outcome number i was observed over the n trials.
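For reference (again a standard formula added for completeness), the corresponding probability mass function is

$P(X_1 = x_1, \dots, X_k = x_k) = \dfrac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, \qquad \textstyle\sum_i x_i = n.$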

Bayes’ Rule

posterior = likelihood × prior / evidence:

$P(h \mid D) = \dfrac{P(D \mid h)\, P(h)}{P(D)}$

Use Bayes Rule to Gamble

• Someone draws an envelope at random and offers to sell it to you. How much should you pay?
• Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's red: how much should you pay?

Prosecutor's fallacy

You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials.

At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely.

Maximum a posteriori Hypothesis

$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \dfrac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$

(the last step holds because P(D) is constant across hypotheses)

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

$h_{ML} = \arg\max_{h \in H} P(D \mid h)$

Likelihood

• A likelihood function is a conditional probability function considered as a function of its second argument, with its first argument held fixed.
• Given a parameterized family of probability density functions

$x \mapsto f(x \mid \theta),$

where θ is the parameter, the likelihood function is

$L(\theta \mid x) = f(x \mid \theta),$

where x is the observed outcome of an experiment.
• When f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function; when viewed as a function of θ with x fixed, it is a likelihood function.

Reuters Text Categorization data set (Reuters-21578) document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

New Reuters: RCV1: 810,000 docs

Top topics in Reuters RCV1

PKU Tianwang (北大天网): Chinese Web Page Classification
• By recruiting dozens of students from different majors, a large-scale Chinese web page sample set based on a hierarchical model was built by manual selection.
• It contains 12,336 training web page instances and 3,269 test web page instances, distributed over 12 top-level classes and 733 categories in total; on average each category has 17 training instances and 4.5 test instances.
• Tianwang provides the sample set free of charge to interested peers (Yanqiong product number: YQ-WEBENCH-V0.8).
• Classification evaluations are carried out at the Chinese Information Retrieval Forum (www.cwirf.org) and the national SEWM workshop (Search Engines and Web Mining).

PKU Tianwang: Chinese Web Page Classification (category statistics)

No.  Category name              # categories   # training samples   # test samples
 1   Humanities & Arts                24               419                110
 2   News & Media                      7               125                 19
 3   Business & Economy               48               839                214
 4   Entertainment & Leisure          88              1510                374
 5   Computers & Internet             58               925                238
 6   Education                        18               286                 85
 7   Countries & Regions              53               891                235
 8   Natural Sciences                113              1892                514
 9   Government & Politics            18               288                 84
10   Social Sciences                 104              1765                479
11   Health & Medicine               136              2295                616
12   Society & Culture                66              1101                301
     Total                           733             12336               3269

Concept Drift

• Categories change over time.
• Example: "president of the united states"
  • 1999: "clinton" is a great feature
  • 2002: "clinton" is a bad feature
• One measure of a text classification system is how well it protects against concept drift.
• Feature selection can be bad at protecting against concept drift.

Think about it…

• Describe the process: can you tell what it is?
• How do you do it?
• Why do we talk about it here, and what does it mean for information overload?

Eagle (鹰) vs. broiler chicken (肉鸡)

Recall: Vector Space Representation

• Each document is a vector, with one component for each term (= word).
• Normalize to unit length.
• High-dimensional vector space:
  • Terms are axes
  • 10,000+ dimensions, or even 100,000+
  • Docs are vectors in this space
