Page 1: Text Classification

Text Classification

Eric Doi

Harvey Mudd College

November 20th, 2008

Page 2: Text Classification

Kinds of Classification

Language (the same sentence in English, Spanish, Japanese, and Chinese):

Hello. My name is Eric.
Hola. Mi nombre es Eric.
こんにちは。私の名前はエリックである。
你好。我叫 Eric。

Page 3: Text Classification

Kinds of Classification

Type

“approaches based on n-grams obtain generalization by concatenating”*

To: [email protected]

Subject: McCain and Obama use it too

You have received this message because you opted in to receive Sund Design special offers via email. Login to your member account to edit your email subscription. Click here to unsubscribe.

ACAAGATGCCATTGTCCCCCGGCCTCCTG

*(Bengio)

Page 4: Text Classification

Difficulties

Dictionary? Generalization?

Over 500,000 words in the English language

(and over one million if counting scientific words)

Typos / OCR errors

Loan words:

We practice ballet at the café.
Nous pratiquons le ballet au café.

Page 5: Text Classification

Approaches

Unique letter combinations

Language   String
English    “ery”
French     “eux”
Gaelic     “mh”
Italian    “cchi”

Dunning, Statistical Identification of Language


Page 7: Text Classification

Approaches

“Unique” letter combinations

Language   String
English    “ery”
French     “milieux”
Gaelic     “farmhand”
Italian    “zucchini”

Requires hand-coding; what about other languages (6,000+)?

Dunning, Statistical Identification of Language
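
A minimal sketch of this hand-coded approach, and of why it misfires (the RULES table and the guess_language helper are illustrative, not from the slides):

```python
# Hand-coded "telltale substring" rules, as listed on the slide.
RULES = {
    "eux": "French",
    "mh": "Gaelic",
    "cchi": "Italian",
    "ery": "English",
}

def guess_language(text):
    """Guess a language by looking for a telltale substring."""
    text = text.lower()
    for substring, language in RULES.items():
        if substring in text:
            return language
    return "unknown"

print(guess_language("nous vivons dans des milieux divers"))  # "French"
print(guess_language("The farmhand picked every zucchini"))   # "Gaelic" -- wrong; the text is English
```

The second call shows the slide's point: ordinary English words such as "farmhand" and "zucchini" contain the supposedly unique strings, and hand-coding rules for 6,000+ languages would not scale.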

Page 8: Text Classification

Approaches

Try to minimize: Hand-coded knowledge Training data Input data (isolating phrases?)

Dunning, “Statistical Identification of Language.” 1994.

Bengio, “A Neural Probabilistic Language Model.” 2003.

Pages 9–20: Text Classification

Statistical Approach: N-Grams

N-grams are sequences of n elements.

Professor Keller is not a goth.

Word-level bigrams: (Professor, Keller), (Keller, is), (is, not), (not, a), (a, goth)

Char-level trigrams: (P, r, o), (r, o, f), (o, f, e), (f, e, s), …
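
A minimal sketch of extracting these n-grams (the function names are illustrative):

```python
def word_ngrams(text, n):
    """Word-level n-grams: tuples of n consecutive words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Character-level n-grams: tuples of n consecutive characters."""
    return [tuple(text[i:i + n]) for i in range(len(text) - n + 1)]

sentence = "Professor Keller is not a goth."
print(word_ngrams(sentence, 2))     # (Professor, Keller), (Keller, is), ...
print(char_ngrams("Professor", 3))  # (P, r, o), (r, o, f), (o, f, e), ...
```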


Page 21: Text Classification

Statistical Approach: N-Grams

Mined from 1,024,908,267,229 words of text

Sample 4-grams and their counts:

serve as the infrastructure     500
serve as the initial           5331
serve as the injector            56

Page 22: Text Classification

Statistical Approach: N-Grams

N-gram counts inform some notion of probability.

Normalize frequencies: P(serve as the initial) > P(serve as the injector)

Classification:

P(English | serve as the initial) > P(Spanish | serve as the initial)
P(Spam | serve as the injector) < P(!Spam | serve as the injector)
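
A minimal sketch of turning normalized n-gram counts into a classifier in this spirit (the toy corpora, the trigram choice, and the equal-prior assumption are all illustrative):

```python
import math
from collections import Counter

def char_trigrams(text):
    """Character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Toy training text standing in for real English and Spanish corpora.
corpora = {
    "English": "serve as the initial my name is eric",
    "Spanish": "sirve como el inicial mi nombre es eric",
}
models = {lang: Counter(char_trigrams(text)) for lang, text in corpora.items()}
vocab = set().union(*(set(m) for m in models.values()))

def log_prob(text, lang, alpha=1.0):
    """log P(text | lang) under a bag-of-trigrams model with add-alpha smoothing."""
    counts = models[lang]
    total = sum(counts.values())
    return sum(
        math.log((counts[g] + alpha) / (total + alpha * len(vocab)))
        for g in char_trigrams(text)
    )

def classify(text):
    """Pick the language maximizing P(lang | text), assuming equal priors."""
    return max(models, key=lambda lang: log_prob(text, lang))

print(classify("serve as the injector"))  # English
print(classify("mi nombre"))              # Spanish
```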

Page 23: Text Classification

Statistical Approach: N-Grams

But what about P(serve as the ink)? Is it 0?
If so, then P(serve as the ink) = P(vxvw aooa *%^$) = 0.
And how about P(sevre as the initial)?

Page 24: Text Classification

Statistical Approach: N-Grams

How do we smooth out sparse data?

Additive smoothing
Interpolation
Good-Turing estimate
Backoff
Witten-Bell smoothing
Absolute discounting
Kneser-Ney smoothing

MacCartney

Page 25: Text Classification

Statistical Approach: N-Grams

Additive smoothing: add a small count to every n-gram, so nothing has zero probability

Interpolation: consider smaller n-grams as well, e.g. (serve as the), (serve)

Backoff: use the smaller n-grams only when necessary, i.e. when the full n-gram is unseen

MacCartney
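
A minimal sketch of these three ideas (all function names, counts, and weights are illustrative; real backoff schemes such as Katz or Kneser-Ney also discount seen counts so the probabilities still sum to one):

```python
def additive_prob(count, total, vocab_size, alpha=1.0):
    """Add-alpha smoothing: every n-gram, seen or unseen, gets a nonzero probability."""
    return (count + alpha) / (total + alpha * vocab_size)

def interpolated_prob(p_full, p_shorter, p_word, weights=(0.6, 0.3, 0.1)):
    """Linear interpolation: mix P(serve as the initial) with the estimates
    for smaller n-grams such as (serve as the) and (serve)."""
    a, b, c = weights
    return a * p_full + b * p_shorter + c * p_word

def backoff_prob(count_full, p_full, p_shorter):
    """Backoff: trust the full n-gram if it was seen, otherwise fall back to smaller ones."""
    return p_full if count_full > 0 else p_shorter

# e.g. P(serve as the ink) was never seen, but add-one smoothing keeps it nonzero:
print(additive_prob(count=0, total=1_000_000, vocab_size=50_000))
```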

Page 26: Text Classification

Statistical Approach: Results

Dunning: compared parallel translated texts in English and Spanish

20-character input, 50K training: 92% accurate
500-character input, 50K training: 99.9% accurate

Also modified for comparing DNA sequences of humans, E. coli, and yeast

Page 27: Text Classification

Neural Network Approach

Bengio et al., “A Neural Probabilistic Language Model.” 2003:

N-grams do handle sparse data well. However, there are problems:

Narrow consideration of context (~1–2 words)

Does not consider semantic/grammatical similarity:
“A cat is walking in the bedroom”
“A dog was running in a room”

Page 28: Text Classification

Neural Network Approach

The general idea:

1. Associate with each word in the vocabulary (e.g. size 17,000) a feature vector (30–100 features)

2. Express the joint probability function of word sequences in terms of the feature vectors

3. Learn simultaneously the word feature vectors and the parameters of the probability function
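
A minimal numpy sketch of that idea (the sizes and parameter names are illustrative, and a real model would be trained with backpropagation as in Bengio et al.; this only shows the shape of the computation):

```python
import numpy as np

vocab_size, embed_dim, context, hidden = 17_000, 30, 3, 50

# 1. One feature vector per word in the vocabulary.
C = np.random.randn(vocab_size, embed_dim) * 0.01

# Parameters of the probability function (a small feed-forward network).
H = np.random.randn(hidden, context * embed_dim) * 0.01
U = np.random.randn(vocab_size, hidden) * 0.01
b = np.zeros(vocab_size)

def next_word_distribution(context_word_ids):
    """2. Express P(next word | context) in terms of the words' feature vectors."""
    x = np.concatenate([C[i] for i in context_word_ids])  # concatenated feature vectors
    h = np.tanh(H @ x)                                    # hidden layer
    scores = U @ h + b
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                                # softmax over the vocabulary

# 3. Training would adjust C, H, U, and b together by gradient descent,
#    learning the feature vectors and the probability function jointly.
probs = next_word_distribution([12, 407, 3])      # three arbitrary context word ids
print(probs.shape, round(float(probs.sum()), 3))  # (17000,) 1.0
```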

Page 29: Text Classification

References

Dunning, “Statistical Identification of Language.” 1994.
Bengio et al., “A Neural Probabilistic Language Model.” 2003.
MacCartney, “NLP Lunch Tutorial: Smoothing.” 2005.