Natural language processing with Naive Bayes

DESCRIPTION

A little talk I gave about NLP with Naive Bayes for classification. I used the ideas to build http://twedar.herokuapp.com, and a client-side classifier for Skimlinks.

Natural Language Processing with Naïve Bayes
Tim Ruffles
@timruffles

Overview

● Intro to Natural Language Processing
● Intro to Bayes
● Bayesian Maths
● Bayes applied to Natural Language Processing

NLP

(not like Derren Brown)

Processing text
● Named entity recognition - Skimwords
● Information retrieval - Google
● Information extraction - IBM's Watson
● Interpreting - sentiment, named entities
● Classification - spam vs not spam
● Speech to text - Siri

Named entity recognition

Classification

From: Prime Minister of Nigeria
Subject: Opportunity

Dear Sir,

My country vexes me; I wish to leave. Please give me your bank account information for instantaneous enrichment, no danger to you!

Yours in good faith and honour,
Mr P. Minister

From: sally@gmail.com
Subject: cats

lol this cat is really fat

http://reddit.com/r/fat-cats/roflolcoptor-fat-cat-dancing.gif

spam: 99% ham: 1%

spam: 1% ham: 99%

Example Task

Identify Product References

How do humans do this?

● Algorithms are far dumber than you
● If you don't have enough info, an algorithm will not help
● Anyone can identify features required for natural language processing

Features

The new cameras are the Canon PowerShot S100, the Nikon J1 and the Olympus PEN.

Types of features

● Word shape (capitalization, numbers etc)
● Tag context - near a product
● Dictionary/gazette - list of brands (sketched below)
● Part of speech - verb, noun
● n-grams - products contain only one brand
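
As a rough illustration of the dictionary/gazette feature, here is a minimal Python sketch. The brand list and the name gazette_feature are my own hypothetical examples, not from the talk.

    # Hypothetical gazette: a tiny list of known camera brands.
    BRAND_GAZETTE = {"canon", "nikon", "olympus", "sony"}

    def gazette_feature(word):
        # Dictionary/gazette feature: is this word in our brand list?
        return word.strip(".,").lower() in BRAND_GAZETTE

    sentence = ("The new cameras are the Canon PowerShot S100, "
                "the Nikon J1 and the Olympus PEN.")
    print([w for w in sentence.split() if gazette_feature(w)])
    # ['Canon', 'Nikon', 'Olympus']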

NLP process

The new cameras are Canon's PowerShot S100, the Nikon J1 and the Olympus PEN.

Supervision

The new cameras are [Canon's PowerShot S100], the [Nikon J1] and the [Olympus PEN].

Feature extraction

The new cameras are [Canon's PowerShot S100], the [Nikon J1] and the [Olympus PEN].

Correlate features & tags

Untagged words:
capital in middle of sentence: 0
capital in middle of word: 0
acronyms: 0
words with numbers in them: 0

Tagged product references:
capital in middle of sentence: 7
capital in middle of word: 1
acronyms: 1
words with numbers in them: 2
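
The counts above can be reproduced with some simple word-shape heuristics. This is a minimal sketch using my own approximations of the slide's feature names; the split into untagged and tagged words, and the heuristics themselves, are assumptions.

    # The example sentence split into words outside the product tags
    # ("The" is skipped because it is sentence-initial) and inside them.
    untagged = ["new", "cameras", "are", "the", "and", "the"]
    tagged = ["Canon's", "PowerShot", "S100", "Nikon", "J1", "Olympus", "PEN"]

    def shape_features(words):
        return {
            # capital-initial words that are not sentence-initial
            "capital in middle of sentence":
                sum(1 for w in words if w[0].isupper()),
            # mixed-case words with a capital after the first letter (PowerShot)
            "capital in middle of word":
                sum(1 for w in words
                    if not w.isupper() and any(c.isupper() for c in w[1:])),
            # all-caps alphabetic words of two or more letters (PEN)
            "acronyms":
                sum(1 for w in words
                    if len(w) > 1 and w.isalpha() and w.isupper()),
            # words containing digits (S100, J1)
            "words with numbers in them":
                sum(1 for w in words if any(c.isdigit() for c in w)),
        }

    print(shape_features(untagged))  # all zeros
    print(shape_features(tagged))    # 7, 1, 1, 2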

NLP Overview

Supervision with tagged data

Training up a model

Test model on test set

Model ready to use

Nuts and bolts

Supervision - create a set of labelled data

Normalisation and clean-up (Canon's -> Canon etc)

Feature extraction and training on training set

Validate on test set
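
A minimal sketch of this pipeline, assuming a hypothetical list of labelled examples and a normalise helper of my own naming:

    import random

    # Hypothetical labelled examples produced by the supervision step.
    labelled = [
        ("The new cameras are Canon's PowerShot S100", "product"),
        ("lol this cat is really fat", "not-product"),
        ("the Nikon J1 and the Olympus PEN", "product"),
        ("Lovely to see you last night", "not-product"),
    ]

    def normalise(text):
        # Clean-up, e.g. Canon's -> Canon, and lower-case everything.
        return text.replace("'s", "").lower()

    random.shuffle(labelled)
    split = int(len(labelled) * 0.75)
    training_set = [(normalise(text), tag) for text, tag in labelled[:split]]
    test_set = [(normalise(text), tag) for text, tag in labelled[split:]]
    # Extract features and train on training_set, then validate
    # the model's accuracy on test_set before using it for real.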

How to use features/tags to tag products?

● We need a method for using our correlated feature/tag sets to learn from and make predictions mathematically

● One such method is...

Naïve Bayes

When my information changes, I alter my conclusions.

What do you do, sir?

Keynes

Mathematically updating our beliefs on evidence

Bayes: local hero

Thomas Bayes

An Essay towards solving a Problem in the Doctrine of Chances, 1763

Example applications

● Given a drug test result, how likely is it a person has taken drugs?

● Given these words, how likely is it that this email is spam?

● Given these words, how likely is it they refer to a product?

Estimate

● 99% accurate drug test
● 1% of people actually take drugs

Given the above, what is the probability that someone indicated as drug positive by the test is a drug user?

Place your bets

50%

The Maths

A Little Notation

Probability

0 - Impossible: you'd never bet on it happening

0.5 - Likely as not (evens): the best odds you'd get would be 1/2

1 - Certain: you'd never bet against it

More notation

P(spam)

Probability of spam

P(^spam)

Probability of not spam

P(spam|features)

Probability of spam given some features

A few rules

P(6,6) = P(6)P(6) = 1/6 x 1/6 = 1/36

Probability of rolling 6 twice

P(^6) = 1 - P(6) = 1 - 1/6 = 5/6

The probability of not rolling a six is the complement of rolling a six

Independence

P(A,B) = P(A)P(B)

Only applies if two events are independent.

Events are independent if one having happened has no bearing on how likely it is that the other will.

Dependence is informative

e.g. if someone is paler than normal, they could be sick

P(sick|pale) ≠ P(sick)

if someone fails a drug test, they could be a drug user

P(A|B)?

What is the probability of A, given that B has happened?

Drugs test

● 99% accurate, 1% of people take drugs
● Prior probability that someone is a drug user: 1%
● 1% chance of a false positive

The probability of something not happening is the complement of it happening.

Priors (pre-information)

Prior: drug use

P(drug use) = 0.01 = 1/100 = 1%

Prior: false positive

P(false positive) = 0.01 = 1/100 = 1%

A drug test is asking

P(drug user | positive drug test)

Union of a signal and an event

P(drug user | positive drug test)

P(event | signal)

We can see a signal in at least 2 ways

Can see a positive in 2 ways:

P(drug user, positive drug test)
P(non user, positive drug test)

or a negative in two ways:

P(drug user, negative drug test)
P(non user, negative drug test)

The theorem

The chance of an event given a signal is the ratio of:

the prior probability of the event multiplied by that of seeing the signal given the event

to

all the ways you could see that signal.

The calculation

P(drug user | positive drug test) =

P(drug user) x P(positive drug test | drug user) / P(positive drug test)

Estimate

● 99% accurate drug test
● 1% of people take drugs

Given the above, what is the chance someone failing the drugs test is a drug user?

Place your bets

50%

The calculation

P(drug user | positive drug test) =

(1/100 x 99/100) / P(positive drug test)

P(B)?

P(B)

● All the ways you could see the signal

∑ P(event) x P(signal | event)

(∑ is 'sum of', i.e. add all the things)

P(B)

● In our case there are two possibilities - person is either a drug user or not - we already know the result of the test

P(user) x P(positive | user) + P(clean) x P(positive | clean)

i.e. P(A) x P(B|A) + P(^A) x P(B|^A)

The calculation

P(drug user | positive drug test) =

(1/100 x 99/100) / ((1/100 x 99/100) + (99/100 x 1/100))

The calculation

P(drug user | positive drug test) =

(1 x 99) / ((1 x 99) + (99 x 1))

The calculation

P(drug user | positive drug test) =

99 / (99 x (1 + 1))

The calculation

P(drug user | positive drug test) =

1/2
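
To make the arithmetic concrete, here is a minimal Python check of the same calculation (the variable names are mine); it prints the 50% answer from above.

    # P(user | positive) = P(user) x P(positive | user) / P(positive)
    p_user = 0.01                  # prior: 1% of people take drugs
    p_positive_given_user = 0.99   # 99% accurate test
    p_positive_given_clean = 0.01  # 1% false-positive rate

    # P(positive): all the ways you could see a positive test.
    p_positive = (p_user * p_positive_given_user
                  + (1 - p_user) * p_positive_given_clean)

    print(p_user * p_positive_given_user / p_positive)  # 0.5, give or take float rounding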

Maths applied to NLP

Building a spam filter

● Using what we know about Bayes, we're going to build an NLP spam filter

● We'll use n-grams as our features - the number of times we have seen each word

● A 1-gram is a single word, 2-grams are pairs of words: 2-grams are more accurate but more complex (a 1-gram counting sketch follows below)
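
Here is a minimal sketch of the 1-gram counting step using Python's Counter. The tokenisation (lower-casing and stripping punctuation) is my assumption, so the totals may differ slightly from the slides' 30 and 28.

    from collections import Counter

    spam_email = ("Dear Sir, Give me your bank account. I will transfer "
                  "money from my bank account to your bank account. "
                  "Yours in good faith and honour, Mr P. Minister")
    ham_email = ("Hi, Lovely to see you last night. I'll pay you back for "
                 "the film - just give me your bank account details. "
                 "Cheers, Sally x")

    def one_grams(text):
        # Lower-case, drop punctuation, split on whitespace.
        cleaned = "".join(c for c in text.lower()
                          if c.isalnum() or c.isspace())
        return Counter(cleaned.split())

    spam_counts = one_grams(spam_email)
    ham_counts = one_grams(ham_email)
    print(spam_counts["bank"], ham_counts["bank"])  # 3 1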

1-grams

Dear Sir,

Give me your bank account. I will transfer money from my bank account to your bank account.

Yours in good faith and honour,
Mr P. Minister

Hi,

Lovely to see you last night. I'll pay you back for the film - just give me your bank account details.

Cheers,

Sally x

Spam email 1-grams: bank 3, account 3, your 2, from 1, give 1, dear 1, sir 1, i 1, will 1, transfer 1, money 1, me 1, my 1, to 1 (30 words total)

Ham email 1-grams: you 2, the 1, to 1, see 1, hi 1, last 1, night 1, ill 1, pay 1, back 1, for 1, me 1, your 1, bank 1, account 1, details 1 (28 words total)

1-grams

bank: 3 / 30 = 1/10 of spam words are 'bank'

Give me your bank account. I will transfer money from my bank account to your bank account.

P(bank,bank,bank|spam)

P(bank) * P(bank) * P(bank) = (1/10)³ = 1/1,000

P(bank,bank,bank|ham)

P(bank) * P(bank) * P(bank) = (1/28)³ = 1/21,952

Smoothing 1-grams

24 unique words

Count each word as

(count(word) + smooth) / (countWords + smooth * uniqueWords)

Laplacian smoothing - take a bit of probability away from each of our words to give to words we've not seen before.

Smoothing 1-grams

(count(word) + smooth) / (countWords + smooth * uniqueWords)

P(bank) = (1 + 0.1) / (28 + 0.1 * 24)

P(bank) = 1.1 / 30.4

P(sesquipedalian) = 0.1 / 30.4
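
The same formula as a small Python function (the naming is mine; the 0.1 smooth, 28-word total and 24 unique words follow the slides):

    def smoothed_p(word, counts, total_words, unique_words, smooth=0.1):
        # Laplacian smoothing: every word, seen or unseen, gets `smooth`
        # extra count, so unseen words keep a small non-zero probability.
        return (counts.get(word, 0) + smooth) / (total_words + smooth * unique_words)

    ham_counts = {"bank": 1, "account": 1, "lovely": 1, "film": 1}  # partial counts
    print(smoothed_p("bank", ham_counts, 28, 24))            # 1.1 / 30.4
    print(smoothed_p("sesquipedalian", ham_counts, 28, 24))  # 0.1 / 30.4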

Applied smoothing

P(lovely,film|spam)
P(lovely) = (0 + 0.1) / (30 + 24 * 0.1)
P(lovely) = 0.1 / 32.4

P(lovely) * P(film) = (0.1 / 32.4)²

P(lovely,film|ham)

P(lovely) * P(film) = (1.1 / 30.4)²

(0.1 / 32.4)² < (1.1 / 30.4)²

Smoothed n-grams with Bayes

P(A|B) = P(A)P(B|A) / P(B)

P(spam|words) = P(spam)P(words|spam) / P(words)

P(words) = P(spam)P(words|spam) + P(ham)P(words|ham)

We'll take the product of all of the word probabilities as P(words|tag) for both spam and ham, and choose whichever tag has the highest P(tag|words).
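
Putting the pieces together, here is a minimal sketch of that decision rule. It compares log-probabilities rather than raw products (a common trick to avoid underflow, not something the slides do); the function names are mine.

    import math

    def log_score(words, counts, prior, unique_words, smooth=0.1):
        # log P(tag) + sum of log P(word | tag), with Laplacian smoothing.
        total_words = sum(counts.values())
        score = math.log(prior)
        for word in words:
            p = ((counts.get(word, 0) + smooth)
                 / (total_words + smooth * unique_words))
            score += math.log(p)
        return score

    def classify(words, spam_counts, ham_counts, unique_words):
        # Priors of 0.5 each, as in the slides; P(words) is the same for
        # both tags, so we just pick the higher score.
        spam = log_score(words, spam_counts, 0.5, unique_words)
        ham = log_score(words, ham_counts, 0.5, unique_words)
        return "spam" if spam > ham else "ham"

    # e.g. classify(["bank", "bank", "bank"], spam_counts, ham_counts, 24)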

Applied smoothing

Priors: ham(0.5) spam(0.5)
P(spam)P(words|spam) / P(words)
P(ham)P(words|ham) / P(words)

Email one (spam):
P(bank,bank,bank|spam)  8.75e-04  0.39   39%
P(bank,bank,bank|ham)   4.73e-05  0.02   2%

Email two (ham):
P(lovely,film|spam)  9.52e-06  0.004  0.4%
P(lovely,film|ham)   1.3e-03   0.58   58%

Summary

● NLP uses features of language to statistically classify, interpret or generate language.

● Bayes' rule is a mathematical method for updating your beliefs on evidence

● P(event|signal) = P(event)P(signal|event) / P(signal)

● Smoothed n-grams make for a dumb but simple spam filter

● Naïve Bayes shouldn't work: but does

Recommended