STATISTICAL APPROACH TO CLASSIFICATION: Naïve Bayes Classifier

Bayesian Classifier 2

Naïve Bayes Classifier. Sensors, scales, etc. measure a fruit, producing a feature vector (Red = 2.125, Yellow = 6.143, Mass = 134.32, Volume = 24.21) that the classifier maps to a label such as "Apple".



Bayesian Classifier 3

[Figure: "Distribution of Redness Values" — density vs. redness, one curve per fruit: apples, peaches, oranges, lemons.]

Redness: let's look at one dimension. For a given redness value, which is the most probable fruit?

Bayesian Classifier 4

Redness. What if we wanted to ask the question: "What is the probability that some fruit with a given redness value is an apple?"

[Figure: "Distribution of Redness Values" — the same density plot of redness for apples, peaches, oranges, and lemons.]

Could we just look at how far away it is from the apple peak? Is it the highest PDF above the x-value in question?

Bayesian Classifier 5

[Figure: "Redness of Apples and Oranges" — histogram of counts vs. redness.]

Probability it's an apple. If a fruit has a redness of 4.05, do we know the probability that it's an apple? What do we know?

• We know the total number of fruit at that redness: 10 + 25 = 35.
• We know the number of apples at that redness: 10.
• If it is a histogram of counts, then it is straightforward:
  • Probability it's an apple: 10/35 ≈ 28.57%.
  • Probability it's an orange: 25/35 ≈ 71.43%.
• Getting the probability is simple.
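A minimal sketch of this counting argument in Python (the variable names are illustrative, not from the slides):

```python
# With 10 apples and 25 oranges in the redness-4.05 bin, the class
# probability is just a ratio of counts.
counts = {"apple": 10, "orange": 25}
total = sum(counts.values())  # 35 fruit in the bin

for fruit, n in counts.items():
    print(f"P({fruit} | redness=4.05) = {n}/{total} = {n / total:.4f}")
# P(apple | redness=4.05) = 10/35 = 0.2857
# P(orange | redness=4.05) = 25/35 = 0.7143
```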

Bayesian Classifier 6

But what if we are working with a PDF?

A probability density function is continuous: it gives a density, not a count. We might be tempted to use the same approach.

[Figure: "Distribution of Redness Values" — the density plot of redness for apples, peaches, oranges, and lemons again.]

P(a fruit with redness 4.05 is an apple) = ?

Parametric (a distribution described by parameters, e.g. a Gaussian's mean and variance) vs. non-parametric.
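To illustrate the parametric option, a small hedged sketch: fit a Gaussian to apple redness values and evaluate its density at 4.05 instead of reading a histogram. The data points are made-up placeholders, not the slides' data:

```python
import math

# Fit a Gaussian (mean, sample variance) to toy apple-redness samples.
redness = [3.8, 4.1, 4.3, 3.9, 4.6, 4.0]
mu = sum(redness) / len(redness)
var = sum((x - mu) ** 2 for x in redness) / (len(redness) - 1)

# Evaluate the Gaussian PDF at redness 4.05: a density, not a probability.
pdf_at_405 = math.exp(-(4.05 - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
print(mu, var, pdf_at_405)
```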

Bayesian Classifier 7

Problem

What if we had a trillion oranges and only 100 apples? Redness 4.05 might be the most common apple redness, and the apple PDF might have a higher value at 4.05 than the orange PDF, even though the universe would contain far more oranges at that value.

[Figure: "Distribution of Redness Values" — the same density plot.]

That wouldn't change the PDFs, but…

Bayesian Classifier 8

[Figure: "Redness of Apples and Oranges" — the histogram of counts vs. redness again.]

Let's revisit, but using probabilities instead of counts: 2506 apples and 2486 oranges. If a fruit has a redness of 4.05, do we know the probability that it's an apple if we don't have the specific counts at 4.05?

Conditional probability. If we know it is an apple, then we can ask how likely a redness of 4.05 is: P(redness = 4.05 | apple). But what we want is the reverse: P(apple | redness = 4.05).

Bayesian Classifier 9

Bayes Theorem

From the book, where h is a hypothesis and D is the training data:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

$$P(\mathit{apple} \mid \mathit{redness}=4.05) = \frac{P(\mathit{redness}=4.05 \mid \mathit{apple})\,P(\mathit{apple})}{P(\mathit{redness}=4.05)}$$

Does this make sense?

Bayesian Classifier 10

[Figure: "Redness of Apples and Oranges" — the counts histogram again.]

Make sense? 2506 apples, 2486 oranges.

• Probability that redness would be 4.05 if we know it's an apple: about 10/2506.
• P(apple)? 2506/(2506 + 2486).
• P(redness = 4.05)? About (10 + 25)/(2506 + 2486).

$$P(\mathit{apple} \mid \mathit{redness}=4.05) = \frac{P(\mathit{redness}=4.05 \mid \mathit{apple})\,P(\mathit{apple})}{P(\mathit{redness}=4.05)} = \frac{\dfrac{10}{2506} \cdot \dfrac{2506}{2506+2486}}{\dfrac{10+25}{2506+2486}} = \frac{10}{10+25}$$
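A quick sketch verifying that the Bayes-theorem route reproduces the direct bin ratio (counts taken from the slide):

```python
n_apples, n_oranges = 2506, 2486
apples_in_bin, oranges_in_bin = 10, 25
n_total = n_apples + n_oranges

likelihood = apples_in_bin / n_apples                  # P(redness=4.05 | apple)
prior = n_apples / n_total                             # P(apple)
evidence = (apples_in_bin + oranges_in_bin) / n_total  # P(redness=4.05)

posterior = likelihood * prior / evidence              # P(apple | redness=4.05)
print(posterior)  # 0.2857..., the same as 10 / (10 + 25)
```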

Bayesian Classifier 11

We can find the probability whether we have counts or a PDF. How do we classify? Simply find the most probable class:

$$h = \operatorname*{argmax}_{h \in H} P(h \mid D)$$

Bayesian Classifier 12

Bayes

I think of the ratio of P(h) to P(D) as an adjustment to the easily determined P(D|h), accounting for differences in sample size:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

P(h) is the prior probability (the "prior"); P(h|D) is the posterior probability.

Bayesian Classifier 13

MAP: Maximum a posteriori hypothesis

Pronounced ä-(ˌ)pō-ˌstir-ē-ˈor-ē. A posteriori: relating to or derived by reasoning from observed facts; inductive. A priori: relating to or derived by reasoning from self-evident propositions; deductive.

Approach: the brute-force MAP learning algorithm:

$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(h \mid D)$$
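A minimal sketch of the brute-force MAP idea, assuming a finite hypothesis space and callables for the likelihood and prior (both are hypothetical placeholders, not from the slides):

```python
def map_hypothesis(H, likelihood, prior):
    """Return the h in H maximizing P(D|h) * P(h).

    P(D) is omitted: it is the same for every h,
    so it cannot change which hypothesis wins.
    """
    return max(H, key=lambda h: likelihood(h) * prior(h))
```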

Bayesian Classifier 14

More is better

[Figure: scatter plot of Red Intensity (normalized) vs. Mass (normalized).]

More dimensions can be helpful.

Linearly separable.

Bayesian Classifier 15

What if some of the dimensions disagree?

Color (red and yellow) says apple, but mass and volume say orange? Take a vote?

How do we handle multiple dimensions?

Bayesian Classifier 16

We can cheat: assume each dimension is independent (doesn't co-vary with any other dimension). Then we can use the product rule. The probability that a fruit is an apple given a set of measurements (dimensions) is proportional to:

$$P(h \mid D_{\mathit{red}}) \cdot P(h \mid D_{\mathit{yellow}}) \cdot P(h \mid D_{\mathit{volume}}) \cdot P(h \mid D_{\mathit{mass}})$$

Bayesian Classifier 17

Naïve Bayes Classifier

This is known as a Naïve Bayes Classifier, where v_j is a class and a_i is an attribute. Derivation:

$$v_{NB} = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

Where is the denominator? (P(D) is the same for every class, so dropping it cannot change the argmax.)
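A hedged sketch of the v_NB rule, assuming the priors and conditional probabilities have already been estimated (the dictionary layout is an illustrative choice, not the slides' code):

```python
import math

def classify(instance, priors, cond):
    """Pick the class v maximizing P(v) * prod_i P(a_i | v).

    priors: {class: P(v)}; cond: {(i, value, class): P(a_i | v)}.
    Log-probabilities are used to avoid floating-point underflow.
    """
    def log_score(v):
        return math.log(priors[v]) + sum(
            math.log(cond[(i, a, v)]) for i, a in enumerate(instance))
    return max(priors, key=log_score)
```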

Bayesian Classifier 18

Example

You wish to classify an instance with the following attributes: 1.649917, 5.197862, 134.898820, 16.137695. The first column is redness, then yellowness, followed by mass, then volume.

• The training data has, in the redness histogram bin in which the instance falls, 0 apples, 0 peaches, 9 oranges, and 22 lemons.
• In the bin for yellowness there are 235, 262, 263, and 239.
• In the bin for mass there are 106, 176, 143, and 239.
• In the bin for volume there are 3, 57, 7, and 184.

What is the probability that it is an • apple • peach • orange • lemon?

Bayesian Classifier 19

Solution

Bin counts:

          Red   Yellow   Mass   Vol
Apples      0      235    106     3
Peaches     0      262    176    57
Oranges     9      263    143     7
Lemons     22      239    239   184
Total      31      999    664   251

Per-bin fractions and their product:

          Red    Yellow   Mass   Vol    Product
Apples    0      0.24     0.16   0.01   0
Peaches   0      0.26     0.27   0.23   0
Oranges   0.29   0.26     0.22   0.03   0.0005
Lemons    0.71   0.24     0.36   0.73   0.0448
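A sketch reproducing the table from the raw bin counts (note how the zero redness counts zero out the apple and peach products):

```python
import math

bins = {  # class -> counts in the (red, yellow, mass, vol) bins
    "apple":  [0, 235, 106, 3],
    "peach":  [0, 262, 176, 57],
    "orange": [9, 263, 143, 7],
    "lemon":  [22, 239, 239, 184],
}
totals = [sum(c[i] for c in bins.values()) for i in range(4)]  # [31, 999, 664, 251]

for fruit, counts in bins.items():
    fracs = [counts[i] / totals[i] for i in range(4)]
    print(fruit, [round(f, 2) for f in fracs], f"{math.prod(fracs):.4f}")
# orange -> [0.29, 0.26, 0.22, 0.03] 0.0005
# lemon  -> [0.71, 0.24, 0.36, 0.73] 0.0448
```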

Bayesian Classifier 20

Zeros

Is it really a zero percent chance that it's an apple? Are these really probabilities (hint: 0.0005 + 0.0448 does not equal 1)? What of the bin size?

(The counts and fractions table from the previous slide is repeated here.)

Bayesian Classifier 21

Zeros

Estimating probabilities: n_c / n is an estimate of the probability, where n is the number of training examples and n_c the number of them falling in the bin. The m-estimate replaces it with

$$\frac{n_c + m\,p}{n + m}$$

The choice of m is often some upper bound on n, and p is often 1/m. This ensures the numerator is at least 1 and never zero (with p = 1/m, the numerator is n_c + 1). The denominator starts at the upper bound and goes up to at most twice it. The ordering of the classes is not lost; would-be zeros just become very small.
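A minimal sketch of the m-estimate (the function name is mine):

```python
def m_estimate(n_c, n, m, p=None):
    """(n_c + m*p) / (n + m); with p = 1/m the numerator is n_c + 1."""
    if p is None:
        p = 1.0 / m
    return (n_c + m * p) / (n + m)

# The empty redness bin no longer zeroes out the apple's product:
print(m_estimate(n_c=0, n=31, m=31))  # 1/62 ~ 0.016 instead of 0.0
```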

Bayesian Classifier 22

Curse of dimensionality

Do too many dimensions hurt?

What if only some dimensions contribute to the ability to classify? What would the other dimensions do to the probabilities?

Bayesian Classifier 23

All about representation

With imagination and innovation, you can learn to classify many things you wouldn't expect.

What if you wanted to learn to classify documents? How might you go about it?

Bayesian Classifier 24

Example: learning to classify text

• Collect all the words in the examples.
• Calculate P(v_j) and P(w_k | v_j).
• Each instance will be a vector of size |vocabulary|.
• The classes (the v's) are the categories; each word (w) is a dimension.

$$v_{NB} = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_{i \in \mathit{Positions}} P(a_i \mid v_j)$$
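A hedged sketch of this text model: one shared P(w|v) table across positions, with add-one smoothing standing in for the m-estimate (an assumption, not the slides' choice):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_words, class) pairs."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: n / len(docs) for c, n in class_counts.items()}

    def cond(w, c):  # P(w | c) with add-one smoothing over the vocabulary
        total = sum(word_counts[c].values())
        return (word_counts[c][w] + 1) / (total + len(vocab))

    return priors, cond

def classify(words, priors, cond):
    return max(priors, key=lambda c: math.log(priors[c])
               + sum(math.log(cond(w, c)) for w in words))
```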

Bayesian Classifier 25

Paper

20 Newsgroups: 1000 training documents from each group, with the groups as the classes. 89% classification accuracy: 89 out of every 100 times, the classifier could tell which newsgroup a document came from.

Bayesian Classifier 26

Another example: RNA

Rift Valley fever virus. Its genome is RNA (like DNA but with an extra oxygen; the D in DNA stands for deoxy). The RNA is encapsulated in a protein sheath, and an important protein involved in the encapsulation process is the nucleocapsid protein.

Bayesian Classifier 27

SELEX

SELEX (Systematic Evolution of Ligands by Exponential Enrichment): identify RNA segments that have a high affinity for the nucleocapsid (aptamer vs. non-aptamer).

Bayesian Classifier 28

Could we build a classifier?

Each known aptamer was 30 nucleotides long: a 30-character string over the 4 nucleotides (ACGU). What would the data look like? How would we "bin" the data?
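One natural binning, sketched: treat each of the 30 positions as a discrete attribute over the four nucleotides, so the training data becomes per-class counts of which letter appears at which position (the helper below is illustrative):

```python
from collections import Counter

def position_counts(sequences, length=30):
    """sequences: 30-character ACGU strings of one class (e.g. aptamers).

    Returns one Counter per position: how often A, C, G, U occur there.
    """
    return [Counter(seq[i] for seq in sequences) for i in range(length)]

aptamers = ["ACGUACGUACGUACGUACGUACGUACGUAC"]  # toy placeholder data
print(position_counts(aptamers)[0])  # Counter({'A': 1})
```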

Bayesian Classifier 29

Discrete or real valued?

We have seen the fruit example, documents, and RNA (nucleotides). Which representation is best for a Bayesian classifier: integers, strings, or floating point?

Bayesian Classifier 30

Results

Bayesian Classifier 31

Gene Expression Experiments

The brighter the spot, the greater the mRNA concentration

Bayesian Classifier 32

Can we use expression profiles to detect disease?

Thousands of genes (dimensions). Many genes are not affected (the disease and normal distributions are the same in that dimension).

         g1     g2     g3    …    gn     disease
p1      x1,1   x1,2   x1,3   …   x1,n      Y
p2      x2,1   x2,2   x2,3   …   x2,n      N
⋮
pm      xm,1   xm,2   xm,3   …   xm,n      ?

(Rows are patients, columns are genes; the last column is the disease label.)
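Since expression levels are continuous, one hedged option (an assumption on my part; the slide doesn't prescribe it) is a Gaussian likelihood per gene per class:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_posterior(profile, prior, means, sds):
    """Unnormalized log P(class | profile) for one patient's expression
    values x_1..x_n, assuming independent Gaussian genes per class."""
    return math.log(prior) + sum(
        math.log(gaussian_pdf(x, m, s))
        for x, m, s in zip(profile, means, sds))
```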

Bayesian Classifier 33

Rare Moss Growth Conditions

Perhaps measured at good growth locations: pH, average temperature, average sunlight exposure, salinity, average length of day. What else? What would the data look like?

Bayesian Classifier 34

Proof (taken from "Pattern Recognition", third edition, by Sergios Theodoridis and Konstantinos Koutroumbas)

The Bayesian classifier is optimal with respect to minimizing the classification error probability.

Proof: let R1 be the region of the feature space in which we decide in favor of ω1, and R2 the corresponding region for ω2. Then an error is made if x ∈ R2 although it belongs to ω1, or if x ∈ R1 although it belongs to ω2:

$$P_e = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2)$$


Bayesian Classifier 35

Proof (continued). Writing the joint probabilities in terms of the class-conditional densities and priors:

$$P_e = P(\omega_1)\int_{R_2} p(x \mid \omega_1)\,dx + P(\omega_2)\int_{R_1} p(x \mid \omega_2)\,dx$$

Using Bayes' rule, $P(\omega_i)\,p(x \mid \omega_i) = P(\omega_i \mid x)\,p(x)$:

$$P_e = \int_{R_2} P(\omega_1 \mid x)\,p(x)\,dx + \int_{R_1} P(\omega_2 \mid x)\,p(x)\,dx$$

It is now easy to see that the error is minimized if the partitioning regions R1 and R2 of the feature space are chosen so that:

$$R_1:\; P(\omega_1 \mid x) > P(\omega_2 \mid x), \qquad R_2:\; P(\omega_2 \mid x) > P(\omega_1 \mid x)$$

Bayesian Classifier 36

Proof (continued). Indeed, since the union of the regions R1, R2 covers all the space, from the definition of a probability density function we have that

$$\int_{R_1} P(\omega_1 \mid x)\,p(x)\,dx + \int_{R_2} P(\omega_1 \mid x)\,p(x)\,dx = P(\omega_1)$$

Combining this with the expression for P_e:

$$P_e = P(\omega_1) - \int_{R_1} \big( P(\omega_1 \mid x) - P(\omega_2 \mid x) \big)\,p(x)\,dx$$

This suggests that the probability of error is minimized if R1 is the region of space in which $P(\omega_1 \mid x) > P(\omega_2 \mid x)$. Then R2 becomes the region where the reverse is true.
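A quick simulation consistent with the proof, under assumed two-Gaussian classes with equal priors (everything here is illustrative, not from the book):

```python
import random

def sample():
    w = random.choice([1, 2])                     # true class
    x = random.gauss(0.0 if w == 1 else 2.0, 1.0) # class-conditional draw
    return x, w

def bayes_decision(x):
    # Equal priors and equal variances: comparing posteriors reduces to
    # picking the nearer class mean, i.e. a threshold at the midpoint 1.0.
    return 1 if x < 1.0 else 2

trials = 100_000
errors = sum(bayes_decision(x) != w for x, w in (sample() for _ in range(trials)))
print(errors / trials)  # close to the Bayes error; any other threshold does worse
```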

Bayesian Classifier 37