
Lesson 7.1 Naive Bayes Classifier


The objective of this lesson is to introduce the popular Naive Bayes classification model. The list of references used in the preparation of these slides can be found at the course website page: https://sites.google.com/site/gladyscjaprendizagem/program/bayesian-network-classifiers


Page 1: Lesson 7.1  Naive Bayes Classifier


Naive Bayes

Gladys Castillo

University of Aveiro

Page 2: Lesson 7.1  Naive Bayes Classifier


The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Given: a dataset D with N labeled examples of <X, C>:
D = {<x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)>}

Build: a classifier, a hypothesis hC: X → C that can correctly predict the class labels of new objects.

Learning Phase: a supervised learning algorithm takes the dataset D as input and outputs the hypothesis hC.

Classification Phase: given the attribute values of a new example x(N+1) as input, the class attached to it is c(N+1) = hC(x(N+1)) ∈ C.

[Figure: the training dataset D, with rows <x(i), c(i)> over the attributes X1, …, Xn and the class C, and a scatter plot of the examples over the attributes X1 and X2, separating the classes "give credit" and "don't give credit".]

Page 3: Lesson 7.1  Naive Bayes Classifier


Statistical Classifiers

Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is defined by its probability density function f(x).

Statistical classifiers give the probability P(cj | x) that x belongs to a particular class, rather than a simple classification: instead of having the map X → C, we have X → P(C | X).

[Figure: probability density function of a random variable and a few observations.]

The class c* attached to an example is the class with the highest posterior probability P(cj | x).

Page 4: Lesson 7.1  Naive Bayes Classifier


Bayesian Classifiers

Bayesian because the class c* attached to an example x is determined by Bayes' Theorem, the main tool in Bayesian inference. From

P(X, C) = P(X | C) P(C) and P(X, C) = P(C | X) P(X)

we obtain

$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}$$

We can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.

Page 5: Lesson 7.1  Naive Bayes Classifier


Bayes Theorem Example

Given:

A doctor knows that meningitis causes stiff neck 50% of the time

Prior probability of any patient having meningitis is 1/50,000

Prior probability of any patient having stiff neck is 1/20

If a patient has stiff neck, what's the probability he/she has meningitis?

$$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$

In general, posterior ∝ prior × likelihood:

$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}$$

where P(X | C) is the likelihood and P(C) is the prior.
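The same arithmetic as a minimal Python sketch:

```python
# Bayes' theorem for the meningitis example: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5      # likelihood: P(stiff neck | meningitis)
p_m = 1 / 50000        # prior: P(meningitis)
p_s = 1 / 20           # prior: P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)     # 0.0002
```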

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.

Page 6: Lesson 7.1  Naive Bayes Classifier

Bayesian Classifier

How to determine P(cj | x) for each class cj? By Bayes' Theorem:

$$P(c_j \mid \mathbf{x}) = \frac{P(c_j)\, P(\mathbf{x} \mid c_j)}{P(\mathbf{x})}$$

P(x) can be ignored because it is the same for all the classes (a normalization constant). This gives the Bayesian Classification Rule, a maximum a posteriori classification: classify x into the class with the highest posterior probability,

$$h^*_{Bayes}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j)\, P(\mathbf{x} \mid c_j)$$

Page 7: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

"Bayes" because the class c* attached to an example x is determined by Bayes' Theorem:

$$h^*_{Bayes}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j)\, P(\mathbf{x} \mid c_j)$$

When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.

"Naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. P(x | cj) can then be decomposed into a product of n terms, one term for each attribute, giving the NB Classification Rule:

$$h^*_{NB}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j) \prod_{i=1}^{n} P(X_i = x_i \mid c_j)$$
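The rule translates directly into a few lines of code. A minimal sketch, where `priors` and `likelihood` stand for the estimates produced by the learning phase described next:

```python
import math

def nb_classify(x, priors, likelihood):
    """h_NB(x) = argmax_j P(c_j) * prod_i P(X_i = x_i | c_j).

    x          -- dict mapping attribute name to observed value
    priors     -- dict mapping class label to P(c_j)
    likelihood -- function (attr, value, c) -> P(X_i = x_i | c_j)
    """
    def score(c):
        return priors[c] * math.prod(likelihood(a, v, c) for a, v in x.items())
    return max(priors, key=score)
```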

Page 8: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes (NB) Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:

$$\hat{P}(c_j) = \frac{N_j}{N}$$

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj.

Xi discrete:

$$\hat{P}(X_i = x_k \mid c_j) = \frac{N_{ijk}}{N_j}$$

Xi continuous, two options:
- The attribute is discretized and then treated as a discrete attribute.
- A Normal distribution is usually assumed:

$$P(X_i = x_k \mid c_j) = g(x_k; \mu_{ij}, \sigma_{ij}), \qquad g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where:
Nj — number of examples of the class cj
Nijk — number of examples of the class cj having the value xk for the attribute Xi
The mean μij and the standard deviation σij are estimated from D.
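A sketch of this learning phase in Python (assuming complete data; the dict-based representation is an illustration, not from the slides). It uses relative frequencies for discrete attributes and the sample mean and standard deviation for continuous ones:

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(examples, discrete, continuous):
    """examples: list of (attribute_dict, class_label) pairs.

    Returns:
      priors -- P^(c_j) = N_j / N
      disc   -- (attr, class) -> {value: N_ijk / N_j}
      gauss  -- (attr, class) -> (mean, sample standard deviation)
    """
    n = len(examples)
    class_counts = Counter(c for _, c in examples)
    priors = {c: nj / n for c, nj in class_counts.items()}

    counts = defaultdict(Counter)     # (attr, class) -> value counters N_ijk
    observed = defaultdict(list)      # (attr, class) -> continuous values
    for x, c in examples:             # one scan through the data
        for a in discrete:
            counts[(a, c)][x[a]] += 1
        for a in continuous:
            observed[(a, c)].append(x[a])

    disc = {key: {v: nijk / class_counts[key[1]] for v, nijk in cnt.items()}
            for key, cnt in counts.items()}
    gauss = {}
    for key, xs in observed.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
        gauss[key] = (mu, math.sqrt(var))
    return priors, disc, gauss
```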

Page 9: Lesson 7.1  Naive Bayes Classifier

Continuous Attributes: Normal or Gaussian Distribution

Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj. A Normal distribution is usually assumed:

Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D:

$$P(X_i = x_k \mid c_j) = g(x_k; \mu_{ij}, \sigma_{ij}), \qquad g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The probability density function f(x) is symmetrical around its mean.

[Figure: density curve of N(0, 2).]

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by f(66) = g(66; 74, 6) = 0.0273.
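A minimal sketch of g and of the example value:

```python
import math

def g(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(round(g(66, 74, 6), 4))   # 0.0273, for X ~ N(74, 36), i.e. sigma = 6
```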

Page 10: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes Probability Estimates

Binary classification problem, example from John & Langley (1995). Two classes: + (positive), − (negative). Two attributes: X1, discrete, which takes values a and b; X2, continuous.

Training dataset D:

Class  X1   X2
  +    a    1.0
  +    b    1.2
  +    a    3.0
  −    b    4.4
  −    b    4.5

1. Estimate P(cj) for each class cj:

P̂(C = +) = 3/5    P̂(C = −) = 2/5

2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj (X1 discrete):

P̂(X1 = a | +) = 2/3    P̂(X1 = b | +) = 1/3
P̂(X1 = a | −) = 0/2    P̂(X1 = b | −) = 2/2

For the continuous attribute X2, a Normal distribution is assumed, with mean and standard deviation estimated from D:

P(X2 = x | +) = g(x; 1.73, 1.10), i.e. μ2+ = 1.73, σ2+ = 1.10
P(X2 = x | −) = g(x; 4.45, 0.07), i.e. μ2− = 4.45, σ2− = 0.07

Page 11: Lesson 7.1  Naive Bayes Classifier

Probability Estimates: Discrete Attributes

Training set (adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining). Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class:

Tid  Refund  Marital Status  Taxable Income  Evade
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes

For each class, P̂(cj) = Nj / N:

P(No) = 7/10
P(Yes) = 3/10

For each attribute value and class, P̂(Xi = xk | cj) = Nijk / Nj. To compute Nijk we count the number of examples of the class cj having the value xk for the attribute Xi. Examples:

P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0

Page 12: Lesson 7.1  Naive Bayes Classifier

Probability Estimates: Continuous Attributes

For each attribute-class pair (Xi, cj), assume a Normal distribution:

$$P(X_i = x_k \mid c_j) = g(x_k; \mu_{ij}, \sigma_{ij}) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}}\, e^{-\frac{(x_k-\mu_{ij})^2}{2\sigma_{ij}^2}}$$

Example for (Income, Evade = No), using the training set on the previous slide. If Evade = No:
sample mean = 110
sample standard deviation = 54.54

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)}\, e^{-\frac{(120-110)^2}{2(2975)}} = 0.0072$$
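The same estimate checked with a short sketch (income values taken from the Evade = No rows of the table, in thousands):

```python
import math

incomes_no = [125, 100, 70, 120, 60, 220, 75]   # Taxable Income where Evade = No
mu = sum(incomes_no) / len(incomes_no)           # 110.0
var = sum((x - mu) ** 2 for x in incomes_no) / (len(incomes_no) - 1)
sigma = math.sqrt(var)                           # 54.54 (var = 2975)

p = math.exp(-(120 - mu) ** 2 / (2 * var)) / (math.sqrt(2 * math.pi) * sigma)
print(round(p, 4))                               # 0.0072
```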

Page 13: Lesson 7.1  Naive Bayes Classifier


The Balance-scale Problem

Balance-scale dataset from the UCI repository. The dataset was generated to model psychological experimental results.

Each example has 4 numerical attributes:
- the left weight (Left_W)
- the left distance (Left_D)
- the right weight (Right_W)
- the right distance (Right_D)

Each example is classified into one of 3 classes, according to whether the balance scale:
- tips to the right (Right)
- tips to the left (Left)
- is balanced (Balanced)

3 target rules:
- If LD × LW > RD × RW → tips to the left
- If LD × LW < RD × RW → tips to the right
- If LD × LW = RD × RW → it is balanced

[Figure: a balance scale with weights Left_W, Right_W placed at distances Left_D, Right_D.]

Page 14: Lesson 7.1  Naive Bayes Classifier

The Balance-scale Problem

Balance-Scale dataset (adapted from © João Gama's slides "Aprendizagem Bayesiana" [Bayesian Learning]):

Left_W  Left_D  Right_W  Right_D  Class
  1       5       4        2      Right
  2       5       3        2      Left
  3       4       6        2      Balanced
 ...     ...     ...      ...     ...

Discretization is applied: each attribute is mapped to 5 intervals.

Page 15: Lesson 7.1  Naive Bayes Classifier

The Balance-scale Problem: Learning Phase

From the 565 training examples we build the class counters and a contingency table for each attribute.

Class Counters
Left: 260   Balanced: 45   Right: 260   Total: 565

Attribute: Left_W
Class      I1  I2  I3  I4  I5
Left       14  42  61  71  72
Balanced   10   8   8  10   9
Right      86  66  49  34  25

Attribute: Left_D
Class      I1  I2  I3  I4  I5
Left       16  38  59  70  77
Balanced    8  10   9  10   8
Right      90  57  49  37  27

Attribute: Right_W
Class      I1  I2  I3  I4  I5
Left       87  63  49  33  28
Balanced    8  10  10   9   8
Right      16  37  58  70  79

Attribute: Right_D
Class      I1  I2  I3  I4  I5
Left       91  65  44  35  25
Balanced    8  10   8  10   9
Right      17  37  57  67  82

Assuming complete data, the computation of all the required estimates requires a single scan through the data, an operation of time complexity O(N × n), where N is the number of training examples and n is the number of attributes.
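A sketch of the single scan that builds these counters (the nested-dict layout is an illustration):

```python
from collections import defaultdict

def build_counters(examples, attributes):
    """One scan through D, O(N * n): class counters N_j and,
    for each attribute, the contingency counters N_ijk."""
    class_counts = defaultdict(int)
    tables = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}
    for x, c in examples:
        class_counts[c] += 1
        for a in attributes:
            tables[a][c][x[a]] += 1
    return class_counts, tables
```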

Page 16: Lesson 7.1  Naive Bayes Classifier

The Balance-scale Problem: Classification Phase

How does NB classify this example?

Left_W  Left_D  Right_W  Right_D  Class
  1       5       4        2        ?

We need to estimate the posterior probabilities P(cj | x) for each class. The class counters and contingency tables are used to compute them:

P(cj | x) ∝ P(cj) × P(Left_W=1 | cj) × P(Left_D=5 | cj) × P(Right_W=4 | cj) × P(Right_D=2 | cj),  cj ∈ {Left, Balanced, Right}

$$h^*_{NB}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j) \prod_{i=1}^{n} P(X_i = x_i \mid c_j)$$

The class attached to the example is the one with the highest posterior probability:

P(Left | x)  P(Balanced | x)  P(Right | x)
0.277796     0.135227         0.586978      → Class = Right
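A sketch of this step on top of the counters built above, normalizing the scores over the classes (assuming attribute values 1..5 map to the intervals I1..I5; the exact posteriors depend on the discretization actually used):

```python
def posteriors(x, class_counts, tables):
    """Normalized P(c_j | x) from class counters and contingency tables."""
    n = sum(class_counts.values())
    scores = {}
    for c, n_j in class_counts.items():
        s = n_j / n                        # prior P(c_j)
        for a, v in x.items():
            s *= tables[a][c][v] / n_j     # P(X_i = x_i | c_j)
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()} if z > 0 else scores

x = {"Left_W": "I1", "Left_D": "I5", "Right_W": "I4", "Right_D": "I2"}
# max(posteriors(x, class_counts, tables), key=...) -> "Right"
```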

Page 17: Lesson 7.1  Naive Bayes Classifier

Iris Dataset

Iris dataset from the UCI repository, introduced by the statistician R. A. Fisher.

Three flower types (classes):
- Setosa
- Virginica
- Versicolour

Four continuous attributes:
- Sepal width and length
- Petal width and length

[Photo: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]

Page 18: Lesson 7.1  Naive Bayes Classifier


Iris Dataset

Page 19: Lesson 7.1  Naive Bayes Classifier


Iris Dataset

Page 20: Lesson 7.1  Naive Bayes Classifier

Scatter Plot Array of Iris Attributes

[Figure: pairwise scatter plots of the four Iris attributes, colored by class.]

The attributes petal width and petal length provide a moderate separation of the Iris species.

Page 21: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

Normal probability density functions for the attribute PetalWidth: P(PetalWidth | Setosa), P(PetalWidth | Versicolor) and P(PetalWidth | Virginica), with parameters:

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275

Page 22: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

NB Classifier Model:

Class Iris-setosa (0.327):
  sepallength  mean: 5.006, standard deviation: 0.352
  sepalwidth   mean: 3.418, standard deviation: 0.381
  petallength  mean: 1.464, standard deviation: 0.174
  petalwidth   mean: 0.244, standard deviation: 0.107

Class Iris-versicolor (0.327):
  sepallength  mean: 5.936, standard deviation: 0.516
  sepalwidth   mean: 2.770, standard deviation: 0.314
  petallength  mean: 4.260, standard deviation: 0.470
  petalwidth   mean: 1.326, standard deviation: 0.198

Class Iris-virginica (0.327):
  sepallength  mean: 6.588, standard deviation: 0.636
  sepalwidth   mean: 2.974, standard deviation: 0.322
  petallength  mean: 5.552, standard deviation: 0.552
  petalwidth   mean: 2.026, standard deviation: 0.275

Page 23: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

Classification Phase

[Figure: classified examples with their per-class posterior values; the predicted class is the one with the maximum value.]

Page 24: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

How does NB classify this example? Estimate P(cj | x) for each class:

P(cj | x) ∝ P(cj) × P(sepalLength = 5 | cj) × P(sepalWidth = 3 | cj) × P(petalLength = 2 | cj) × P(petalWidth = 2 | cj),  cj ∈ {setosa, versicolor, virginica}

The class attached to the example is the one with the highest posterior probability:

P(setosa | x)  P(versicolor | x)  P(virginica | x)
0              0.995              0.005             → Class = versicolor

Page 25: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

Estimate P(cj | x) for the versicolor class. From the NB model, Class Iris-versicolor (0.327): sepallength mean 5.936, standard deviation 0.516; sepalwidth mean 2.770, standard deviation 0.314; petallength mean 4.260, standard deviation 0.470; petalwidth mean 1.326, standard deviation 0.198.

P(versicolor | x) ∝ P(versicolor) × P(sepalLength = 5 | versicolor) × P(sepalWidth = 3 | versicolor) × P(petalLength = 2 | versicolor) × P(petalWidth = 2 | versicolor)

P(versicolor | x) ∝ 0.327 × g(5; 5.936, 0.516) × g(3; 2.770, 0.314) × g(2; 4.260, 0.470) × g(2; 1.326, 0.198)

Page 26: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

For continuous attributes, the alternative to assuming a Normal distribution is discretization. After BinDiscretization with bins = 3, each attribute is mapped to 3 ranges.
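A sketch of equal-width binning with 3 bins, assuming BinDiscretization splits each attribute's observed range into equal-width intervals (the range1..range3 names mirror the model output on the next slides):

```python
def make_discretizer(values, bins=3):
    """Equal-width discretization: map a value to one of `bins` ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    def to_range(v):
        k = min(int((v - lo) / width), bins - 1)   # clamp the maximum value
        return "range%d" % (k + 1)
    return to_range

# petalwidth in Iris spans roughly 0.1 .. 2.5
to_range = make_discretizer([0.1, 2.5], bins=3)
print(to_range(2.0))   # range3
```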

Page 27: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

NB Classifier Model:

Class Iris-setosa (0.327):
  sepallength  range1: 0.940  range2: 0.060  range3: 0.000
  sepalwidth   range1: 0.020  range2: 0.720  range3: 0.260
  petallength  range1: 1.000  range2: 0.000  range3: 0.000
  petalwidth   range1: 1.000  range2: 0.000  range3: 0.000

Class Iris-versicolor (0.327):
  sepallength  range1: 0.220  range2: 0.720  range3: 0.060
  sepalwidth   range1: 0.460  range2: 0.540  range3: 0.000
  petallength  range1: 0.000  range2: 0.960  range3: 0.040
  petalwidth   range1: 0.000  range2: 0.980  range3: 0.020

Class Iris-virginica (0.327):
  sepallength  range1: 0.020  range2: 0.640  range3: 0.340
  sepalwidth   range1: 0.380  range2: 0.580  range3: 0.040
  petallength  range1: 0.000  range2: 0.120  range3: 0.880
  petalwidth   range1: 0.000  range2: 0.100  range3: 0.900

Page 28: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

We can build a conditional probability table (CPT) for each attribute:

Class Probabilities
Setosa  Versicolor  Virginica
0.327   0.327       0.327

Attribute: Sepallength
Class       Range1  Range2  Range3
Setosa      0.940   0.060   0.000
Versicolor  0.220   0.720   0.060
Virginica   0.020   0.640   0.340

Attribute: Sepalwidth
Class       Range1  Range2  Range3
Setosa      0.020   0.720   0.260
Versicolor  0.460   0.540   0.000
Virginica   0.380   0.580   0.040

Page 29: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

Attribute: Petallength
Class       Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.960   0.040
Virginica   0.000   0.120   0.880

Attribute: Petalwidth
Class       Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.980   0.020
Virginica   0.000   0.100   0.900

Page 30: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Discretized Attributes

Classification (Implementation) Phase

[Figure: discretized examples with their per-class posterior values; the predicted class is the one with the maximum value.]

Page 31: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Classification Phase

How to classify a new example?

Example: sepallength = 5, sepalwidth = 3, petallength = 2, petalwidth = 2, which is discretized to sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3.

Compute the conditional posterior probabilities for each class:

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)

Page 32: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

Using the class probabilities and CPTs above, for x = (r1, r2, r1, r3):

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(setosa | x) = 0.327 × 0.940 × 0.720 × 1.000 × 0.000 = 0

Page 33: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(versicolor | x) = 0.327 × 0.220 × 0.540 × 0.000 × 0.020 = 0

Page 34: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)

P(virginica | x) = 0.327 × 0.020 × 0.580 × 0.000 × 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.

Page 35: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Laplace Correction

Zero counters lead to zero probabilities, as in P(sepallength = range3 | setosa) = 0.000 in the Sepallength CPT above. To avoid zero probabilities due to zero counters, instead of the estimate

$$\hat{P}(X_i = x_k \mid c_j) = \frac{N_{ijk}}{N_j}$$

we can use the Laplace correction:

$$\hat{P}(X_i = x_k \mid c_j) = \frac{N_{ijk} + 1}{N_j + k}$$

where:
Nijk — number of examples in D such that Xi = xk and C = cj
Nj — number of examples in D of class cj
k — number of possible values of Xi
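A sketch of the corrected estimate; the example plugs in the zero counter behind P(sepallength = range3 | setosa), assuming the usual 50 setosa examples in the Iris data:

```python
def laplace_estimate(n_ijk, n_j, k):
    """Laplace-corrected P^(X_i = x_k | c_j) = (N_ijk + 1) / (N_j + k)."""
    return (n_ijk + 1) / (n_j + k)

# P(sepallength = range3 | setosa): N_ijk = 0, N_j = 50 examples, k = 3 ranges
print(round(laplace_estimate(0, 50, 3), 4))   # 0.0189 instead of 0
```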

Page 36: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filtering

Binary classification problem: is an email SPAM?
1 – is SPAM
0 – is not SPAM

57 real-valued attributes:
- word frequencies
- character frequencies
- average capital-letter run lengths

Spambase dataset from the UCI repository:

Spam  Not Spam  Total
1813  2788      4601

Page 37: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filtering: An Implementation using RapidMiner Studio 6.0.007

Greedy feature selection using "forward selection" and CFS(1) as the performance measure.

(1) Hall, M. A. (1998). Correlation-based Feature Subset Selection for Machine Learning. Thesis submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy at the University of Waikato.

Page 38: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filtering: k-fold Cross-Validation Results (k = 10)

[Figure: RapidMiner results without feature subset selection and with feature subset selection.]

With feature subset selection only 11 attributes are selected, and the filter performs better.

Page 39: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filters: Confusion Matrix

Concept learning problem: is an email SPAM?

Predicted \ Actual  Yes (+)  No (−)
Yes (+)             TP       FP
No (−)              FN       TN

True Positive (TP) – number of e-mails classified as SPAM which truly are
False Positive (FP) – number of e-mails classified as SPAM which are valid
False Negative (FN) – number of e-mails classified as "Not SPAM" which are SPAM
True Negative (TN) – number of e-mails classified as "Not SPAM" which truly are

Spam Precision = TP / (TP + FP): percentage of e-mails classified as SPAM which truly are
Spam Recall = TPR = TP / (TP + FN): percentage of e-mails classified as SPAM over the total of SPAMs
Not Spam Precision = TN / (FN + TN): percentage of e-mails classified as "Not SPAM" which truly are
Not Spam Recall = TN / (FP + TN): percentage of e-mails classified as "Not SPAM" over the total of valid e-mails

Results (Accuracy: 91.74%):

Predicted      Spam (1)  Not Spam (0)  Precision
Spam (1)       1558      125           92.57%
Not Spam (0)   255       2633          91.25%
Recall         85.93%    95.52%

A good SPAM filter must achieve high precision and recall, thus reducing the number of false positives.
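The four metrics computed from the counts in the table (a sketch; the printed values approximately reproduce the reported percentages):

```python
tp, fp = 1558, 125     # predicted Spam:     true Spam, true Not Spam
fn, tn = 255, 2633     # predicted Not Spam: true Spam, true Not Spam

print(tp / (tp + fp))  # Spam precision      ~ 0.926
print(tp / (tp + fn))  # Spam recall (TPR)   ~ 0.859
print(tn / (fn + tn))  # Not Spam precision  ~ 0.912
print(tn / (fp + tn))  # Not Spam recall     ~ 0.955
```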

Page 40: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes Performance

NB is one of the simplest and most effective classifiers. However, NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

[Figure: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training set sizes from 500 to 12500 examples; bias and variance components range between 0% and 18%.]

In practice the independence assumption is violated → HIGH BIAS, which can lead to poor classification. However, NB is efficient due to its high variance management: fewer parameters → LOW VARIANCE.

Page 41: Lesson 7.1  Naive Bayes Classifier

References

1. Castillo, G. (2006). Adaptive Learning Algorithms for Bayesian Network Classifiers. PhD Thesis. Chapter 3.3: Naïve Bayes Classifier.
2. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130.
3. Duda, R.O. and Hart, P.E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons, Inc.
4. John, G.H. and Langley, P. (1995). Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345. Morgan Kaufmann, San Mateo.
5. Mitchell, T. (1997). Machine Learning. Chapter 6: Bayesian Learning. McGraw-Hill.
6. Tan, P., Steinbach, M., Kumar, V. (2006). Introduction to Data Mining. Chapter 5: Classification - Alternative Techniques, Section 5.1 (pp. 166-180). Addison Wesley Higher Education.