
Lesson 7.1 Naive Bayes Classifier


The objective of this lesson is to introduce the popular Naive Bayes classification model. The list of references used in the preparation of these slides can be found at the course website page: https://sites.google.com/site/gladyscjaprendizagem/program/bayesian-network-classifiers


Page 1: Lesson 7.1  Naive Bayes Classifier


Naive Bayes

Gladys Castillo

University of Aveiro

Page 2: Lesson 7.1  Naive Bayes Classifier


The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Given: a dataset D with N labeled examples of <X, C>:
D = {<x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)>}

Build: a classifier, a hypothesis hC: X → C that can correctly predict the class labels of new objects.

Learning Phase: a supervised learning algorithm takes the dataset D as input and outputs the hypothesis hC.

Classification Phase: given the attribute values of a new example x(N+1) as input, the class attached to it is c(N+1) = hC(x(N+1)) ∈ C.

[Figure: the training dataset D, with rows <x(i), c(i)> over the attributes X1, …, Xn and the class C, and a scatter plot of the examples over the attributes X1 and X2, separating the classes "give credit" and "don't give credit".]

Page 3: Lesson 7.1  Naive Bayes Classifier


Statistical Classifiers

Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables. A random variable is defined by its probability density function f(x).

Statistical classifiers give the probability P(cj | x) that x belongs to a particular class, rather than a simple classification: instead of having the map X → C, we have X → P(C | X).

[Figure: probability density function of a random variable and a few observations.]

The class c* attached to an example is the class with the highest posterior probability P(cj | x).

Page 4: Lesson 7.1  Naive Bayes Classifier


Bayesian Classifiers

Bayesian because the class c* attached to an example x is determined by Bayes' Theorem, the main tool in Bayesian inference. From

P(X, C) = P(X | C) P(C) and P(X, C) = P(C | X) P(X)

we obtain

$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}$$

We can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.

Page 5: Lesson 7.1  Naive Bayes Classifier


Bayes Theorem Example

Given:

A doctor knows that meningitis causes stiff neck 50% of the time

Prior probability of any patient having meningitis is 1/50,000

Prior probability of any patient having stiff neck is 1/20

If a patient has stiff neck, what's the probability he/she has meningitis?

$$P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$

In general, posterior ∝ prior × likelihood:

$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}$$

where P(X | C) is the likelihood and P(C) is the prior.
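The same arithmetic as a minimal Python sketch:

```python
# Bayes' theorem for the meningitis example: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5      # likelihood: P(stiff neck | meningitis)
p_m = 1 / 50000        # prior: P(meningitis)
p_s = 1 / 20           # prior: P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)     # 0.0002
```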

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.

Page 6: Lesson 7.1  Naive Bayes Classifier

Bayesian Classifier

How to determine P(cj | x) for each class cj? By Bayes' Theorem:

$$P(c_j \mid \mathbf{x}) = \frac{P(c_j)\, P(\mathbf{x} \mid c_j)}{P(\mathbf{x})}$$

P(x) can be ignored because it is the same for all the classes (a normalization constant). This gives the Bayesian Classification Rule, a maximum a posteriori classification: classify x into the class with the highest posterior probability,

$$h^*_{Bayes}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j)\, P(\mathbf{x} \mid c_j)$$

Page 7: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

"Bayes" because the class c* attached to an example x is determined by Bayes' Theorem:

$$h^*_{Bayes}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j)\, P(\mathbf{x} \mid c_j)$$

When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.

"Naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. P(x | cj) can then be decomposed into a product of n terms, one term for each attribute, giving the NB Classification Rule:

$$h^*_{NB}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j) \prod_{i=1}^{n} P(X_i = x_i \mid c_j)$$
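The rule translates directly into a few lines of code. A minimal sketch, where `priors` and `likelihood` stand for the estimates produced by the learning phase described next:

```python
import math

def nb_classify(x, priors, likelihood):
    """h_NB(x) = argmax_j P(c_j) * prod_i P(X_i = x_i | c_j).

    x          -- dict mapping attribute name to observed value
    priors     -- dict mapping class label to P(c_j)
    likelihood -- function (attr, value, c) -> P(X_i = x_i | c_j)
    """
    def score(c):
        return priors[c] * math.prod(likelihood(a, v, c) for a, v in x.items())
    return max(priors, key=score)
```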

Page 8: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes (NB) Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:

$$\hat{P}(c_j) = \frac{N_j}{N}$$

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj.

Xi discrete:

$$\hat{P}(X_i = x_k \mid c_j) = \frac{N_{ijk}}{N_j}$$

Xi continuous, two options:
- The attribute is discretized and then treated as a discrete attribute.
- A Normal distribution is usually assumed:

$$P(X_i = x_k \mid c_j) = g(x_k; \mu_{ij}, \sigma_{ij}), \qquad g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where:
Nj — number of examples of the class cj
Nijk — number of examples of the class cj having the value xk for the attribute Xi
The mean μij and the standard deviation σij are estimated from D.
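A sketch of this learning phase in Python (assuming complete data; the dict-based representation is an illustration, not from the slides). It uses relative frequencies for discrete attributes and the sample mean and standard deviation for continuous ones:

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(examples, discrete, continuous):
    """examples: list of (attribute_dict, class_label) pairs.

    Returns:
      priors -- P^(c_j) = N_j / N
      disc   -- (attr, class) -> {value: N_ijk / N_j}
      gauss  -- (attr, class) -> (mean, sample standard deviation)
    """
    n = len(examples)
    class_counts = Counter(c for _, c in examples)
    priors = {c: nj / n for c, nj in class_counts.items()}

    counts = defaultdict(Counter)     # (attr, class) -> value counters N_ijk
    observed = defaultdict(list)      # (attr, class) -> continuous values
    for x, c in examples:             # one scan through the data
        for a in discrete:
            counts[(a, c)][x[a]] += 1
        for a in continuous:
            observed[(a, c)].append(x[a])

    disc = {key: {v: nijk / class_counts[key[1]] for v, nijk in cnt.items()}
            for key, cnt in counts.items()}
    gauss = {}
    for key, xs in observed.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
        gauss[key] = (mu, math.sqrt(var))
    return priors, disc, gauss
```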

Page 9: Lesson 7.1  Naive Bayes Classifier

Continuous Attributes: Normal or Gaussian Distribution

Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj. A Normal distribution is usually assumed:

Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D:

$$P(X_i = x_k \mid c_j) = g(x_k; \mu_{ij}, \sigma_{ij}), \qquad g(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The probability density function f(x) is symmetrical around its mean.

[Figure: density curve of N(0, 2).]

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by f(66) = g(66; 74, 6) = 0.0273.
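A minimal sketch of g and of the example value:

```python
import math

def g(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(round(g(66, 74, 6), 4))   # 0.0273, for X ~ N(74, 36), i.e. sigma = 6
```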

Page 10: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes Probability Estimates

Binary classification problem, example from John & Langley (1995). Two classes: + (positive), − (negative). Two attributes: X1, discrete, which takes values a and b; X2, continuous.

Training dataset D:

Class  X1   X2
  +    a    1.0
  +    b    1.2
  +    a    3.0
  −    b    4.4
  −    b    4.5

1. Estimate P(cj) for each class cj:

P̂(C = +) = 3/5    P̂(C = −) = 2/5

2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj (X1 discrete):

P̂(X1 = a | +) = 2/3    P̂(X1 = b | +) = 1/3
P̂(X1 = a | −) = 0/2    P̂(X1 = b | −) = 2/2

For the continuous attribute X2, a Normal distribution is assumed, with mean and standard deviation estimated from D:

P(X2 = x | +) = g(x; 1.73, 1.10), i.e. μ2+ = 1.73, σ2+ = 1.10
P(X2 = x | −) = g(x; 4.45, 0.07), i.e. μ2− = 4.45, σ2− = 0.07

Page 11: Lesson 7.1  Naive Bayes Classifier

Probability Estimates: Discrete Attributes

Training set (adapted from © Tan, Steinbach, Kumar, Introduction to Data Mining). Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class:

Tid  Refund  Marital Status  Taxable Income  Evade
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes

For each class, P̂(cj) = Nj / N:

P(No) = 7/10
P(Yes) = 3/10

For each attribute value and class, P̂(Xi = xk | cj) = Nijk / Nj. To compute Nijk we count the number of examples of the class cj having the value xk for the attribute Xi. Examples:

P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0

Page 12: Lesson 7.1  Naive Bayes Classifier

Probability Estimates: Continuous Attributes

For each attribute-class pair (Xi, cj), assume a Normal distribution:

$$P(X_i = x_k \mid c_j) = g(x_k; \mu_{ij}, \sigma_{ij}) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}}\, e^{-\frac{(x_k-\mu_{ij})^2}{2\sigma_{ij}^2}}$$

Example for (Income, Evade = No), using the training set on the previous slide. If Evade = No:
sample mean = 110
sample standard deviation = 54.54

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)}\, e^{-\frac{(120-110)^2}{2(2975)}} = 0.0072$$
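The same estimate checked with a short sketch (income values taken from the Evade = No rows of the table, in thousands):

```python
import math

incomes_no = [125, 100, 70, 120, 60, 220, 75]   # Taxable Income where Evade = No
mu = sum(incomes_no) / len(incomes_no)           # 110.0
var = sum((x - mu) ** 2 for x in incomes_no) / (len(incomes_no) - 1)
sigma = math.sqrt(var)                           # 54.54 (var = 2975)

p = math.exp(-(120 - mu) ** 2 / (2 * var)) / (math.sqrt(2 * math.pi) * sigma)
print(round(p, 4))                               # 0.0072
```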

Page 13: Lesson 7.1  Naive Bayes Classifier


The Balance-scale Problem

Balance-scale dataset from the UCI repository. The dataset was generated to model psychological experimental results.

Each example has 4 numerical attributes:
- the left weight (Left_W)
- the left distance (Left_D)
- the right weight (Right_W)
- the right distance (Right_D)

Each example is classified into one of 3 classes, according to whether the balance scale:
- tips to the right (Right)
- tips to the left (Left)
- is balanced (Balanced)

3 target rules:
- If LD × LW > RD × RW → tips to the left
- If LD × LW < RD × RW → tips to the right
- If LD × LW = RD × RW → it is balanced

[Figure: a balance scale with weights Left_W, Right_W placed at distances Left_D, Right_D.]

Page 14: Lesson 7.1  Naive Bayes Classifier

The Balance-scale Problem

Balance-Scale dataset (adapted from © João Gama's slides "Aprendizagem Bayesiana" [Bayesian Learning]):

Left_W  Left_D  Right_W  Right_D  Class
  1       5       4        2      Right
  2       5       3        2      Left
  3       4       6        2      Balanced
 ...     ...     ...      ...     ...

Discretization is applied: each attribute is mapped to 5 intervals.

Page 15: Lesson 7.1  Naive Bayes Classifier

The Balance-scale Problem: Learning Phase

From the 565 training examples we build the class counters and a contingency table for each attribute.

Class Counters
Left: 260   Balanced: 45   Right: 260   Total: 565

Attribute: Left_W
Class      I1  I2  I3  I4  I5
Left       14  42  61  71  72
Balanced   10   8   8  10   9
Right      86  66  49  34  25

Attribute: Left_D
Class      I1  I2  I3  I4  I5
Left       16  38  59  70  77
Balanced    8  10   9  10   8
Right      90  57  49  37  27

Attribute: Right_W
Class      I1  I2  I3  I4  I5
Left       87  63  49  33  28
Balanced    8  10  10   9   8
Right      16  37  58  70  79

Attribute: Right_D
Class      I1  I2  I3  I4  I5
Left       91  65  44  35  25
Balanced    8  10   8  10   9
Right      17  37  57  67  82

Assuming complete data, the computation of all the required estimates requires a single scan through the data, an operation of time complexity O(N × n), where N is the number of training examples and n is the number of attributes.
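A sketch of the single scan that builds these counters (the nested-dict layout is an illustration):

```python
from collections import defaultdict

def build_counters(examples, attributes):
    """One scan through D, O(N * n): class counters N_j and,
    for each attribute, the contingency counters N_ijk."""
    class_counts = defaultdict(int)
    tables = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}
    for x, c in examples:
        class_counts[c] += 1
        for a in attributes:
            tables[a][c][x[a]] += 1
    return class_counts, tables
```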

Page 16: Lesson 7.1  Naive Bayes Classifier

The Balance-scale Problem: Classification Phase

How does NB classify this example?

Left_W  Left_D  Right_W  Right_D  Class
  1       5       4        2        ?

We need to estimate the posterior probabilities P(cj | x) for each class. The class counters and contingency tables are used to compute them:

P(cj | x) ∝ P(cj) × P(Left_W=1 | cj) × P(Left_D=5 | cj) × P(Right_W=4 | cj) × P(Right_D=2 | cj),  cj ∈ {Left, Balanced, Right}

$$h^*_{NB}(\mathbf{x}) = \arg\max_{j=1\ldots m} P(c_j) \prod_{i=1}^{n} P(X_i = x_i \mid c_j)$$

The class attached to the example is the one with the highest posterior probability:

P(Left | x)  P(Balanced | x)  P(Right | x)
0.277796     0.135227         0.586978      → Class = Right
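A sketch of this step on top of the counters built above, normalizing the scores over the classes (assuming attribute values 1..5 map to the intervals I1..I5; the exact posteriors depend on the discretization actually used):

```python
def posteriors(x, class_counts, tables):
    """Normalized P(c_j | x) from class counters and contingency tables."""
    n = sum(class_counts.values())
    scores = {}
    for c, n_j in class_counts.items():
        s = n_j / n                        # prior P(c_j)
        for a, v in x.items():
            s *= tables[a][c][v] / n_j     # P(X_i = x_i | c_j)
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()} if z > 0 else scores

x = {"Left_W": "I1", "Left_D": "I5", "Right_W": "I4", "Right_D": "I2"}
# max(posteriors(x, class_counts, tables), key=...) -> "Right"
```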

Page 17: Lesson 7.1  Naive Bayes Classifier

Iris Dataset

Iris dataset from the UCI repository, introduced by the statistician R. A. Fisher.

Three flower types (classes):
- Setosa
- Virginica
- Versicolour

Four continuous attributes:
- Sepal width and length
- Petal width and length

[Photo: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]

Page 18: Lesson 7.1  Naive Bayes Classifier


Iris Dataset

Page 19: Lesson 7.1  Naive Bayes Classifier


Iris Dataset

Page 20: Lesson 7.1  Naive Bayes Classifier

Scatter Plot Array of Iris Attributes

[Figure: pairwise scatter plots of the four Iris attributes, colored by class.]

The attributes petal width and petal length provide a moderate separation of the Iris species.

Page 21: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

Normal probability density functions for the attribute PetalWidth: P(PetalWidth | Setosa), P(PetalWidth | Versicolor) and P(PetalWidth | Virginica), with parameters:

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275

Page 22: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

NB Classifier Model:

Class Iris-setosa (0.327):
  sepallength  mean: 5.006, standard deviation: 0.352
  sepalwidth   mean: 3.418, standard deviation: 0.381
  petallength  mean: 1.464, standard deviation: 0.174
  petalwidth   mean: 0.244, standard deviation: 0.107

Class Iris-versicolor (0.327):
  sepallength  mean: 5.936, standard deviation: 0.516
  sepalwidth   mean: 2.770, standard deviation: 0.314
  petallength  mean: 4.260, standard deviation: 0.470
  petalwidth   mean: 1.326, standard deviation: 0.198

Class Iris-virginica (0.327):
  sepallength  mean: 6.588, standard deviation: 0.636
  sepalwidth   mean: 2.974, standard deviation: 0.322
  petallength  mean: 5.552, standard deviation: 0.552
  petalwidth   mean: 2.026, standard deviation: 0.275

Page 23: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

Classification Phase

[Figure: classified examples with their per-class posterior values; the predicted class is the one with the maximum value.]

Page 24: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

How does NB classify this example? Estimate P(cj | x) for each class:

P(cj | x) ∝ P(cj) × P(sepalLength = 5 | cj) × P(sepalWidth = 3 | cj) × P(petalLength = 2 | cj) × P(petalWidth = 2 | cj),  cj ∈ {setosa, versicolor, virginica}

The class attached to the example is the one with the highest posterior probability:

P(setosa | x)  P(versicolor | x)  P(virginica | x)
0              0.995              0.005             → Class = versicolor

Page 25: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Continuous Attributes

Estimate P(cj | x) for the versicolor class. From the NB model, Class Iris-versicolor (0.327): sepallength mean 5.936, standard deviation 0.516; sepalwidth mean 2.770, standard deviation 0.314; petallength mean 4.260, standard deviation 0.470; petalwidth mean 1.326, standard deviation 0.198.

P(versicolor | x) ∝ P(versicolor) × P(sepalLength = 5 | versicolor) × P(sepalWidth = 3 | versicolor) × P(petalLength = 2 | versicolor) × P(petalWidth = 2 | versicolor)

P(versicolor | x) ∝ 0.327 × g(5; 5.936, 0.516) × g(3; 2.770, 0.314) × g(2; 4.260, 0.470) × g(2; 1.326, 0.198)

Page 26: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

For continuous attributes, the alternative to assuming a Normal distribution is discretization. After BinDiscretization with bins = 3, each attribute is mapped to 3 ranges.
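A sketch of equal-width binning with 3 bins, assuming BinDiscretization splits each attribute's observed range into equal-width intervals (the range1..range3 names mirror the model output on the next slides):

```python
def make_discretizer(values, bins=3):
    """Equal-width discretization: map a value to one of `bins` ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    def to_range(v):
        k = min(int((v - lo) / width), bins - 1)   # clamp the maximum value
        return "range%d" % (k + 1)
    return to_range

# petalwidth in Iris spans roughly 0.1 .. 2.5
to_range = make_discretizer([0.1, 2.5], bins=3)
print(to_range(2.0))   # range3
```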

Page 27: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

NB Classifier Model:

Class Iris-setosa (0.327):
  sepallength  range1: 0.940  range2: 0.060  range3: 0.000
  sepalwidth   range1: 0.020  range2: 0.720  range3: 0.260
  petallength  range1: 1.000  range2: 0.000  range3: 0.000
  petalwidth   range1: 1.000  range2: 0.000  range3: 0.000

Class Iris-versicolor (0.327):
  sepallength  range1: 0.220  range2: 0.720  range3: 0.060
  sepalwidth   range1: 0.460  range2: 0.540  range3: 0.000
  petallength  range1: 0.000  range2: 0.960  range3: 0.040
  petalwidth   range1: 0.000  range2: 0.980  range3: 0.020

Class Iris-virginica (0.327):
  sepallength  range1: 0.020  range2: 0.640  range3: 0.340
  sepalwidth   range1: 0.380  range2: 0.580  range3: 0.040
  petallength  range1: 0.000  range2: 0.120  range3: 0.880
  petalwidth   range1: 0.000  range2: 0.100  range3: 0.900

Page 28: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

We can build a conditional probability table (CPT) for each attribute:

Class Probabilities
Setosa  Versicolor  Virginica
0.327   0.327       0.327

Attribute: Sepallength
Class       Range1  Range2  Range3
Setosa      0.940   0.060   0.000
Versicolor  0.220   0.720   0.060
Virginica   0.020   0.640   0.340

Attribute: Sepalwidth
Class       Range1  Range2  Range3
Setosa      0.020   0.720   0.260
Versicolor  0.460   0.540   0.000
Virginica   0.380   0.580   0.040

Page 29: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

Attribute: Petallength
Class       Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.960   0.040
Virginica   0.000   0.120   0.880

Attribute: Petalwidth
Class       Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.980   0.020
Virginica   0.000   0.100   0.900

Page 30: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Iris Dataset – Discretized Attributes

Classification (Implementation) Phase

[Figure: discretized examples with their per-class posterior values; the predicted class is the one with the maximum value.]

Page 31: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Classification Phase

How to classify a new example?

Example: sepallength = 5, sepalwidth = 3, petallength = 2, petalwidth = 2, which is discretized to sepallength = r1, sepalwidth = r2, petallength = r1, petalwidth = r3.

Compute the conditional posterior probabilities for each class:

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)

Page 32: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

Using the class probabilities and CPTs above, for x = (r1, r2, r1, r3):

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(setosa | x) = 0.327 × 0.940 × 0.720 × 1.000 × 0.000 = 0

Page 33: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(versicolor | x) = 0.327 × 0.220 × 0.540 × 0.000 × 0.020 = 0

Page 34: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Continuous Attributes – Discretization

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)

P(virginica | x) = 0.327 × 0.020 × 0.580 × 0.000 × 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.

Page 35: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes: Laplace Correction

Zero counters lead to zero probabilities, as in P(sepallength = range3 | setosa) = 0.000 in the Sepallength CPT above. To avoid zero probabilities due to zero counters, instead of the estimate

$$\hat{P}(X_i = x_k \mid c_j) = \frac{N_{ijk}}{N_j}$$

we can use the Laplace correction:

$$\hat{P}(X_i = x_k \mid c_j) = \frac{N_{ijk} + 1}{N_j + k}$$

where:
Nijk — number of examples in D such that Xi = xk and C = cj
Nj — number of examples in D of class cj
k — number of possible values of Xi
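A sketch of the corrected estimate; the example plugs in the zero counter behind P(sepallength = range3 | setosa), assuming the usual 50 setosa examples in the Iris data:

```python
def laplace_estimate(n_ijk, n_j, k):
    """Laplace-corrected P^(X_i = x_k | c_j) = (N_ijk + 1) / (N_j + k)."""
    return (n_ijk + 1) / (n_j + k)

# P(sepallength = range3 | setosa): N_ijk = 0, N_j = 50 examples, k = 3 ranges
print(round(laplace_estimate(0, 50, 3), 4))   # 0.0189 instead of 0
```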

Page 36: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filtering

Binary classification problem: is an email SPAM?
1 – is SPAM
0 – is not SPAM

57 real-valued attributes:
- word frequencies
- character frequencies
- average capital-letter run lengths

Spambase dataset from the UCI repository:

Spam  Not Spam  Total
1813  2788      4601

Page 37: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filtering: An Implementation using RapidMiner Studio 6.0.007

Greedy feature selection using "forward selection" and CFS(1) as the performance measure.

(1) Hall, M. A. (1998). Correlation-based Feature Subset Selection for Machine Learning. Thesis submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy at the University of Waikato.

Page 38: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filtering: k-fold Cross-Validation Results (k = 10)

[Figure: RapidMiner results without feature subset selection and with feature subset selection.]

With feature subset selection only 11 attributes are selected, and the filter performs better.

Page 39: Lesson 7.1  Naive Bayes Classifier

Bayesian SPAM Filters: Confusion Matrix

Concept learning problem: is an email SPAM?

Predicted \ Actual  Yes (+)  No (−)
Yes (+)             TP       FP
No (−)              FN       TN

True Positive (TP) – number of e-mails classified as SPAM which truly are
False Positive (FP) – number of e-mails classified as SPAM which are valid
False Negative (FN) – number of e-mails classified as "Not SPAM" which are SPAM
True Negative (TN) – number of e-mails classified as "Not SPAM" which truly are

Spam Precision = TP / (TP + FP): percentage of e-mails classified as SPAM which truly are
Spam Recall = TPR = TP / (TP + FN): percentage of e-mails classified as SPAM over the total of SPAMs
Not Spam Precision = TN / (FN + TN): percentage of e-mails classified as "Not SPAM" which truly are
Not Spam Recall = TN / (FP + TN): percentage of e-mails classified as "Not SPAM" over the total of valid e-mails

Results (Accuracy: 91.74%):

Predicted      Spam (1)  Not Spam (0)  Precision
Spam (1)       1558      125           92.57%
Not Spam (0)   255       2633          91.25%
Recall         85.93%    95.52%

A good SPAM filter must achieve high precision and recall, thus reducing the number of false positives.
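The four metrics computed from the counts in the table (a sketch; the printed values approximately reproduce the reported percentages):

```python
tp, fp = 1558, 125     # predicted Spam:     true Spam, true Not Spam
fn, tn = 255, 2633     # predicted Not Spam: true Spam, true Not Spam

print(tp / (tp + fp))  # Spam precision      ~ 0.926
print(tp / (tp + fn))  # Spam recall (TPR)   ~ 0.859
print(tn / (fn + tn))  # Not Spam precision  ~ 0.912
print(tn / (fp + tn))  # Not Spam recall     ~ 0.955
```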

Page 40: Lesson 7.1  Naive Bayes Classifier

Naïve Bayes Performance

NB is one of the simplest and most effective classifiers. However, NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

[Figure: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training set sizes from 500 to 12500 examples; bias and variance components range between 0% and 18%.]

In practice the independence assumption is violated → HIGH BIAS, which can lead to poor classification. However, NB is efficient due to its high variance management: fewer parameters → LOW VARIANCE.

Page 41: Lesson 7.1  Naive Bayes Classifier

References

1. Castillo, G. (2006). Adaptive Learning Algorithms for Bayesian Network Classifiers. PhD Thesis. Chapter 3.3: Naïve Bayes Classifier.
2. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130.
3. Duda, R.O. and Hart, P.E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons, Inc.
4. John, G.H. and Langley, P. (1995). Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345. Morgan Kaufmann, San Mateo.
5. Mitchell, T. (1997). Machine Learning. Chapter 6: Bayesian Learning. McGraw-Hill.
6. Tan, P., Steinbach, M., Kumar, V. (2006). Introduction to Data Mining. Chapter 5: Classification - Alternative Techniques, Section 5.1 (pp. 166-180). Addison Wesley Higher Education.