
Page 1

Aprendizagem Computacional Gladys Castillo, UA

Bayesian Networks Classifiers

Gladys Castillo, University of Aveiro

Page 2


Bayesian Networks Classifiers

Part I – Naive Bayes
Part II – Introduction to Bayesian Networks

Page 3


The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Given: a dataset D with N labeled examples of <X, C>.
Build: a classifier, a hypothesis hC: X → C, that can correctly predict the class labels of new objects.

Learning Phase: a supervised learning algorithm takes the training data

D = { <x(1), c(1)>, <x(2), c(2)>, …, <x(N), c(N)> }, where each x(i) = (x1(i), x2(i), …, xn(i)) is a vector of values for the attributes X1, …, Xn and c(i) is its class label,

and outputs the hypothesis hC.

Classification Phase: the class attached to a new example x(N+1) is c(N+1) = hC(x(N+1)) ∈ C (input: the attribute values of x(N+1); output: its class).

[Figure: training examples plotted over attributes X1 and X2; class "give credit" marked x, class "don't give credit" marked o.]

Page 4


Statistical Classifiers

- Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables; a random variable is defined by its probability density function f(x).
- Give the probability P(cj | x) that x belongs to a particular class, rather than a simple classification: instead of having the map X → C, we have X → P(C | X).
- The class c* attached to an example is the class with the largest P(cj | x).

[Figure: probability density function f(x) of a random variable and a few observations.]

Page 5


Bayesian Classifiers

"Bayesian" because the class c* attached to an example x is determined by Bayes' theorem, the main tool in Bayesian inference:

P(X, C) = P(X | C) P(C)
P(X, C) = P(C | X) P(X)

P(C | X) = P(C) P(X | C) / P(X)

We can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.

Page 6


Bayes' Theorem: Example

Given: A doctor knows that meningitis causes stiff neck 50% of the time. The prior probability of any patient having meningitis is 1/50,000. The prior probability of any patient having stiff neck is 1/20.

If a patient has stiff neck, what is the probability he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Here P(S | M) is the likelihood, P(M) and P(S) are priors, and P(M | S) is the posterior: posterior ∝ prior × likelihood, i.e., P(C | X) = P(C) P(X | C) / P(X).

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
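To make the arithmetic concrete, a minimal sketch of this computation in Python (the numbers are exactly those given above):

```python
# Bayes' theorem for the meningitis example: P(M | S) = P(S | M) P(M) / P(S)
p_s_given_m = 0.5      # likelihood: meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000       # prior probability of meningitis
p_s = 1 / 20           # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)     # 0.0002
```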

Page 7


Bayesian Classifier

How to determine P(cj | x) for each class cj? By Bayes' theorem:

P(cj | x) = P(cj) P(x | cj) / P(x)

P(x) can be ignored because it is just a normalizing constant that ensures the posterior adds up to 1; it can be computed by summing the numerator over all possible values of C, i.e., P(x) = Σj P(cj) P(x | cj).

Bayesian classification rule (maximum a posteriori classification): classify x into the class with the largest posterior probability:

c* = hBayes(x) = argmax j=1…m P(cj) P(x | cj)
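A minimal Python sketch of this rule; the function name and the dictionary-based inputs are illustrative choices, not from the slides:

```python
def map_classify(priors, likelihoods):
    """Maximum a posteriori rule: argmax_j P(c_j) * P(x | c_j).

    priors:      dict mapping each class c_j to P(c_j)
    likelihoods: dict mapping each class c_j to P(x | c_j) for the observed x
    P(x) is the same for every class, so it can be dropped from the argmax.
    """
    return max(priors, key=lambda c: priors[c] * likelihoods[c])

# Normalizing constant, if the actual posteriors are needed:
# p_x = sum(priors[c] * likelihoods[c] for c in priors)
```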

Page 8


Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

"Bayes" because the class c* attached to an example x is determined by Bayes' theorem:

c* = hBayes(x) = argmax j=1…m P(cj) P(x | cj)

When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.

"Naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. P(x | cj) can then be decomposed into a product of n terms, one term for each attribute.

NB classification rule:

c* = hNB(x) = argmax j=1…m P(cj) ∏i=1…n P(Xi = xi | cj)
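A minimal sketch of the NB rule in Python, assuming the parameters have already been estimated (the data structures are illustrative); logs are used so a product of many small probabilities does not underflow:

```python
import math

def nb_classify(priors, cond_probs, x):
    """Naive Bayes rule: argmax_j P(c_j) * prod_i P(X_i = x_i | c_j).

    priors:     dict mapping class -> P(c_j)
    cond_probs: dict mapping (class, i, value) -> P(X_i = value | c_j)
    x:          tuple of observed attribute values (x_1, ..., x_n)
    Assumes all the probabilities involved are strictly positive.
    """
    def log_posterior(c):
        return math.log(priors[c]) + sum(
            math.log(cond_probs[(c, i, v)]) for i, v in enumerate(x))
    return max(priors, key=log_posterior)
```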

Page 9


Naïve Bayes (NB)
Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj:

P̂(cj) = Nj / N

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj:

Xi discrete:

P̂(Xi = xk | cj) = Nijk / Nj

Xi continuous, two options:
- the attribute is discretized and then treated as a discrete attribute;
- a Normal distribution is usually assumed:

P(Xi = xk | cj) = g(xk; μij, σij), where g(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

and the mean μij and the standard deviation σij are estimated from D.

Here Nj is the number of examples of the class cj, and Nijk is the number of examples of the class cj having the value xk for the attribute Xi.
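A minimal sketch of these frequency estimates for discrete attributes, using the slide's notation (Nj, Nijk); the function name and data layout are illustrative:

```python
from collections import Counter

def nb_fit_discrete(examples):
    """Estimate P(c_j) = N_j / N and P(X_i = x_k | c_j) = N_ijk / N_j
    from a list of (x, c) pairs, where x is a tuple of discrete attribute
    values and c is the class label. Assumes complete data, no smoothing.
    """
    n = len(examples)
    class_counts = Counter(c for _, c in examples)                  # N_j
    value_counts = Counter((c, i, v)                                # N_ijk
                           for x, c in examples for i, v in enumerate(x))

    priors = {c: n_j / n for c, n_j in class_counts.items()}
    cond_probs = {(c, i, v): n_ijk / class_counts[c]
                  for (c, i, v), n_ijk in value_counts.items()}
    return priors, cond_probs
```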

Page 10


Continuous Attributes: Normal or Gaussian Distribution

2. Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj. For real attributes a Normal distribution is usually assumed:

Xi | cj ~ N(μij, σij²), where the mean μij and the standard deviation σij are estimated from D, so that

P(Xi = xk | cj) = g(xk; μij, σij), with g(x; μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

The density function f(x) is symmetrical around its mean.

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by f(66) = g(66; 74, 6) = 0.0273.

[Figure: probability density function of N(0, 2).]
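A quick check of this value in Python (note N(74, 36) means variance 36, i.e. σ = 6):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density g(x; mu, sigma) used by NB for continuous attributes."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(round(gaussian_pdf(66, 74, 6), 4))   # 0.0273
```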

Page 11


Naïve Bayes Probability Estimates
Example from John & Langley (1995)

Binary classification problem. Two classes: + (positive) and − (negative). Two attributes: X1, discrete, taking the values a and b; X2, continuous.

Training dataset D:

Class X1 X2
+ a 1.0
+ b 1.2
+ a 3.0
− b 4.4
− b 4.5

1. Estimate P(cj) for each class cj:

P̂(C = +) = 3/5   P̂(C = −) = 2/5

2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj:

P̂(X1 = a | C = +) = 2/3   P̂(X1 = b | C = +) = 1/3
P̂(X1 = a | C = −) = 0/2   P̂(X1 = b | C = −) = 2/2

For the continuous X2, estimate a Normal density per class:

μ2+ = 1.73, σ2+ = 1.10 → P(X2 = x | C = +) = g(x; 1.73, 1.10)
μ2− = 4.45, σ2− = 0.07 → P(X2 = x | C = −) = g(x; 4.45, 0.07)
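These estimates can be reproduced directly from the five training examples; a minimal sketch (statistics.stdev uses the n−1 sample formula, which matches the slide's numbers):

```python
import statistics

# The training dataset D from the slide: (X1, X2, class)
data = [("a", 1.0, "+"), ("b", 1.2, "+"), ("a", 3.0, "+"),
        ("b", 4.4, "-"), ("b", 4.5, "-")]

pos = [x2 for _, x2, c in data if c == "+"]
neg = [x2 for _, x2, c in data if c == "-"]

print(len(pos) / len(data))                          # P(C=+) = 3/5
print(statistics.mean(pos), statistics.stdev(pos))   # mu_2+ ~ 1.73, sigma_2+ ~ 1.10
print(statistics.mean(neg), statistics.stdev(neg))   # mu_2- = 4.45, sigma_2- ~ 0.07
```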

Page 12


Naïve Bayes Performance

NB is one of the simplest and most effective classifiers, yet it has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

In practice the independence assumption is violated → HIGH BIAS; this can lead to poor classification.

However, NB remains efficient because it manages variance well: fewer parameters → LOW VARIANCE.

[Figure: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset; error (0%–18%) vs. number of examples (500 to 12,500), split into bias and variance components.]

Page 13


Improving Naïve Bayes

Two routes:
- reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption; one natural extension: Bayesian Network Classifiers
- reducing the bias of the parameter estimates, by improving the probability estimates computed from data (see the sketch below)
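One standard way to improve the frequency estimates (not necessarily the method of the works cited below) is Laplace smoothing, which removes zero estimates such as P̂(X1 = a | C = −) = 0/2 from the previous example; a minimal sketch:

```python
def laplace_estimate(n_ijk, n_j, k):
    """Laplace-smoothed estimate of P(X_i = x_k | c_j):
    (N_ijk + 1) / (N_j + k), where k is the number of distinct values of X_i.
    """
    return (n_ijk + 1) / (n_j + k)

print(laplace_estimate(0, 2, 2))   # P(X1 = a | C = -): 0/2 becomes 1/4
```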

Relevant works:
- Webb and Pazzani (1998) - "Adjusted probability naive Bayesian induction", in LNCS v. 1502
- J. Gama (2001, 2003) - "Iterative Bayes", in Theoretical Computer Science, v. 292
- Friedman, Geiger and Goldszmidt (1997) - "Bayesian Network Classifiers", in Machine Learning, 29
- Pazzani (1995) - "Searching for attribute dependencies in Bayesian Network Classifiers", in Proc. of the 5th Workshop on Artificial Intelligence and Statistics
- Keogh and Pazzani (1999) - "Learning augmented Bayesian classifiers…", in Theoretical Computer Science, v. 292

Page 14


Bayesian Networks Classifiers

Part II - Introduction to Bayesian Networks

Page 15


Reasoning under Uncertainty: Probabilistic Approach

A problem domain is modeled by a set of random variables X1, X2, …, Xn. Knowledge about the problem domain is represented by a joint probability distribution P(X1, X2, …, Xn).

Example: ALARM network (Pearl 1988)
- Story: In LA, burglary and earthquake are not uncommon. Both can cause the alarm to go off. In case of alarm, two neighbors, John and Mary, may call.
- Problem: estimate the probability of a burglary based on who has or has not called.
- Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)
- Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M)

Page 16


Inference with the Joint Probability Distribution

To specify P(X1, X2, …, Xn) for n binary variables we need at least 2^n − 1 probabilities → exponential model size:
- knowledge acquisition is difficult (complex, unnatural)
- exponential storage and inference

Solution: by exploiting conditional independence we can obtain an appropriate factorization of the joint probability distribution, so the number of parameters can be substantially reduced. A good factorization is provided by Bayesian Networks.

Page 17


Bayesian Networks
Pearl (1988, 2000); Jensen (1996); Lauritzen (1996)

Bayesian Networks (BNs) graphically represent the joint probability distribution of a set X of random variables in a problem domain, combining probability theory and graph theory.

A BN = (S, ΘS) consists of two components:

- Qualitative part: the structure S, a Directed Acyclic Graph (DAG) whose nodes are the random variables and whose arcs express direct dependence between variables.
- Quantitative part: the set of parameters ΘS, a set of conditional probability distributions (CPDs). For discrete variables the CPDs are multinomial (given as CPTs, conditional probability tables); for continuous variables a Normal or Gaussian distribution is used.

Markov condition: each node is independent of its non-descendants given its parents in S. The joint distribution is then factorized as

P(X1, …, Xn) = ∏i P(Xi | pa(Xi))

[Figure: the "Wet grass" BN from Hugin.]

Page 18


Example: Alarm Network (Pearl) — Bayesian Network

X = { B - Burglary, E - Earthquake, A - Alarm, J - JohnCalls, M - MaryCalls }

The Bayesian network has two components:
- the structure: a Directed Acyclic Graph with pa(B) = { }, pa(E) = { }, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}
- the parameters: a set of conditional probability tables (CPTs) P(B), P(E), P(A | B, E), P(J | A), P(M | A)

Model size reduced from 2^5 − 1 = 31 to 1 + 1 + 4 + 2 + 2 = 10 parameters.

Page 19


Example: Alarm Network (Pearl) — Factored Joint Probability Distribution

X = { B (Burglary), E (Earthquake), A (Alarm), J (JohnCalls), M (MaryCalls) }

P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
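A minimal sketch of how this factorization is used to compute any entry of the joint; the CPT numbers below are the illustrative values commonly used for this network in the literature, not given on the slides:

```python
# Factored joint: P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
p_b_true = 0.001                                   # P(B = true), assumed value
p_e_true = 0.002                                   # P(E = true), assumed value
p_a_true = {(True, True): 0.95, (True, False): 0.94,
            (False, True): 0.29, (False, False): 0.001}   # P(A = true | B, E)
p_j_true = {True: 0.90, False: 0.05}               # P(J = true | A)
p_m_true = {True: 0.70, False: 0.01}               # P(M = true | A)

def joint(b, e, a, j, m):
    """Probability of one full instantiation of (B, E, A, J, M)."""
    pb = p_b_true if b else 1 - p_b_true
    pe = p_e_true if e else 1 - p_e_true
    pa = p_a_true[(b, e)] if a else 1 - p_a_true[(b, e)]
    pj = p_j_true[a] if j else 1 - p_j_true[a]
    pm = p_m_true[a] if m else 1 - p_m_true[a]
    return pb * pe * pa * pj * pm

# P(no burglary, no earthquake, alarm rings, John and Mary both call)
print(joint(False, False, True, True, True))       # ~ 0.00063
```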

Page 20


Bayesian Networks: Summary

Bayesian networks are an efficient and effective representation of the joint probability distribution of a set of random variables:
- Efficient: local models; independence (d-separation)
- Effective: algorithms take advantage of the structure to compute posterior probabilities, compute the most probable instantiation, and support decision making

But there is more: statistical induction → LEARNING

adapted from © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.