BAYESIAN CLASSIFICATION
Bayesian Classification: a statistical classifier
Probabilistic prediction: predicts class membership probabilities
Based on Bayes' theorem
Naive Bayesian classifier: comparable performance with decision tree and selected neural network classifiers
Good accuracy and speed when applied to large databases
Incremental: each new training example can incrementally raise or lower the probability that a hypothesis is correct
Naive Bayesian Classifier assumes class-conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes, which simplifies the computations
Bayesian Belief Networks: graphical models that represent dependencies among subsets of attributes
Bayes' Theorem: Basics
Let X be a data sample whose class label is unknown.
Let H be a hypothesis that X belongs to class C.
Classification determines P(H|X), the posterior probability that the hypothesis holds given the observed data sample X.
P(H) is the prior probability: the initial probability of H, before X is observed.
P(X) is the probability that the sample data is observed.
P(X|H) is the likelihood: the probability of observing the sample X given that the hypothesis holds.
Example: X is a round, red fruit and H is the hypothesis that X is an apple.
Bayes' Theorem
Given a data sample X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
The classifier predicts that X belongs to Ci iff P(Ci|X) is the highest among all P(Ck|X) for the k classes.
Practical difficulty: this requires initial knowledge of many probabilities and carries significant computational cost.
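A quick numeric instance of Bayes' rule for the fruit example above; the three probabilities are illustrative assumptions, not values from the source:

```python
# Hypothetical numbers for the round-red-fruit example (assumed for illustration).
p_h = 0.20          # P(H): prior probability that a random fruit is an apple
p_x_given_h = 0.90  # P(X|H): probability that an apple is round and red
p_x = 0.30          # P(X): probability that any fruit is round and red

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6: observing "round and red" raises P(apple) from 0.2 to 0.6
```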
Naive Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm.
Classification derives the maximum posteriori hypothesis, i.e., the class Ci with the maximal P(Ci|X).
By Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X).
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
If all classes are assumed equally likely, it suffices to maximize P(X|Ci).
Simplifying (naive) assumption: the attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
Derivation of the Naive Bayes Classifier
The independence assumption greatly reduces the computation cost: only the class distributions need to be counted.
If Ak is categorical, P(xk|Ci) = sik / si, where sik is the number of tuples in Ci having value xk for Ak and si is the number of training tuples belonging to Ci.
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μCi and standard deviation σCi:
P(xk|Ci) = g(xk, μCi, σCi), where g(x, μ, σ) = (1 / (√(2π) σ)) exp(-(x - μ)^2 / (2σ^2))
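A minimal sketch of these estimates in Python; the function names and the categorical-only training loop are assumptions for illustration (the Gaussian density would replace the frequency ratio for continuous attributes):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    # si: number of training tuples in each class Ci
    class_counts = Counter(labels)
    # sik: number of tuples in Ci having value xk for attribute Ak
    value_counts = defaultdict(Counter)
    for x, c in zip(tuples, labels):
        for k, xk in enumerate(x):
            value_counts[(c, k)][xk] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}  # P(Ci)
    return priors, class_counts, value_counts

def gaussian(x, mu, sigma):
    # g(x, mu, sigma): density used for continuous-valued attributes
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(x, priors, class_counts, value_counts):
    # Choose the class maximizing P(Ci) * prod_k P(xk|Ci)
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, xk in enumerate(x):
            score *= value_counts[(c, k)][xk] / class_counts[c]  # sik / si
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```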
Example (the 14-tuple buys_computer training data; the tuple to classify is X = (age <= 30, income = medium, student = yes, credit_rating = fair)):
P(Ci): P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age <= 30 | yes) = 2/9, P(age <= 30 | no) = 3/5
P(income = medium | yes) = 4/9, P(income = medium | no) = 2/5
P(student = yes | yes) = 6/9, P(student = yes | no) = 1/5
P(credit_rating = fair | yes) = 6/9, P(credit_rating = fair | no) = 2/5
P(X | yes) = 2/9 × 4/9 × 6/9 × 6/9 = 0.044, and P(X | no) = 3/5 × 2/5 × 1/5 × 2/5 = 0.019
P(X | yes) P(yes) = 0.044 × 0.643 = 0.028, and P(X | no) P(no) = 0.019 × 0.357 = 0.007
Since 0.028 > 0.007, the classifier predicts buys_computer = yes for X.
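The same arithmetic as a short check, with the values copied from the worked example above:

```python
p_yes, p_no = 9 / 14, 5 / 14
p_x_yes = (2/9) * (4/9) * (6/9) * (6/9)   # P(X|yes) ~ 0.044
p_x_no = (3/5) * (2/5) * (1/5) * (2/5)    # P(X|no)  ~ 0.019
print(p_x_yes * p_yes, p_x_no * p_no)     # ~0.028 vs ~0.007 -> predict yes
```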
Disadvantages
The class-conditional independence assumption costs accuracy, because in practice dependencies exist among variables.
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.).
Dependencies among these attributes cannot be modeled by a naive Bayesian classifier.
Bayesian Belief Networks
Model dependencies between variables.
Defined by two components: a directed acyclic graph (DAG) and a conditional probability table (CPT) for each variable.
A Bayesian belief network allows a subset of the variables to be conditionally independent.
A graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution:
P(x1, ..., xn) = P(x1 | Parents(x1)) × ... × P(xn | Parents(xn))
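A toy sketch of this factorization for an assumed two-variable network (FamilyHistory -> LungCancer); the CPT numbers are made up for illustration:

```python
# Assumed network: FamilyHistory (FH) -> LungCancer (LC).
p_fh = {True: 0.1, False: 0.9}                 # CPT of the root node: P(FH)
p_lc_true_given_fh = {True: 0.3, False: 0.01}  # CPT row: P(LC=True | FH)

def joint(fh, lc):
    # Belief-network factorization: P(FH, LC) = P(FH) * P(LC | FH)
    p_lc = p_lc_true_given_fh[fh] if lc else 1 - p_lc_true_given_fh[fh]
    return p_fh[fh] * p_lc

print(joint(True, True))  # 0.1 * 0.3 ~ 0.03
```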
Training Bayesian Networks
Several learning scenarios:
Network structure given and all variables observable: learn only the CPTs.
Network structure known but some variables hidden: gradient descent (greedy hill-climbing) methods, analogous to neural network learning.
Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
Structure unknown and all variables hidden: no good algorithms are known for this purpose.
The fully observable case reduces to counting, as sketched below.
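A minimal sketch of CPT learning by conditional frequency counting, assuming the two-node structure FH -> LC from the earlier toy example and fully observed training records:

```python
from collections import Counter

# Assumed fully observed training records as (FH, LC) pairs.
records = [(True, True), (True, False), (False, False), (False, False), (True, True)]

pair_counts = Counter(records)
fh_counts = Counter(fh for fh, _ in records)

# Each CPT entry P(LC=True | FH=fh) is just a conditional frequency.
cpt = {fh: pair_counts[(fh, True)] / fh_counts[fh] for fh in fh_counts}
print(cpt)  # {True: 0.666..., False: 0.0}
```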