BAYESIAN CLASSIFICATION
Bayesian Classification: a statistical classifier
Probabilistic prediction: predicts class membership probabilities
Based on Bayes' theorem
Naive Bayesian classifier: comparable performance with decision tree and selected neural network classifiers
Good accuracy and speed when applied to large databases
Incremental: each new training example can incrementally raise or lower the probability that a hypothesis is correct
Naive Bayesian Classifier assumes class-conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes, which simplifies the computations
Bayesian Belief Networks: graphical models that represent dependencies among subsets of attributes
Bayes' Theorem: Basics
Let X be a data sample whose class label is unknown.
Let H be a hypothesis that X belongs to class C.
Classification determines P(H|X), the posterior probability that the hypothesis holds given the observed data sample X.
P(H) is the prior probability: the initial probability of H, before X is observed.
P(X) is the probability that the sample data is observed.
P(X|H) is the likelihood: the probability of observing the sample X given that the hypothesis holds.
Example: X is a round, red fruit and H is the hypothesis that X is an apple.
Bayes' Theorem
Given a data sample X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
The classifier predicts that X belongs to Ci iff P(Ci|X) is the highest among all P(Ck|X) for the k classes.
Practical difficulty: this requires initial knowledge of many probabilities and carries significant computational cost.
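A quick numeric instance of Bayes' rule for the fruit example above; the three probabilities are illustrative assumptions, not values from the source:

```python
# Hypothetical numbers for the round-red-fruit example (assumed for illustration).
p_h = 0.20          # P(H): prior probability that a random fruit is an apple
p_x_given_h = 0.90  # P(X|H): probability that an apple is round and red
p_x = 0.30          # P(X): probability that any fruit is round and red

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6: observing "round and red" raises P(apple) from 0.2 to 0.6
```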
Naive Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm.
Classification derives the maximum posteriori hypothesis, i.e., the class Ci with the maximal P(Ci|X).
By Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X).
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
If all classes are assumed equally likely, it suffices to maximize P(X|Ci).
Simplifying (naive) assumption: the attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
Derivation of the Naive Bayes Classifier
The independence assumption greatly reduces the computation cost: only the class distributions need to be counted.
If Ak is categorical, P(xk|Ci) = sik / si, where sik is the number of tuples in Ci having value xk for Ak and si is the number of training tuples belonging to Ci.
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μCi and standard deviation σCi:
P(xk|Ci) = g(xk, μCi, σCi), where g(x, μ, σ) = (1 / (√(2π) σ)) exp(-(x - μ)^2 / (2σ^2))
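A minimal sketch of these estimates in Python; the function names and the categorical-only training loop are assumptions for illustration (the Gaussian density would replace the frequency ratio for continuous attributes):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    # si: number of training tuples in each class Ci
    class_counts = Counter(labels)
    # sik: number of tuples in Ci having value xk for attribute Ak
    value_counts = defaultdict(Counter)
    for x, c in zip(tuples, labels):
        for k, xk in enumerate(x):
            value_counts[(c, k)][xk] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}  # P(Ci)
    return priors, class_counts, value_counts

def gaussian(x, mu, sigma):
    # g(x, mu, sigma): density used for continuous-valued attributes
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(x, priors, class_counts, value_counts):
    # Choose the class maximizing P(Ci) * prod_k P(xk|Ci)
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, xk in enumerate(x):
            score *= value_counts[(c, k)][xk] / class_counts[c]  # sik / si
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```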
Example (the 14-tuple buys_computer training data; the tuple to classify is X = (age <= 30, income = medium, student = yes, credit_rating = fair)):
P(Ci): P(buys_computer = yes) = 9/14 = 0.643; P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age <= 30 | yes) = 2/9, P(age <= 30 | no) = 3/5
P(income = medium | yes) = 4/9, P(income = medium | no) = 2/5
P(student = yes | yes) = 6/9, P(student = yes | no) = 1/5
P(credit_rating = fair | yes) = 6/9, P(credit_rating = fair | no) = 2/5
P(X | yes) = 2/9 × 4/9 × 6/9 × 6/9 = 0.044, and P(X | no) = 3/5 × 2/5 × 1/5 × 2/5 = 0.019
P(X | yes) P(yes) = 0.044 × 0.643 = 0.028, and P(X | no) P(no) = 0.019 × 0.357 = 0.007
Since 0.028 > 0.007, the classifier predicts buys_computer = yes for X.
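The same arithmetic as a short check, with the values copied from the worked example above:

```python
p_yes, p_no = 9 / 14, 5 / 14
p_x_yes = (2/9) * (4/9) * (6/9) * (6/9)   # P(X|yes) ~ 0.044
p_x_no = (3/5) * (2/5) * (1/5) * (2/5)    # P(X|no)  ~ 0.019
print(p_x_yes * p_yes, p_x_no * p_no)     # ~0.028 vs ~0.007 -> predict yes
```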
Disadvantages
The class-conditional independence assumption costs accuracy, because in practice dependencies exist among variables.
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.).
Dependencies among these attributes cannot be modeled by a naive Bayesian classifier.
Bayesian Belief Networks
Model dependencies between variables.
Defined by two components: a directed acyclic graph (DAG) and a conditional probability table (CPT) for each variable.
A Bayesian belief network allows a subset of the variables to be conditionally independent.
A graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution:
P(x1, ..., xn) = P(x1 | Parents(x1)) × ... × P(xn | Parents(xn))
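A toy sketch of this factorization for an assumed two-variable network (FamilyHistory -> LungCancer); the CPT numbers are made up for illustration:

```python
# Assumed network: FamilyHistory (FH) -> LungCancer (LC).
p_fh = {True: 0.1, False: 0.9}                 # CPT of the root node: P(FH)
p_lc_true_given_fh = {True: 0.3, False: 0.01}  # CPT row: P(LC=True | FH)

def joint(fh, lc):
    # Belief-network factorization: P(FH, LC) = P(FH) * P(LC | FH)
    p_lc = p_lc_true_given_fh[fh] if lc else 1 - p_lc_true_given_fh[fh]
    return p_fh[fh] * p_lc

print(joint(True, True))  # 0.1 * 0.3 ~ 0.03
```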
Training Bayesian Networks
Several learning scenarios:
Network structure given and all variables observable: learn only the CPTs.
Network structure known but some variables hidden: gradient descent (greedy hill-climbing) methods, analogous to neural network learning.
Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
Structure unknown and all variables hidden: no good algorithms are known for this purpose.
The fully observable case reduces to counting, as sketched below.
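A minimal sketch of CPT learning by conditional frequency counting, assuming the two-node structure FH -> LC from the earlier toy example and fully observed training records:

```python
from collections import Counter

# Assumed fully observed training records as (FH, LC) pairs.
records = [(True, True), (True, False), (False, False), (False, False), (True, True)]

pair_counts = Counter(records)
fh_counts = Counter(fh for fh, _ in records)

# Each CPT entry P(LC=True | FH=fh) is just a conditional frequency.
cpt = {fh: pair_counts[(fh, True)] / fh_counts[fh] for fh in fh_counts}
print(cpt)  # {True: 0.666..., False: 0.0}
```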