Statistical Learning: Bayesian and ML
COMP155
Sections 20.1-20.2
May 2, 2007
Definitions
• a posteriori: derived from observed facts
• a priori: based on hypothesis or theory rather than experiment
Bayesian Learning
• Make predictions using all hypotheses, weighted by their probabilities
• Bayes’ rule: P(a | b) = α P(b | a) P(a)
• For each hypothesis hi, observed data d:
• P(hi | d) = α P(d | hi) P(hi)
• P(d | hi) is the likelihood of d under hypothesis hi
• P(hi) is the hypothesis prior
• α is a normalization constant: α = 1 / ∑i P(d | hi) P(hi) (a code sketch of this update follows)
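A minimal Python sketch of this update, assuming the hypotheses are given as parallel lists of priors and likelihoods (the function name and layout are ours, not from the slides):

    # Bayes' rule with normalization: P(h_i | d) = alpha * P(d | h_i) * P(h_i)
    def posterior(priors, likelihoods):
        unnormalized = [l * p for l, p in zip(likelihoods, priors)]
        alpha = 1.0 / sum(unnormalized)   # alpha = 1 / sum_i P(d | h_i) P(h_i)
        return [alpha * u for u in unnormalized]

The returned list always sums to 1, so it is a proper distribution over the hypotheses.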
Bayesian Learning
• We want to predict some quantity X:
  P(X | d) = ∑i P(X | d, hi) P(hi | d) = ∑i P(X | hi) P(hi | d)
• The second equality holds because each hypothesis fixes the prediction: X is independent of d given hi
• The predictions are weighted averages over the predictions of the individual hypotheses
Example
• Suppose we know that there are 5 kinds of bags of candy:

            cherry    lime    % of all bags
  Type 1     100%       0%         10%
  Type 2      75%      25%         20%
  Type 3      50%      50%         40%
  Type 4      25%      75%         20%
  Type 5       0%     100%         10%
Example: priors
• Given a new bag of candy, predict the type of the bag
• Five hypotheses:
• h1: bag is type 1, P(h1) = .1
• h2: bag is type 2, P(h2) = .2
• h3: bag is type 3, P(h3) = .4
• h4: bag is type 4, P(h4) = .2
• h5: bag is type 5, P(h5) = .1
• With no evidence, we use the hypothesis priors
Example: one lime candy
• Suppose we unwrap one candy and determine that it is lime. Here ∑i P(onelime | hi) P(hi) = 0.5, so α = 1 / 0.5 = 2.
• P(h1 | onelime) = α P(onelime | h1) P(h1) = 2 * (0 * 0.1) = 0
• P(h2 | onelime) = α P(onelime | h2) P(h2) = 2 * (0.25 * 0.2) = 0.1
• P(h3 | onelime) = α P(onelime | h3) P(h3) = 2 * (0.5 * 0.4) = 0.4
• P(h4 | onelime) = α P(onelime | h4) P(h4) = 2 * (0.75 * 0.2) = 0.3
• P(h5 | onelime) = α P(onelime | h5) P(h5) = 2 * (1.0 * 0.1) = 0.2
Example: two lime candies
• Suppose we unwrap another candy and it is also lime. Now ∑i P(twolime | hi) P(hi) = 0.325, so α = 1 / 0.325 ≈ 3.08.
• P(h1 | twolime) = α P(twolime | h1) P(h1) = 3.08 * (0 * 0.1) = 0
• P(h2 | twolime) = α P(twolime | h2) P(h2) = 3.08 * (0.0625 * 0.2) ≈ 0.04
• P(h3 | twolime) = α P(twolime | h3) P(h3) = 3.08 * (0.25 * 0.4) ≈ 0.31
• P(h4 | twolime) = α P(twolime | h4) P(h4) = 3.08 * (0.5625 * 0.2) ≈ 0.35
• P(h5 | twolime) = α P(twolime | h5) P(h5) = 3.08 * (1.0 * 0.1) ≈ 0.31
Example: n lime candies
• Suppose we unwrap n candies and they are all lime. With normalization constant α_n (see the sketch after this list):
• P(h1 | nlime) = α_n (0^n * 0.1)
• P(h2 | nlime) = α_n (0.25^n * 0.2)
• P(h3 | nlime) = α_n (0.5^n * 0.4)
• P(h4 | nlime) = α_n (0.75^n * 0.2)
• P(h5 | nlime) = α_n (1^n * 0.1)
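The whole candy calculation fits in a few lines of Python. This sketch (our own, not from the slides) reproduces the one-lime and two-lime posteriors above:

    # Posteriors after observing n lime candies in a row.
    lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i) for the five bag types
    priors    = [0.1, 0.2, 0.4, 0.2, 0.1]     # hypothesis priors P(h_i)

    def posteriors_after_n_limes(n):
        unnormalized = [p * (l ** n) for l, p in zip(lime_prob, priors)]
        alpha = 1.0 / sum(unnormalized)        # alpha_n depends on n
        return [alpha * u for u in unnormalized]

    print(posteriors_after_n_limes(1))  # [0.0, 0.1, 0.4, 0.3, 0.2]
    print(posteriors_after_n_limes(2))  # [0.0, 0.038..., 0.307..., 0.346..., 0.307...]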
Prediction: what candy is next?
• P(nextlime | nlime) = ∑i P(nextlime | hi) P(hi | nlime)
  = P(nextlime | h1) P(h1 | nlime) + P(nextlime | h2) P(h2 | nlime) + P(nextlime | h3) P(h3 | nlime) + P(nextlime | h4) P(h4 | nlime) + P(nextlime | h5) P(h5 | nlime)
  = 0 * α_n (0^n * 0.1) + 0.25 * α_n (0.25^n * 0.2) + 0.5 * α_n (0.5^n * 0.4) + 0.75 * α_n (0.75^n * 0.2) + 1 * α_n (1^n * 0.1)
• This probability approaches 1 as n grows; at n = 10, for example, it is ≈ 0.97 (computed in the sketch below)
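Continuing the sketch above, the prediction is just the posterior-weighted average of each hypothesis's lime probability:

    # P(next candy is lime | n limes observed), reusing posteriors_after_n_limes.
    def p_next_lime(n):
        post = posteriors_after_n_limes(n)
        return sum(l * q for l, q in zip(lime_prob, post))

    print(p_next_lime(10))  # ~0.97; the value approaches 1.0 as n grows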
Analysis: Bayesian Prediction
• The true hypothesis eventually dominates the posterior
• The posterior probability of any false hypothesis eventually vanishes
• The probability that a false hypothesis keeps generating uncharacteristic data becomes vanishingly small
• Bayesian prediction is optimal (given the hypothesis prior)
• Bayesian prediction is expensive
• The hypothesis space may be very large (or infinite)
MAP Approximation
• To avoid the expense of Bayesian learning, one approach is to simply choose the most probable hypothesis and assume it is correct
• MAP = maximum a posteriori
• hmap = the hi with the highest value of P(hi | d)
• In the candy example, after 3 limes have been selected, a MAP learner will always predict that the next candy is lime with 100% probability
• Less accurate, but much cheaper (see the sketch below)
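A self-contained sketch of a MAP learner for the candy example (our own illustration, not code from the slides); only the argmax matters, so the normalization constant can be skipped:

    lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)
    priors    = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_i)

    def map_next_lime(n):
        # Unnormalized posteriors are enough to pick h_map.
        post = [p * (l ** n) for l, p in zip(lime_prob, priors)]
        h_map = max(range(len(post)), key=lambda i: post[i])
        return lime_prob[h_map]               # predict with h_map alone

    print(map_next_lime(3))  # 1.0: after 3 limes, h5 (the all-lime bag) is the MAP hypothesis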
Avoiding Complexity
• As we’ve seen earlier, allowing overly complex hypotheses can lead to overfitting
• Bayesian and MAP learning use the hypothesis prior to penalize complex hypotheses
• Complex hypotheses typically have lower priors, since there are many more complex hypotheses than simple ones
• We get the simplest hypothesis consistent with the data (as per Ockham’s razor)
ML Approximation
• For large data sets, the priors become irrelevant; in this case we may use maximum likelihood (ML) learning
• Choose hml, the hi that maximizes P(d | hi)
• That is, choose the hypothesis that has the highest probability of generating the observed data
• Identical to MAP for uniform priors
• ML is the standard (non-Bayesian) statistical learning method (sketch below)
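The corresponding ML learner drops the prior entirely; again a sketch under the candy-example assumptions, not code from the slides:

    lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

    def ml_next_lime(n):
        likelihoods = [l ** n for l in lime_prob]   # P(n limes | h_i)
        h_ml = max(range(len(likelihoods)), key=lambda i: likelihoods[i])
        return lime_prob[h_ml]

    print(ml_next_lime(1))  # 1.0: h5 maximizes the likelihood of an all-lime run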
Exercise
• Suppose we were pulling candy from a 50/50 bag (type 3) or a 25/75 bag (type 4)
• With full Bayesian learning, what would the posterior probability and prediction plots look like after 100 candies?
• What would the prediction plots look like for MAP and ML learning after 1000 candies?
[Plots omitted: Bayesian posterior and prediction curves for the 50/50 bag and the 25/75 bag; MAP and ML prediction curves for each bag]