Bayesian Classification


Transcript

  • *Bayesian Classification

  • *Outline: Statistics Basics, Naive Bayes, Bayesian Network, Applications

  • *Example: the Monty Hall problem

  • *Goal: Mining Probability Models
    Probability basics: our state s in world W is distributed according to a probability distribution.
  • *Weather data set

    Outlook   Temperature  Humidity  Windy  Play
    sunny     hot          high      FALSE  no
    sunny     hot          high      TRUE   no
    overcast  hot          high      FALSE  yes
    rainy     mild         high      FALSE  yes
    rainy     cool         normal    FALSE  yes
    rainy     cool         normal    TRUE   no
    overcast  cool         normal    TRUE   yes
    sunny     mild         high      FALSE  no
    sunny     cool         normal    FALSE  yes
    rainy     mild         normal    FALSE  yes
    sunny     mild         normal    TRUE   yes
    overcast  mild         high      TRUE   yes
    overcast  hot          normal    FALSE  yes
    rainy     mild         high      TRUE   no

  • *Basics
    Unconditional or prior probability: Pr(Play=yes) + Pr(Play=no) = 1. Pr(Play=yes) is sometimes written as Pr(Play).
    The table has 9 yes and 5 no, so Pr(Play=yes) = 9/(9+5) = 9/14 and thus Pr(Play=no) = 5/14.
    Joint probability of Play and Windy: Pr(Play=x, Windy=y), summed over all values x and y, should be 1.

                Windy=TRUE  Windy=FALSE
    Play=yes    3/14        6/14
    Play=no     3/14        ?

  • *Probability Basics: Conditional Probability Pr(A|B)
    #(Windy=False) = 8; within those 8, #(Play=yes) = 6, so Pr(Play=yes | Windy=False) = 6/8.
    Pr(Windy=False) = 8/14, Pr(Play=yes) = 9/14.
    Applying Bayes rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A), so
    Pr(Windy=False | Play=yes) = (6/8 × 8/14) / (9/14) = 6/9.

    Windy  Play
    FALSE  no
    TRUE   no
    FALSE  yes
    FALSE  yes
    FALSE  yes
    TRUE   no
    TRUE   yes
    FALSE  no
    FALSE  yes
    FALSE  yes
    TRUE   yes
    TRUE   yes
    FALSE  yes
    TRUE   no
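
    As a sanity check on these numbers, here is a minimal Python sketch (not part of the original slides) that recovers Pr(Play=yes | Windy=False) = 6/8 and, via Bayes rule, Pr(Windy=False | Play=yes) = 6/9 from the Windy/Play column above; the variable names are illustrative.

    # (Windy, Play) pairs from the table above
    pairs = [("FALSE", "no"), ("TRUE", "no"), ("FALSE", "yes"), ("FALSE", "yes"),
             ("FALSE", "yes"), ("TRUE", "no"), ("TRUE", "yes"), ("FALSE", "no"),
             ("FALSE", "yes"), ("FALSE", "yes"), ("TRUE", "yes"), ("TRUE", "yes"),
             ("FALSE", "yes"), ("TRUE", "no")]

    n = len(pairs)                                                         # 14 instances
    n_false = sum(1 for w, _ in pairs if w == "FALSE")                     # 8
    n_false_yes = sum(1 for w, p in pairs if w == "FALSE" and p == "yes")  # 6
    n_yes = sum(1 for _, p in pairs if p == "yes")                         # 9

    p_yes_given_false = n_false_yes / n_false          # Pr(Play=yes | Windy=False) = 0.75

    # Bayes rule: Pr(Windy=False | Play=yes) = Pr(Play=yes | Windy=False) Pr(Windy=False) / Pr(Play=yes)
    p_false_given_yes = p_yes_given_false * (n_false / n) / (n_yes / n)

    print(p_yes_given_false, p_false_given_yes)        # 0.75  0.666... (= 6/9)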

  • *Conditional Independence
    A and P are independent given C: Pr(A | P, C) = Pr(A | C).
    Variables: C = Cavity, P = Probe Catches, A = Ache.

    C  A  P  Probability
    F  F  F  0.534
    F  F  T  0.356
    F  T  F  0.006
    F  T  T  0.004
    T  F  F  0.048
    T  F  T  0.012
    T  T  F  0.032
    T  T  T  0.008
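
    A short Python sketch (not from the slides) that checks this independence numerically: for each value of C, Pr(A | P, C) computed from the joint table above equals Pr(A | C), whatever the value of P.

    # Joint distribution Pr(C, A, P) from the table above (keys are (C, A, P) truth values)
    joint = {
        (False, False, False): 0.534, (False, False, True): 0.356,
        (False, True,  False): 0.006, (False, True,  True): 0.004,
        (True,  False, False): 0.048, (True,  False, True): 0.012,
        (True,  True,  False): 0.032, (True,  True,  True): 0.008,
    }

    def pr(pred):
        """Sum the joint probability over all (c, a, p) triples that satisfy the predicate."""
        return sum(v for k, v in joint.items() if pred(*k))

    for c in (False, True):
        p_a_given_c = pr(lambda cc, a, p: cc == c and a) / pr(lambda cc, a, p: cc == c)
        for pv in (False, True):
            p_a_given_pc = (pr(lambda cc, a, p: cc == c and p == pv and a)
                            / pr(lambda cc, a, p: cc == c and p == pv))
            print(c, pv, round(p_a_given_c, 4), round(p_a_given_pc, 4))  # the two values match in every row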

  • *Conditional Independence
    A and P are independent given C: Pr(A | P, C) = Pr(A | C) and also Pr(P | A, C) = Pr(P | C).

  • *Outline: Statistics Basics, Naive Bayes, Bayesian Network, Applications

  • *Naïve Bayesian Models
    Two assumptions: attributes are (1) equally important and (2) statistically independent (given the class value).
    This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known).
    Although based on assumptions that are almost never correct, this scheme works well in practice!

  • *Why Naïve?
    Assume the attributes are independent, given the class. What does that mean?
    (Figure: class node play with attribute nodes outlook, temp, humidity, windy)
    Pr(outlook=sunny | windy=true, play=yes) = Pr(outlook=sunny | play=yes)

  • *Weather data set

    Outlook   Windy  Play
    overcast  FALSE  yes
    rainy     FALSE  yes
    rainy     FALSE  yes
    overcast  TRUE   yes
    sunny     FALSE  yes
    rainy     FALSE  yes
    sunny     TRUE   yes
    overcast  TRUE   yes
    overcast  FALSE  yes

  • *Is the assumption satisfied?
    #yes = 9, #(sunny, yes) = 2, #(windy=true, yes) = 3, #(sunny, windy=true, yes) = 1

    Pr(outlook=sunny | windy=true, play=yes) = 1/3

    Pr(outlook=sunny | play=yes) = 2/9

    Pr(windy=true | outlook=sunny, play=yes) = 1/2

    Pr(windy=true | play=yes) = 3/9

    Thus, the assumption is NOT satisfied. But we can tolerate some errors (see later slides).

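    A small Python sketch (added for illustration, not part of the slides) that reproduces this counting argument from the nine Play = yes instances above; it prints 1/3 and 2/9, showing the two conditionals differ.

    # The nine Play = yes instances: (Outlook, Windy)
    yes_rows = [("overcast", False), ("rainy", False), ("rainy", False),
                ("overcast", True), ("sunny", False), ("rainy", False),
                ("sunny", True), ("overcast", True), ("overcast", False)]

    n_yes = len(yes_rows)                                              # 9
    n_sunny = sum(1 for o, w in yes_rows if o == "sunny")              # 2
    n_windy = sum(1 for o, w in yes_rows if w)                         # 3
    n_sunny_windy = sum(1 for o, w in yes_rows if o == "sunny" and w)  # 1

    print(n_sunny_windy / n_windy)   # Pr(outlook=sunny | windy=true, play=yes) = 1/3
    print(n_sunny / n_yes)           # Pr(outlook=sunny | play=yes)              = 2/9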

  • *Probabilities for the weather data

    Outlook        Yes  No    Temperature  Yes  No    Humidity  Yes  No    Windy  Yes  No    Play  Yes   No
    Sunny          2    3     Hot          2    2     High      3    4     False  6    2           9     5
    Overcast       4    0     Mild         4    2     Normal    6    1     True   3    3
    Rainy          3    2     Cool         3    1
    Sunny          2/9  3/5   Hot          2/9  2/5   High      3/9  4/5   False  6/9  2/5         9/14  5/14
    Overcast       4/9  0/5   Mild         4/9  2/5   Normal    6/9  1/5   True   3/9  3/5
    Rainy          3/9  2/5   Cool         3/9  1/5

    A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

  • *Likelihood of the two classes
    For yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
    For no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
    Conversion into a probability by normalization:
    P(yes | E) = 0.0053 / (0.0053 + 0.0206) = 0.205
    P(no | E) = 0.0206 / (0.0053 + 0.0206) = 0.795
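
    The same calculation as a minimal Python sketch (not part of the slides); the dictionaries simply hold the conditional probabilities read off the table above.

    # Conditional probabilities for the new day (Sunny, Cool, High, True), per class
    p_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "windy": 3/9}
    p_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "windy": 3/5}
    prior = {"yes": 9/14, "no": 5/14}

    like_yes = p_yes["sunny"] * p_yes["cool"] * p_yes["high"] * p_yes["windy"] * prior["yes"]
    like_no  = p_no["sunny"]  * p_no["cool"]  * p_no["high"]  * p_no["windy"]  * prior["no"]

    print(round(like_yes, 4), round(like_no, 4))      # 0.0053  0.0206
    print(round(like_yes / (like_yes + like_no), 3))  # P(yes | E) ≈ 0.205
    print(round(like_no  / (like_yes + like_no), 3))  # P(no  | E) ≈ 0.795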

  • *Bayes rule
    Probability of event H given evidence E: Pr(H | E) = Pr(E | H) Pr(H) / Pr(E)

    A priori probability of H: Pr(H), the probability of the event before evidence has been seen.
    A posteriori probability of H: Pr(H | E), the probability of the event after evidence has been seen.

  • *Naïve Bayes for classification
    Classification learning: what's the probability of the class given an instance?
    Evidence E = an instance
    Event H = class value for the instance (Play=yes or Play=no)
    Naïve Bayes assumption: the evidence can be split into independent parts (i.e. the attributes of the instance are independent given the class)

  • *The weather data example
    Evidence E: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True
    Probability for class yes:
    Pr(yes | E) = Pr(Sunny | yes) × Pr(Cool | yes) × Pr(High | yes) × Pr(True | yes) × Pr(yes) / Pr(E)
                = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr(E)

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

  • *The zero-frequency problem
    What if an attribute value doesn't occur with every class value (e.g. Outlook = overcast for class no)?
    The estimated probability will be zero, so the a posteriori probability will also be zero (no matter how likely the other values are!)
    Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator).
    Result: probabilities will never be zero! (This also stabilizes probability estimates.)
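
    A minimal sketch of the Laplace estimator (an illustrative helper, not from the slides), assuming we know how many distinct values the attribute can take.

    def laplace(count, class_total, n_values):
        """Laplace-smoothed estimate: add 1 to the count of every attribute value-class combination."""
        return (count + 1) / (class_total + n_values)

    # Attribute Outlook for class yes: counts sunny=2, overcast=4, rainy=3 out of 9; 3 possible values
    print(laplace(2, 9, 3), laplace(4, 9, 3), laplace(3, 9, 3))   # 3/12, 5/12, 4/12
    # Outlook = overcast for class no: the raw estimate 0/5 would be zero, smoothing gives (0+1)/(5+3)
    print(laplace(0, 5, 3))                                        # 0.125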

  • *Modified probability estimates
    In some cases adding a constant different from 1 might be more appropriate.
    Example: attribute Outlook for class yes, with a constant μ split across the three values:
    Sunny: (2 + μ·p1)/(9 + μ)    Overcast: (4 + μ·p2)/(9 + μ)    Rainy: (3 + μ·p3)/(9 + μ)
    The weights p1, p2, p3 don't need to be equal (but they must sum to 1).

  • *Missing values
    Training: the instance is not included in the frequency count for that attribute value-class combination.
    Classification: the attribute is omitted from the calculation.
    Example:

    Outlook  Temp.  Humidity  Windy  Play
    ?        Cool   High      True   ?

    Likelihood of yes = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
    Likelihood of no = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
    P(yes) = 0.0238 / (0.0238 + 0.0343) = 41%
    P(no) = 0.0343 / (0.0238 + 0.0343) = 59%

  • *Dealing with numeric attributes
    Usual assumption: attributes have a normal (Gaussian) probability distribution given the class.
    The probability density function for the normal distribution is defined by two parameters:

    The sample mean: μ = (x1 + x2 + … + xn) / n

    The standard deviation: σ = sqrt( Σi (xi − μ)² / (n − 1) )

    The density function: f(x) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )
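
    A small Python sketch of this density function (added for illustration); with the mean and standard deviation from the next slide it reproduces the example value f(temperature=66 | yes) ≈ 0.0340.

    import math

    def gaussian_density(x, mu, sigma):
        """Normal density f(x) with sample mean mu and standard deviation sigma."""
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # Temperature = 66 given Play = yes (mean 73, standard deviation 6.2)
    print(round(gaussian_density(66, 73, 6.2), 4))   # ≈ 0.034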

  • *Statistics for the weather data
    Example density value: f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · exp( −(66 − 73)² / (2 · 6.2²) ) = 0.0340

    Outlook        Yes  No    Temperature  Yes   No     Humidity  Yes   No     Windy  Yes  No    Play  Yes   No
    Sunny          2    3                  83    85               86    85     False  6    2           9     5
    Overcast       4    0                  70    80               96    90     True   3    3
    Rainy          3    2                  68    65               80    70
    Sunny          2/9  3/5   mean         73    74.6   mean      79.1  86.2   False  6/9  2/5         9/14  5/14
    Overcast       4/9  0/5   std dev      6.2   7.9    std dev   10.2  9.7    True   3/9  3/5
    Rainy          3/9  2/5

  • *Classifying a new day
    A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    66     90        true   ?

    Missing values during training: not included in the calculation of the mean and standard deviation.

    Likelihood of yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
    Likelihood of no = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
    P(yes) = 0.000036 / (0.000036 + 0.000136) = 20.9%
    P(no) = 0.000136 / (0.000036 + 0.000136) = 79.1%

  • *Probability densities
    Relationship between probability and density: Pr(x − ε/2 ≤ X ≤ x + ε/2) ≈ ε · f(x)

    But this doesn't change the calculation of a posteriori probabilities, because ε cancels out.
    Exact relationship: Pr(a ≤ X ≤ b) = ∫ from a to b of f(t) dt

  • *Example of Naïve Bayes in Weka
    Use the Weka Naïve Bayes module to classify Weather.nominal.arff


  • *Discussion of Naïve Bayes
    Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated).
    Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class.
    However: adding too many redundant attributes will cause problems (e.g. identical attributes).
    Note also: many numeric attributes are not normally distributed.

  • *Outline: Statistics Basics, Naive Bayes, Bayesian Network, Applications

  • *Conditional Independence
    Can encode the joint probability distribution in compact form.
    (Figure: nodes Cavity, Probe Catches, Ache; each node carries a conditional probability table (CPT).)

    C  A  P  Probability
    F  F  F  0.534
    F  F  T  0.356
    F  T  F  0.006
    F  T  T  0.004
    T  F  F  0.012
    T  F  T  0.048
    T  T  F  0.008
    T  T  T  0.032

  • *Creating a Network
    1: A Bayes net is a representation of a joint probability distribution (JPD).
    2: A Bayes net is a set of conditional independence statements.
    If we create the correct structure, one that represents causality, then we get a good network, i.e. one that is small (easy to compute with) and easy to fill in with numbers.

  • *Example
    My house alarm system just sounded (A).
    Both an earthquake (E) and a burglary (B) could set it off.
    John will probably hear the alarm; if so he'll call (J). But sometimes John calls even when the alarm is silent.
    Mary might hear the alarm and call too (M), but not as reliably.
    We could be assured a complete and consistent model by fully specifying the joint distribution:
    Pr(A, E, B, J, M), Pr(A, E, B, J, ~M), etc.

  • *Structural Models
    Instead of starting with numbers, we will start with structural relationships among the variables:

    There is a direct causal relationship from Earthquake to Alarm.
    There is a direct causal relationship from Burglar to Alarm.
    There is a direct causal relationship from Alarm to JohnCall.
    Earthquake and Burglar tend to occur independently.
    etc.

  • *Possible Bayesian Network

  • *Graphical Models and Problem Parameters
    What probabilities need I specify to ensure a complete, consistent model, given
    the variables I have identified, and
    the dependence and independence relationships I have specified by building a graph structure?

    Answer: provide an unconditional (prior) probability for every node in the graph with no parents;
    for all remaining nodes, provide a conditional probability table
    Prob(Child | Parent1, Parent2, Parent3) for all possible combinations of Parent1, Parent2, Parent3 values.

  • *Complete Bayesian Network (nodes: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)
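
    To make the "fill in the numbers" step concrete, here is a Python sketch of exact inference by enumerating the joint distribution that the network's CPTs define. The CPT values below are NOT from these slides; they are commonly used textbook numbers for this alarm network (Russell & Norvig) and are only illustrative.

    import itertools

    # Illustrative CPTs for the Burglary/Earthquake/Alarm/JohnCalls/MaryCalls network
    # (standard textbook values, not taken from the slides).
    P_B = 0.001                        # Pr(Burglary)
    P_E = 0.002                        # Pr(Earthquake)
    P_A = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # Pr(Alarm | Burglary, Earthquake)
    P_J = {True: 0.90, False: 0.05}    # Pr(JohnCalls | Alarm)
    P_M = {True: 0.70, False: 0.01}    # Pr(MaryCalls | Alarm)

    def joint(b, e, a, j, m):
        """Pr(B,E,A,J,M) = Pr(B) Pr(E) Pr(A|B,E) Pr(J|A) Pr(M|A), following the network structure."""
        p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
        p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
        p *= P_J[a] if j else 1 - P_J[a]
        p *= P_M[a] if m else 1 - P_M[a]
        return p

    # Query Pr(Burglary | JohnCalls=True, MaryCalls=True) by summing out Earthquake and Alarm
    num = sum(joint(True, e, a, True, True) for e, a in itertools.product([True, False], repeat=2))
    den = sum(joint(b, e, a, True, True) for b, e, a in itertools.product([True, False], repeat=3))
    print(round(num / den, 3))   # ≈ 0.284 with these illustrative numbers

    Note how compact the specification is: the five CPTs above contain 1 + 1 + 4 + 2 + 2 = 10 numbers, whereas fully specifying Pr(A, E, B, J, M) directly would require 2^5 − 1 = 31 independent entries.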

  • *Microsoft Bayesian Belief Net
    http://research.microsoft.com/adapt/MSBNx/
    Can be used to construct and reason with Bayesian Networks.
    Consider the example.


  • *Mining for Structural Models
    Structural models are difficult to mine. Some methods have been proposed, but up to now there are no good results yet.
    Mining structure often requires domain experts' knowledge.
    Once set up, a Bayesian Network can be used to provide probability estimates for queries.