
    CHAPTER 2

    Estimating Probabilities

Machine Learning
Copyright © 2016. Tom M. Mitchell. All rights reserved.

    *DRAFT OF January 24, 2016*

*PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR’S PERMISSION*

    This is a rough draft chapter intended for inclusion in the upcoming second

    edition of the textbook  Machine Learning, T.M. Mitchell, McGraw Hill.

    You are welcome to use this for educational purposes, but do not duplicate

    or repost it on the internet. For online copies of this and other materials

related to this book, visit the web site www.cs.cmu.edu/∼tom/mlbook.html. Please send suggestions for improvements, or suggested exercises, to [email protected].

Many machine learning methods depend on probabilistic approaches. The reason is simple: when we are interested in learning some target function f : X → Y, we can more generally learn the probabilistic function P(Y|X). By using a probabilistic approach, we can design algorithms that learn functions with uncertain outcomes (e.g., predicting tomorrow's stock price) and that incorporate prior knowledge to guide learning (e.g., a bias that tomorrow's stock price is likely to be similar to today's price). This chapter describes joint probability distributions over many variables, and shows how they can be used to calculate a target P(Y|X). It also considers the problem of learning, or estimating, probability distributions from training data, presenting the two most common approaches: maximum likelihood estimation and maximum a posteriori estimation.

    1 Joint Probability Distributions

    The key to building probabilistic models is to define a set of random variables,

    and to consider the joint probability distribution over them. For example, Table

1 defines a joint probability distribution over three random variables: a person's Gender, the number of hours they work each week (HoursWorked), and their Wealth.

Table 1: A joint probability distribution over Gender, HoursWorked and Wealth. Each row gives the probability of one joint assignment of values to the three variables, and the probabilities across all rows sum to one. (The numeric entries of the table are not preserved in this copy.)

Given this joint distribution, we can calculate:

•   Any joint probability defined over subsets of the variables. For example, we can calculate P(Wealth=rich ∧ Gender=female) = 0.0362, by summing the two table rows that satisfy this joint assignment.

    •   Any conditional probability defined over subsets of the variables. Recall

the definition of conditional probability P(Y|X) = P(X ∧ Y)/P(X). We can calculate both the numerator and denominator in this definition by summing appropriate rows, to obtain the conditional probability. For example, according to Table 1, P(Wealth=rich | Gender=female) = 0.0362/0.3315 = 0.1092.

    To summarize, if we know the joint probability distribution over an arbi-

trary set of random variables {X1 . . . Xn}, then we can calculate the conditional and joint probability distributions for arbitrary subsets of these variables (e.g., P(Xn | X1 . . . Xn−1)). In theory, we can in this way solve any classification, regression, or other function approximation problem defined over these variables, and furthermore produce probabilistic rather than deterministic predictions for any given input to the target function.¹ For example, if we wish to learn to

    predict which people are rich or poor based on their gender and hours worked,

    we can use the above approach to simply calculate the probability distribution

    P(Wealth |  Gender, HoursWorked).
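To make this concrete, here is a minimal Python sketch of the computation: it stores a small joint distribution as a dictionary and obtains conditional probabilities by summing the appropriate rows. The joint probabilities in the dictionary are hypothetical placeholders (the entries of Table 1 are not reproduced above); only the summing logic follows the bullets.

    # Minimal sketch: conditional probabilities from a joint distribution table.
    # The probabilities below are hypothetical placeholders, not the values of Table 1.
    joint = {
        # (Gender, HoursWorked, Wealth): probability
        ("female", "<40.5", "poor"): 0.25,
        ("female", "<40.5", "rich"): 0.03,
        ("female", ">40.5", "poor"): 0.04,
        ("female", ">40.5", "rich"): 0.01,
        ("male",   "<40.5", "poor"): 0.33,
        ("male",   "<40.5", "rich"): 0.10,
        ("male",   ">40.5", "poor"): 0.17,
        ("male",   ">40.5", "rich"): 0.07,
    }

    def prob(event):
        """Sum the rows of the joint table that satisfy the partial assignment `event`."""
        return sum(p for (g, h, w), p in joint.items()
                   if event.get("Gender", g) == g
                   and event.get("HoursWorked", h) == h
                   and event.get("Wealth", w) == w)

    # P(Wealth=rich | Gender=female) = P(rich and female) / P(female)
    p_cond = prob({"Wealth": "rich", "Gender": "female"}) / prob({"Gender": "female"})
    print(round(p_cond, 4))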

    1.1 Learning the Joint Distribution

    How can we learn joint distributions from observed training data? In the example

    of Table 1 it will be easy if we begin with a large database containing, say, descrip-

    tions of a million people in terms of their values for our three variables. Given a

    large data set such as this, one can easily estimate a probability for each row in the

    table by calculating the fraction of database entries (people) that satisfy the joint

    assignment specified for that row. If thousands of database entries fall into each

    row, we will obtain highly reliable probability estimates using this strategy.

    In other cases, however, it can be difficult to learn the joint distribution due to

    the very large amount of training data required. To see the point, consider how our

    learning problem would change if we were to add additional variables to describe

a total of 100 boolean features for each person in Table 1 (e.g., we could add "do they have a college degree?", "are they healthy?"). Given 100 boolean features, the number of rows in the table would now expand to 2^100, which is greater than 10^30. Unfortunately, even if our database describes every single person on earth, we would not have enough data to obtain reliable probability estimates for most rows. There are only approximately 10^10 people on earth, which means that for most of the 10^30 rows in our table, we would have zero training examples! This

is a significant problem given that real-world machine learning applications often use many more than 100 features to describe each example, and that learning such probability terms is central to probabilistic machine learning algorithms.

¹Of course, if our random variables have continuous values instead of discrete values, we would need an infinitely large table. In such cases we represent the joint distribution by a function instead of a table, but the principles for using the joint distribution remain unchanged.

    To successfully address the issue of learning probabilities from available train-

    ing data, we must (1) be smart about how we estimate probability parameters from

available data, and (2) be smart about how we represent joint probability distributions.

    2 Estimating Probabilities

    Let us begin our discussion of how to estimate probabilities with a simple exam-

    ple, and explore two intuitive algorithms. It will turn out that these two intuitive

    algorithms illustrate the two primary approaches used in nearly all probabilistic

    machine learning algorithms.

    In this simple example you have a coin, represented by the random variable

X. If you flip this coin, it may turn up heads (indicated by X = 1) or tails (X = 0). The learning task is to estimate the probability that it will turn up heads; that is, to estimate P(X = 1). We will use θ to refer to the true (but unknown) probability of heads (e.g., P(X = 1) = θ), and use θ̂ to refer to our learned estimate of this true θ. You gather training data by flipping the coin n times, and observe that it turns up heads α1 times, and tails α0 times. Of course n = α1 + α0.

What is the most intuitive approach to estimating θ = P(X = 1) from this training data? Most people immediately answer that we should estimate the probability

    by the fraction of flips that result in heads:

    Probability estimation Algorithm 1 (maximum likelihood). Given

observed training data producing α1 total "heads," and α0 total "tails," output the estimate

θ̂ = α1 / (α1 + α0)

    For example, if we flip the coin 50 times, observing 24 heads and 26 tails, then

we will estimate θ̂ = 0.48.

This approach is quite reasonable, and very intuitive. It is a good approach

    when we have plenty of training data. However, notice that if the training data is

    very scarce it can produce unreliable estimates. For example, if we observe only

3 flips of the coin, we might observe α1 = 1 and α0 = 2, producing the estimate θ̂ = 0.33. How would we respond to this? If we have prior knowledge about the coin – for example, if we recognize it as a government-minted coin which is likely to have θ close to 0.5 – then we might respond by still believing the probability is closer to 0.5 than to the Algorithm 1 estimate θ̂ = 0.33. This leads to our second intuitive algorithm: an algorithm that enables us to incorporate prior assumptions

    along with observed training data to produce our final estimate. In particular,

    Algorithm 2 allows us to express our prior assumptions or knowledge about the

    coin by adding in any number of  imaginary coin flips resulting in heads or tails.

We can use this option of introducing γ1 imaginary heads, and γ0 imaginary tails, to express our prior assumptions:


Probability estimation Algorithm 2 (maximum a posteriori probability). Given observed training data producing α1 observed "heads," and α0 observed "tails," plus prior information expressed by introducing γ1 imaginary "heads" and γ0 imaginary "tails," output the estimate

θ̂ = (α1 + γ1) / ((α1 + γ1) + (α0 + γ0))

    Note that Algorithm 2, like Algorithm 1, produces an estimate based on the

    proportion of coin flips that result in ”heads.” The only difference is that Algo-

    rithm 2 allows including optional imaginary flips that represent our prior assump-

tions about θ, in addition to actual observed data. Algorithm 2 has several attractive properties:

    •  It is easy to incorporate our prior assumptions about the  value  of  θ  by ad-

justing the ratio of γ1 to γ0. For example, if we have reason to assume that θ = 0.7 we can add in γ1 = 7 imaginary flips with X = 1, and γ0 = 3 imaginary flips for X = 0.

•   It is easy to express our degree of certainty about our prior knowledge, by adjusting the total volume of imaginary coin flips. For example, if we are

highly certain of our prior belief that θ = 0.7, then we might use priors of γ1 = 700 and γ0 = 300 instead of γ1 = 7 and γ0 = 3. By increasing the volume of imaginary examples, we effectively require a greater volume of

    contradictory observed data in order to produce a final estimate far from our

    prior assumed value.

•   If we set γ1 = γ0 = 0, then Algorithm 2 produces exactly the same estimate as Algorithm 1. Algorithm 1 is just a special case of Algorithm 2.

•   Asymptotically, as the volume of actual observed data grows toward infinity, the influence of our imaginary data goes to zero (the fixed number of

    imaginary coin flips becomes insignificant compared to a sufficiently large

    number of actual observations). In other words, Algorithm 2 behaves so

    that priors have the strongest influence when observations are scarce, and

    their influence gradually reduces as observations become more plentiful.
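A minimal Python sketch of the two estimators follows; the counts in the example calls (24 heads and 26 tails, and the scarce-data case of 1 head and 2 tails with γ1 = 7, γ0 = 3 imaginary flips) are the ones used in the text.

    # Sketch of the two intuitive estimators for theta = P(X = 1).

    def mle_estimate(alpha1, alpha0):
        """Algorithm 1 (MLE): fraction of observed flips that came up heads."""
        return alpha1 / (alpha1 + alpha0)

    def map_estimate(alpha1, alpha0, gamma1, gamma0):
        """Algorithm 2 (MAP): same fraction, after adding imaginary flips."""
        return (alpha1 + gamma1) / ((alpha1 + gamma1) + (alpha0 + gamma0))

    print(mle_estimate(24, 26))        # 0.48, as in the text
    print(map_estimate(1, 2, 7, 3))    # scarce data pulled toward the prior belief theta ~ 0.7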

Both Algorithm 1 and Algorithm 2 are intuitively quite compelling. In fact, these two algorithms exemplify the two most widely used approaches to machine

    learning of probabilistic models from training data.  They can be shown to follow

    from two different underlying principles. Algorithm 1 follows a principle called

Maximum Likelihood Estimation (MLE), in which we seek an estimate of θ that maximizes the probability of the observed data. In fact we can prove (and will,

below) that Algorithm 1 outputs an estimate of θ that makes the observed data more probable than any other possible estimate of θ. Algorithm 2 follows a different principle called Maximum a Posteriori (MAP) estimation, in which we seek


    Figure 1:  MLE and MAP estimates of  θ  as the number of coin flips grows. Data was

generated by a random number generator that output a value of 1 with probability θ = 0.3, and a value of 0 with probability (1 − θ) = 0.7. Each plot shows the two estimates of θ as the number of observed coin flips grows. Plots on the left correspond to values of γ1 and γ0 that reflect the correct prior assumption about the value of θ; plots on the right reflect the incorrect prior assumption that θ is most probably 0.4. Plots in the top row reflect lower confidence in the prior assumption, by including only 60 = γ1 + γ0 imaginary data points, whereas bottom plots assume 120. Note as the size of the data grows, the MLE

    and MAP estimates converge toward each other, and toward the correct estimate for  θ.

    the estimate of  θ  that is most probable, given the observed data, plus background assumptions about its value.  Thus, the difference between these two principles

    is that Algorithm 2 assumes background knowledge is available, whereas Algo-

rithm 1 does not. Both principles have been widely used to derive and to justify a vast range of machine learning algorithms, from Bayesian networks, to linear

    regression, to neural network learning. Our coin flip example represents just one

    of many such learning problems.

    The experimental behavior of these two algorithms is shown in Figure 1. Here

    the learning task is to estimate the unknown value of  θ = P( X  =  1) for a boolean-valued random variable  X , based on a sample of  n  values of  X  drawn indepen-

dently (e.g., n independent flips of a coin with probability θ of heads). In this figure, the true value of θ is 0.3, and the same sequence of training examples is


    used in each plot. Consider first the plot in the upper left. The blue line shows

the estimates of θ produced by Algorithm 1 (MLE) as the number n of training examples grows. The red line shows the estimates produced by Algorithm 2, using the same training examples and using priors γ0 = 42 and γ1 = 18. This prior assumption aligns with the correct value of θ (i.e., γ1/(γ1 + γ0) = 0.3). Note that as the number of training example coin flips grows, both algorithms converge

toward the correct estimate of θ, though Algorithm 2 provides much better estimates than Algorithm 1 when little data is available. The bottom left plot shows

    the estimates if Algorithm 2 uses even more confident priors, captured by twice

as many hallucinated examples (γ0 = 84 and γ1 = 36). The two plots on the right side of the figure show the estimates produced when Algorithm 2 (MAP) uses incorrect priors (where γ1/(γ1 + γ0) = 0.4). The difference between the top right and bottom right plots is again only a difference in the number of hallucinated

    examples, reflecting the difference in confidence that  θ  should be close to 0.4.
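The experiment behind Figure 1 can be sketched in a few lines of Python; this version simply prints the two estimates at a few sample sizes rather than plotting them, and uses a fixed random seed so the run is reproducible.

    # Sketch of the Figure 1 experiment: MLE vs. MAP estimates as data accumulates.
    # True theta = 0.3; the prior gamma1 = 18, gamma0 = 42 encodes the (correct) belief theta = 0.3.
    import random

    random.seed(0)
    theta_true, gamma1, gamma0 = 0.3, 18, 42
    alpha1 = alpha0 = 0

    for n in range(1, 1001):
        if random.random() < theta_true:   # one simulated coin flip
            alpha1 += 1
        else:
            alpha0 += 1
        if n in (10, 100, 1000):
            mle = alpha1 / (alpha1 + alpha0)
            map_ = (alpha1 + gamma1) / ((alpha1 + gamma1) + (alpha0 + gamma0))
            print(n, round(mle, 3), round(map_, 3))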

    2.1 Maximum Likelihood Estimation (MLE)

    Maximum Likelihood Estimation, often abbreviated MLE, estimates one or more

probability parameters θ based on the principle that if we observe training data D, we should choose the value of θ that makes D most probable. When applied to the coin flipping problem discussed above, it yields Algorithm 1. The definition

    of the MLE in general is

θ̂_MLE = argmax_θ P(D|θ)   (1)

    The intuition underlying this principle is simple: we are more likely to observe

    data D  if we are in a world where the appearance of this data is highly probable.

Therefore, we should estimate θ by assigning it whatever value maximizes the probability of having observed D.

Beginning with this principle for choosing among possible estimates of θ, it is possible to mathematically derive a formula for the value of θ that provably maximizes P(D|θ). Many machine learning algorithms are defined so that they provably learn a collection of parameter values that follow this maximum likeli-

    hood principle. Below we derive Algorithm 1 for our above coin flip example,

    beginning with the maximum likelihood principle.

    To precisely define our coin flipping example, let   X   be a random variable

which can take on either value 1 or 0, and let θ = P(X = 1) refer to the true, but possibly unknown, probability that a random draw of X will take on the value 1.²

    Assume we flip the coin  X  a number of times to produce training data  D, in which

we observe X = 1 a total of α1 times, and X = 0 a total of α0 times. We further assume that the outcomes of the flips are independent (i.e., the result of one coin

    flip has no influence on other coin flips), and identically distributed (i.e., the same

    value of  θ  governs each coin flip). Taken together, these assumptions are that the

coin flips are independent, identically distributed (which is often abbreviated to "i.i.d.").

²A random variable defined in this way is called a Bernoulli random variable, and the probability distribution it follows, defined by θ, is called a Bernoulli distribution.

The maximum likelihood principle involves choosing θ to maximize P(D|θ). Therefore, we must begin by writing an expression for P(D|θ), or equivalently

P(α1, α0|θ) in terms of θ, then find an algorithm that chooses a value for θ that maximizes this quantity. To begin, note that if data D consists of just one coin flip,

then P(D|θ) = θ if that one coin flip results in X = 1, and P(D|θ) = (1−θ) if the result is instead X = 0. Furthermore, if we observe a set of i.i.d. coin flips such as D = 1,1,0,1,0, then we can easily calculate P(D|θ) by multiplying together the probabilities of each individual coin flip:

P(D = 1,1,0,1,0|θ) = θ · θ · (1−θ) · θ · (1−θ) = θ^3 · (1−θ)^2

In other words, if we summarize D by the total number of observed times α1 when X = 1 and α0 when X = 0, we have in general

P(D = α1, α0|θ) = θ^α1 (1−θ)^α0   (2)

The above expression gives us a formula for P(D = α1, α0|θ). The quantity P(D|θ) is often called the likelihood function because it expresses the probability of the observed data D as a function of θ. This likelihood function is often written

L(θ) = P(D|θ).

Our final step in this derivation is to determine the value of θ that maximizes

the likelihood function P(D = α1, α0|θ). Notice that maximizing P(D|θ) with respect to θ is equivalent to maximizing its logarithm, ln P(D|θ) with respect to θ, because ln(x) increases monotonically with x:

argmax_θ P(D|θ) = argmax_θ ln P(D|θ)

It often simplifies the mathematics to maximize ln P(D|θ) rather than P(D|θ), as is the case in our current example. In fact, this log likelihood is so common that it

has its own notation, ℓ(θ) = ln P(D|θ).

To find the value of θ that maximizes ln P(D|θ), and therefore also maximizes

P(D|θ), we can calculate the derivative of ln P(D = α1, α0|θ) with respect to θ, then solve for the value of θ that makes this derivative equal to zero. First, we calculate the derivative of the log of the likelihood function of Eq. (2):

∂ℓ(θ)/∂θ = ∂ ln P(D|θ) / ∂θ

         = ∂ ln[θ^α1 (1−θ)^α0] / ∂θ

         = ∂ [α1 ln θ + α0 ln(1−θ)] / ∂θ

         = α1 ∂ ln θ / ∂θ + α0 ∂ ln(1−θ) / ∂θ

         = α1 ∂ ln θ / ∂θ + α0 (∂ ln(1−θ) / ∂(1−θ)) · (∂(1−θ) / ∂θ)

∂ℓ(θ)/∂θ = α1 (1/θ) + α0 (1/(1−θ)) · (−1)   (3)
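Setting this derivative to zero and solving for θ completes the derivation; this is the step the draft labels Eq. (4), which is referred to again below. The algebra, reconstructed here, is standard:

    \frac{\partial \ell(\theta)}{\partial \theta}
      = \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1-\theta} = 0
    \quad\Longrightarrow\quad
    \alpha_1 (1 - \theta) = \alpha_0 \, \theta
    \quad\Longrightarrow\quad
    \hat{\theta}_{MLE} = \frac{\alpha_1}{\alpha_1 + \alpha_0} \qquad (4)

which is exactly the estimate produced by Algorithm 1.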

2.2 Maximum a Posteriori Probability Estimation (MAP)

To produce a MAP estimate for θ we must specify a prior distribution P(θ) that summarizes our a priori assumptions about the value of θ. In the case where data is generated by multiple i.i.d. draws of a Bernoulli random variable, as in our

    coin flip example, the most common form of prior is a Beta distribution:

P(θ) = Beta(β0, β1) = θ^(β1−1) (1−θ)^(β0−1) / B(β0, β1)   (6)

Here β0 and β1 are parameters whose values we must specify in advance to define a specific P(θ). As we shall see, choosing values for β0 and β1 corresponds to choosing the number of imaginary examples γ0 and γ1 in the above Algorithm 2. The denominator B(β0, β1) is a normalization term defined by the function B, which assures the probability integrates to one, but which is independent of θ.

As defined in Eq. (5), the MAP estimate involves choosing the value of θ that maximizes P(D|θ)P(θ). Recall we already have an expression for P(D|θ) in Eq.

    (2). Combining this with the above expression for P(θ) we have:

θ̂_MAP = argmax_θ P(D|θ) P(θ)

       = argmax_θ θ^α1 (1−θ)^α0 · θ^(β1−1) (1−θ)^(β0−1) / B(β0, β1)

       = argmax_θ θ^(α1+β1−1) (1−θ)^(α0+β0−1) / B(β0, β1)

       = argmax_θ θ^(α1+β1−1) (1−θ)^(α0+β0−1)   (7)

where the final line follows from the previous line because B(β0, β1) is independent of θ.

How can we solve for the value of θ that maximizes the expression in Eq. (7)? Fortunately, we have already answered this question! Notice that the quantity we

    seek to maximize in Eq. (7) can be made identical to the likelihood function in Eq.

(2) if we substitute (α1 + β1 − 1) for α1 in Eq. (2), and substitute (α0 + β0 − 1) for α0. We can therefore reuse the derivation of θ̂_MLE beginning from Eq. (2) and

    ending with Eq. (4), simply by carrying through this substitution. Applying this

    same substitution to Eq. (4) implies the solution to Eq. (7) is therefore

θ̂_MAP = argmax_θ P(D|θ)P(θ) = (α1 + β1 − 1) / ((α1 + β1 − 1) + (α0 + β0 − 1))   (8)

Thus, we have derived in Eq. (8) the intuitive Algorithm 2 for estimating θ, starting from the principle that we want to choose the value of θ that maximizes P(θ|D). The number γ1 of imaginary "heads" in Algorithm 2 is equal to β1 − 1, and the number γ0 of imaginary "tails" is equal to β0 − 1. This same maximum a posteriori probability principle is used as the basis for deriving many machine

    learning algorithms for more complex problems where the solution is not so intu-

    itively obvious as it is in our coin flipping example.
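A small Python sketch of this correspondence: the MAP estimate of Eq. (8) computed directly from the Beta parameters, with the γ = β − 1 translation shown in the example call.

    # Sketch: MAP estimate of theta from a Beta prior plus observed counts (Eq. 8).

    def beta_map_estimate(alpha1, alpha0, beta1, beta0):
        """Mode of the Beta posterior over theta."""
        num = alpha1 + beta1 - 1
        den = num + (alpha0 + beta0 - 1)
        return num / den

    # A Beta(beta0 = 4, beta1 = 8) prior corresponds to gamma0 = 3, gamma1 = 7 imaginary flips,
    # so with 1 observed head and 2 observed tails this matches Algorithm 2's estimate of 8/13.
    print(beta_map_estimate(1, 2, 8, 4))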


    3 Notes on Terminology

A boolean-valued random variable X ∈ {0,1}, governed by the probability distribution P(X = 1) = θ; P(X = 0) = (1−θ), is called a Bernoulli random variable,

and this probability distribution is called a Bernoulli distribution. A convenient mathematical expression for a Bernoulli distribution P(X) is:

P(X = x) = θ^x · (1−θ)^(1−x)

The Beta(β0, β1) distribution defined in Eq. (6) is called the conjugate prior for the binomial likelihood function θ^α1 (1−θ)^α0, because the posterior distribution P(D|θ)P(θ) is also a Beta distribution. More generally, any P(θ) is called the conjugate prior for a likelihood function L(θ) = P(D|θ) if the posterior P(θ|D) is of the same form as P(θ).

    4 What You Should Know

    The main points of this chapter include:

•  Joint probability distributions lie at the core of probabilistic machine learning approaches. Given the joint probability distribution P(X1 . . . Xn) over a set of random variables, it is possible in principle to compute any joint or

    conditional probability defined over  any  subset of these variables.

•  Learning, or estimating, the joint probability distribution from training data can be easy if the data set is large compared to the number of distinct prob-

    ability terms we must estimate. But in many practical problems the data

    is more sparse, requiring methods that rely on prior knowledge or assump-

    tions, in addition to observed data.

•  Maximum likelihood estimation (MLE) is one of two widely used principles for estimating the parameters that define a probability distribution. This

principle is to choose the set of parameter values θ̂_MLE that makes the observed training data most probable (over all the possible choices of θ):

θ̂_MLE = argmax_θ P(data|θ)

    In many cases, maximum likelihood estimates correspond to the intuitive

    notion that we should base probability estimates on observed ratios. For

    example, given the problem of estimating the probability that a coin will

turn up heads, given α1 observed flips resulting in heads, and α0 observed flips resulting in tails, the maximum likelihood estimate corresponds exactly

    to taking the fraction of flips that turn up heads:

θ̂_MLE = argmax_θ P(data|θ) = α1 / (α1 + α0)


•  Maximum a posteriori probability (MAP) estimation is the other of the two widely used principles. This principle is to choose the most probable value

of θ, given the observed training data plus a prior probability distribution P(θ) which captures prior knowledge or assumptions about the value of θ:

θ̂_MAP = argmax_θ P(θ|data) = argmax_θ P(data|θ)P(θ)

    In many cases, MAP estimates correspond to the intuitive notion that we

    can represent prior assumptions by making up ”imaginary” data which re-

    flects these assumptions. For example, the MAP estimate for the above coin

flip example, assuming a prior P(θ) = Beta(γ0 + 1, γ1 + 1), yields a MAP estimate which is equivalent to the MLE estimate if we simply add in an

imaginary γ1 heads and γ0 tails to the actual observed α1 heads and α0 tails:

θ̂_MAP = argmax_θ P(data|θ)P(θ) = (α1 + γ1) / ((α1 + γ1) + (α0 + γ0))

    EXERCISES

1. In the MAP estimation of θ for our Bernoulli random variable X in this chapter, we used a Beta(β0, β1) prior probability distribution to capture our prior beliefs about the probability of different values of θ, before seeing the observed data.

    •   Plot this prior probability distribution over   θ, corresponding to thenumber of hallucinated examples used in the top left plot of Figure

1 (i.e., γ0 = 42, γ1 = 18). Specifically, create a plot showing the prior probability (vertical axis) for each possible value of θ between 0 and 1 (horizontal axis), as represented by the prior distribution Beta(β0, β1). Recall the correspondence βi = γi + 1. Note you will want to write a simple computer program to create this plot.

•   Above, you plotted the prior probability over possible values of θ. Now plot the posterior probability distribution over θ given that prior, plus observed data in which 6 heads (X = 1) were observed, along with 9 tails (X = 0).

    •  View the plot you created above to visually determine the approximate

Maximum a Posteriori probability estimate θ_MAP. What is it? What is the exact value of the MAP estimate? What is the exact value of the Maximum Likelihood Estimate θ_MLE?

    5 Acknowledgements

    I very much appreciate receiving helpful comments on earlier drafts of this chapter

    from Akshay Mishra.


    REFERENCES

    Mitchell, T (1997).  Machine Learning, McGraw Hill.

    Wasserman, L. (2004). All of Statistics, Springer-Verlag.


    CHAPTER 3

    GENERATIVE AND DISCRIMINATIVE

    CLASSIFIERS:

    NAIVE BAYES AND LOGISTIC REGRESSION

Machine Learning
Copyright © 2015. Tom M. Mitchell. All rights reserved.

    *DRAFT OF February 15, 2016*

    *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR’S PERMISSION*

    This is a rough draft chapter intended for inclusion in the upcoming second edi-

    tion of the textbook   Machine Learning, T.M. Mitchell, McGraw Hill. You are

    welcome to use this for educational purposes, but do not duplicate or repost it

    on the internet. For online copies of this and other materials related to this book,

visit the web site www.cs.cmu.edu/∼tom/mlbook.html. Please send suggestions for improvements, or suggested exercises, to

    [email protected].

    1 Learning Classifiers based on Bayes Rule

    Here we consider the relationship between supervised learning, or function ap-

    proximation problems, and Bayesian reasoning. We begin by considering how to

    design learning algorithms based on Bayes rule.

Consider a supervised learning problem in which we wish to approximate an unknown target function f : X → Y, or equivalently P(Y|X). To begin, we will assume Y is a boolean-valued random variable, and X is a vector containing n

boolean attributes. In other words, X = X1, X2, . . . , Xn, where Xi is the boolean random variable denoting the ith attribute of X.

    Applying Bayes rule, we see that P(Y  = yi| X ) can be represented as

P(Y = yi | X = xk) = P(X = xk | Y = yi) P(Y = yi) / ∑_j P(X = xk | Y = yj) P(Y = yj)


    where ym  denotes the  mth possible value for Y , xk  denotes the  k th possible vector

    value for X , and where the summation in the denominator is over all legal values

    of the random variable Y .

One way to learn P(Y|X) is to use the training data to estimate P(X|Y) and P(Y). We can then use these estimates, together with Bayes rule above, to determine P(Y | X = xk) for any new instance xk.

    A NOTE ON NOTATION: We will consistently use upper case symbols (e.g.,

     X ) to refer to random variables, including both vector and non-vector variables. If 

     X  is a vector, then we use subscripts (e.g., X i  to refer to each random variable, or

    feature, in X ). We use lower case symbols to refer to  values of random variables

(e.g., Xi = xij may refer to random variable Xi taking on its jth possible value). We will sometimes abbreviate by omitting variable names, for example abbreviating

P(Xi = xij | Y = yk) to P(xij | yk). We will write E[X] to refer to the expected value of X. We use superscripts to index training examples (e.g., Xi^j refers to the value of the random variable Xi in the jth training example). We use δ(x) to denote an "indicator" function whose value is 1 if its logical argument x is true, and

whose value is 0 otherwise. We use the #D{x} operator to denote the number of elements in the set D that satisfy property x. We use a "hat" to indicate estimates;

    for example,  θ̂ indicates an estimated value of  θ.

    1.1 Unbiased Learning of Bayes Classifiers is Impractical

If we are going to train a Bayes classifier by estimating P(X|Y) and P(Y), then it is reasonable to ask how much training data will be required to obtain reliable

estimates of these distributions. Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution P(X), then allowing a teacher to label this example with its Y value.

    A hundred independently drawn training examples will usually suffice to ob-

tain a maximum likelihood estimate of P(Y) that is within a few percent of its correct value¹ when Y is a boolean variable. However, accurately estimating P(X|Y) typically requires many more examples. To see why, consider the number of pa-

    rameters we must estimate when  Y   is boolean and   X   is a vector of   n   boolean

    attributes. In this case, we need to estimate a set of parameters

θij ≡ P(X = xi | Y = yj)

where the index i takes on 2^n possible values (one for each of the possible vector

    values of  X ), and   j takes on 2 possible values. Therefore, we will need to estimate

approximately 2^(n+1) parameters. To calculate the exact number of required param-

eters, note for any fixed j, the sum over i of θij must be one. Therefore, for any particular value yj, and the 2^n possible values of xi, we need compute only 2^n − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n − 1) such θij parameters. Unfortunately, this corresponds to two distinct parameters for each of the distinct instances in the instance space for X.

¹Why? See Chapter 5 of edition 1 of Machine Learning.

    Worse yet, to obtain reliable estimates of each of these parameters, we will need to

    observe each of these distinct instances multiple times! This is clearly unrealistic

    in most practical learning domains. For example, if  X   is a vector containing 30

boolean features, then we will need to estimate more than 2 billion parameters.

    2 Naive Bayes Algorithm

    Given the intractable sample complexity for learning Bayesian classifiers, we must

    look for ways to reduce this complexity. The Naive Bayes classifier does this

    by making a conditional independence assumption that dramatically reduces the

number of parameters to be estimated when modeling P(X|Y), from our original 2(2^n − 1) to just 2n.

    2.1 Conditional Independence

Definition: Given three sets of random variables X, Y and Z, we say X is conditionally independent of Y given Z, if and only if the proba-

    bility distribution governing X  is independent of the value of  Y   given

     Z ; that is

(∀ i, j, k)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

    As an example, consider three boolean random variables to describe the current

weather: Rain, Thunder and Lightning. We might reasonably assert that Thunder

    is independent of   Rain   given   Lightning. Because we know   Lightning   causes

    Thunder , once we know whether or not there is   Lightning, no additional infor-

    mation about  Thunder   is provided by the value of   Rain. Of course there is a

    clear dependence of  Thunder  on  Rain  in general, but there is no  conditional de-

    pendence once we know the value of   Lightning. Although X ,  Y   and  Z  are each

    single random variables in this example, more generally the definition applies to

sets of random variables. For example, we might assert that variables {A, B} are conditionally independent of {C, D} given variables {E, F}.

    2.2 Derivation of Naive Bayes Algorithm

    The Naive Bayes algorithm is a classification algorithm based on Bayes rule and a

set of conditional independence assumptions. Given the goal of learning P(Y|X) where X = X1 . . . , Xn, the Naive Bayes algorithm makes the assumption that each Xi is conditionally independent of each of the other Xk's given Y, and also

    independent of each subset of the other X k ’s given Y .

    The value of this assumption is that it dramatically simplifies the representa-

tion of P(X|Y), and the problem of estimating it from the training data. Consider, for example, the case where X = X1, X2. In this case


P(X|Y) = P(X1, X2|Y)
        = P(X1 | X2, Y) P(X2|Y)
        = P(X1|Y) P(X2|Y)

    Where the second line follows from a general property of probabilities, and the

    third line follows directly from our above definition of conditional independence.

    More generally, when  X   contains n  attributes which satisfy the conditional inde-

    pendence assumption, we have

P(X1 . . . Xn|Y) = ∏_{i=1}^{n} P(Xi|Y)   (1)

Notice that when Y and the Xi are boolean variables, we need only 2n parameters to define P(Xi = xik | Y = yj) for the necessary i, j, k. This is a dramatic reduction compared to the 2(2^n − 1) parameters needed to characterize P(X|Y) if we make no conditional independence assumption.
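A quick arithmetic check of this reduction, in Python, for the 30-feature example mentioned earlier in this chapter:

    # Parameters needed with vs. without the Naive Bayes conditional independence assumption.
    n = 30
    full_joint = 2 * (2**n - 1)   # modeling P(X|Y) directly
    naive_bayes = 2 * n           # one Bernoulli parameter per feature and class
    print(full_joint, naive_bayes)   # 2147483646 vs. 60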

    Let us now derive the Naive Bayes algorithm, assuming in general that  Y   is

any discrete-valued variable, and the attributes X1 . . . Xn are any discrete or real-valued attributes. Our goal is to train a classifier that will output the probability

    distribution over possible values of  Y , for each new instance  X  that we ask it to

    classify. The expression for the probability that Y  will take on its  k th possible

    value, according to Bayes rule, is

P(Y = yk | X1 . . . Xn) = P(Y = yk) P(X1 . . . Xn | Y = yk) / ∑_j P(Y = yj) P(X1 . . . Xn | Y = yj)

where the sum is taken over all possible values yj of Y. Now, assuming the Xi are

    conditionally independent given Y , we can use equation (1) to rewrite this as

P(Y = yk | X1 . . . Xn) = P(Y = yk) ∏_i P(Xi | Y = yk) / ∑_j P(Y = yj) ∏_i P(Xi | Y = yj)   (2)

    Equation (2) is the fundamental equation for the Naive Bayes classifier. Given a

new instance Xnew = X1 . . . Xn, this equation shows how to calculate the probability that Y will take on any given value, given the observed attribute values

of Xnew and given the distributions P(Y) and P(Xi|Y) estimated from the training

data. If we are interested only in the most probable value of Y, then we have the Naive Bayes classification rule:

Y ← argmax_{yk} P(Y = yk) ∏_i P(Xi | Y = yk) / ∑_j P(Y = yj) ∏_i P(Xi | Y = yj)

    which simplifies to the following (because the denominator does not depend on

     yk ).

Y ← argmax_{yk} P(Y = yk) ∏_i P(Xi | Y = yk)   (3)


    2.3 Naive Bayes for Discrete-Valued Inputs

    To summarize, let us precisely define the Naive Bayes learning algorithm by de-

    scribing the parameters that must be estimated, and how we may estimate them.

    When the   n   input attributes   X i   each take on   J   possible discrete values, and

    Y   is a discrete variable taking on  K  possible values, then our learning task is to

    estimate two sets of parameters. The first is

θijk ≡ P(Xi = xij | Y = yk)   (4)

for each input attribute Xi, each of its possible values xij, and each of the possible

values yk of Y. Note there will be nJK such parameters, and note also that only

n(J − 1)K of these are independent, given that they must satisfy 1 = ∑_j θijk for each pair of i, k values.

    In addition, we must estimate parameters that define the prior probability over

    Y :

πk ≡ P(Y = yk)   (5)

Note there are K of these parameters, (K − 1) of which are independent.

    We can estimate these parameters using either maximum likelihood estimates

    (based on calculating the relative frequencies of the different events in the data),

    or using Bayesian MAP estimates (augmenting this observed data with prior dis-

    tributions over the values of these parameters).

Maximum likelihood estimates for θijk given a set of training examples D are given by

θ̂ijk = P̂(Xi = xij | Y = yk) = #D{Xi = xij ∧ Y = yk} / #D{Y = yk}   (6)

where the #D{x} operator returns the number of elements in the set D that satisfy property x.

    One danger of this maximum likelihood estimate is that it can sometimes re-

sult in θ estimates of zero, if the data does not happen to contain any training examples satisfying the condition in the numerator. To avoid this, it is common to

    use a “smoothed” estimate which effectively adds in a number of additional “hal-

    lucinated” examples, and which assumes these hallucinated examples are spread

    evenly over the possible values of  X i. This smoothed estimate is given by

θ̂ijk = P̂(Xi = xij | Y = yk) = (#D{Xi = xij ∧ Y = yk} + l) / (#D{Y = yk} + lJ)   (7)

    where   J   is the number of distinct values   X i   can take on, and   l   determines the

    strength of this smoothing (i.e., the number of hallucinated examples is  l J ). This

expression corresponds to a MAP estimate for θijk if we assume a Dirichlet prior distribution over the θijk parameters, with equal-valued parameters. If l is set to 1, this approach is called Laplace smoothing.

    Maximum likelihood estimates for πk  are

π̂k = P̂(Y = yk) = #D{Y = yk} / |D|   (8)


where |D| denotes the number of elements in the training set D.

Alternatively, we can obtain a smoothed estimate, or equivalently a MAP es-

timate based on a Dirichlet prior over the πk parameters assuming equal priors on each πk, by using the following expression

π̂k = P̂(Y = yk) = (#D{Y = yk} + l) / (|D| + lK)   (9)

    where K  is the number of distinct values Y  can take on, and l  again determines the

    strength of the prior assumptions relative to the observed data  D.
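The following Python sketch puts the pieces of this section together: smoothed MAP estimates of the θijk and πk parameters (Eqs. (7) and (9) with l = 1, i.e. Laplace smoothing) and the classification rule of Eq. (3). The five-example training set is hypothetical, used only to make the code runnable.

    # Minimal sketch of discrete Naive Bayes: smoothed estimates plus the rule of Eq. (3).
    from collections import Counter, defaultdict

    # Each example: (tuple of discrete feature values x1..xn, class label y)
    data = [((1, 0), 1), ((1, 1), 1), ((0, 0), 0), ((0, 1), 0), ((1, 0), 0)]
    values = {0, 1}            # possible values of each Xi (J = 2)
    classes = {0, 1}           # possible values of Y (K = 2)
    l = 1                      # smoothing strength

    class_counts = Counter(y for _, y in data)
    feat_counts = defaultdict(int)                 # counts of (i, xi, y)
    for x, y in data:
        for i, xi in enumerate(x):
            feat_counts[(i, xi, y)] += 1

    def pi(y):                                     # smoothed prior, Eq. (9)
        return (class_counts[y] + l) / (len(data) + l * len(classes))

    def theta(i, xi, y):                           # smoothed likelihood, Eq. (7)
        return (feat_counts[(i, xi, y)] + l) / (class_counts[y] + l * len(values))

    def predict(x):                                # Eq. (3): argmax_y pi(y) * prod_i theta(i, xi, y)
        def score(y):
            s = pi(y)
            for i, xi in enumerate(x):
                s *= theta(i, xi, y)
            return s
        return max(classes, key=score)

    print(predict((1, 0)))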

    2.4 Naive Bayes for Continuous Inputs

    In the case of continuous inputs  X i, we can of course continue to use equations

    (2) and (3) as the basis for designing a Naive Bayes classifier. However, when the

     X i are continuous we must choose some other way to represent the distributions

P(Xi|Y). One common approach is to assume that for each possible discrete value yk of Y, the distribution of each continuous Xi is Gaussian, and is defined by a

    mean and standard deviation specific to  X i  and  yk . In order to train such a Naive

    Bayes classifier we must therefore estimate the mean and standard deviation of 

    each of these Gaussians:

µik = E[Xi | Y = yk]   (10)

σ²ik = E[(Xi − µik)² | Y = yk]   (11)

    for each attribute  X i  and each possible value  yk  of  Y . Note there are 2nK  of these

    parameters, all of which must be estimated independently.

    Of course we must also estimate the priors on Y  as well

    πk  = P(Y  = yk )   (12)

    The above model summarizes a Gaussian Naive Bayes classifier, which as-

    sumes that the data  X  is generated by a mixture of class-conditional (i.e., depen-

    dent on the value of the class variable Y ) Gaussians. Furthermore, the Naive Bayes

    assumption introduces the additional constraint that the attribute values X i are in-

    dependent of one another within each of these mixture components. In particular

    problem settings where we have additional information, we might introduce addi-

    tional assumptions to further restrict the number of parameters or the complexity

of estimating them. For example, if we have reason to believe that noise in the observed Xi comes from a common source, then we might further assume that all

    of the σik  are identical, regardless of the attribute  i  or class k  (see the homework exercise on this issue).

    Again, we can use either maximum likelihood estimates (MLE) or maximum

    a posteriori (MAP) estimates for these parameters. The maximum likelihood esti-

    mator for µik  is

µ̂ik = (1 / ∑_j δ(Y^j = yk)) ∑_j Xi^j δ(Y^j = yk)   (13)


where the superscript j refers to the jth training example, and where δ(Y = yk) is 1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training examples for which Y = yk.

The maximum likelihood estimator for σ²ik is

σ̂²ik = (1 / ∑_j δ(Y^j = yk)) ∑_j (Xi^j − µ̂ik)² δ(Y^j = yk)   (14)

    This maximum likelihood estimator is biased, so the minimum variance unbi-

    ased estimator (MVUE) is sometimes used instead. It is

σ̂²ik = (1 / ((∑_j δ(Y^j = yk)) − 1)) ∑_j (Xi^j − µ̂ik)² δ(Y^j = yk)   (15)
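A minimal Gaussian Naive Bayes sketch in Python for a single continuous attribute, using the MLE estimators of Eqs. (13) and (14) and Bayes rule (Eq. (2)); the toy data values are hypothetical.

    # Sketch of Gaussian Naive Bayes with one continuous attribute.
    import math

    data = [(1.0, 0), (1.2, 0), (0.8, 0), (3.1, 1), (2.9, 1), (3.3, 1)]   # (x, y) pairs

    def estimates(k):
        xs = [x for x, y in data if y == k]
        mu = sum(xs) / len(xs)                              # Eq. (13)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)      # biased MLE variance, Eq. (14)
        prior = len(xs) / len(data)
        return mu, var, prior

    def gaussian(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    def posterior(x):                                       # Bayes rule, Eq. (2)
        scores = {}
        for k in (0, 1):
            mu, var, prior = estimates(k)
            scores[k] = prior * gaussian(x, mu, var)
        z = sum(scores.values())
        return {k: s / z for k, s in scores.items()}

    print(posterior(2.5))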

3 Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = X1 . . . Xn is any vector containing discrete or continuous variables. In this section we will primarily con-

    sider the case where Y  is a boolean variable, in order to simplify notation. In the

    final subsection we extend our treatment to the case where  Y  takes on any finite

    number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric

    model assumed by Logistic Regression in the case where Y  is boolean is:

P(Y = 1|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))   (16)

    and

P(Y = 0|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))   (17)

    Notice that equation (17) follows directly from equation (16), because the sum of 

    these two probabilities must equal 1.
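A small Python sketch of Eqs. (16) and (17); the weight values in the example call are hypothetical. Note that under this parameterization P(Y = 1|X) decreases as w0 + ∑ wi Xi grows (the chapter flips the coding of Y in Eqs. (24) and (25) below).

    # Sketch of the parametric form in Eqs. (16) and (17); the weights are hypothetical.
    import math

    def p_y_given_x(x, w0, w):
        """Return (P(Y=1|X), P(Y=0|X)) for feature vector x and weights w0, w."""
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(z))
        return p1, 1.0 - p1

    print(p_y_given_x([2.0, -1.0], w0=0.5, w=[1.0, 0.3]))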

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally

want to assign the value yk that maximizes P(Y = yk|X). Put another way, we assign the label Y = 0 if the following condition holds:

1 < P(Y = 0|X) / P(Y = 1|X)

    substituting from equations (16) and (17), this becomes

Figure 1: Form of the logistic function. In Logistic Regression, P(Y|X) is assumed to follow this form.

1 < exp(w0 + ∑_{i=1}^{n} wi Xi)

    and taking the natural log of both sides we have a linear classification rule that

assigns label Y = 0 if X satisfies

0 < w0 + ∑_{i=1}^{n} wi Xi


    value of  X  in the  l th training example. The expression to the right of the argmax

    is the conditional data likelihood. Here we include W  in the conditional, to em-

    phasize that the expression is a function of the W  we are attempting to maximize.

    Equivalently, we can work with the log of the conditional likelihood:

W ← argmax_W ∑_l ln P(Y^l | X^l, W)

This conditional data log likelihood, which we will denote l(W), can be written as

l(W) = ∑_l Y^l ln P(Y^l = 1|X^l, W) + (1 − Y^l) ln P(Y^l = 0|X^l, W)

    Note here we are utilizing the fact that Y  can take only values 0 or 1, so only one

    of the two terms in the expression will be non-zero for any given Y l .

    To keep our derivation consistent with common usage, we will in this section

    flip the assignment of the boolean variable Y  so that we assign

P(Y = 0|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))   (24)

    and

P(Y = 1|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))   (25)

    In this case, we can reexpress the log of the conditional likelihood as:

l(W) = ∑_l Y^l ln P(Y^l = 1|X^l, W) + (1 − Y^l) ln P(Y^l = 0|X^l, W)

     = ∑_l Y^l ln [ P(Y^l = 1|X^l, W) / P(Y^l = 0|X^l, W) ] + ln P(Y^l = 0|X^l, W)

     = ∑_l Y^l (w0 + ∑_{i=1}^{n} wi Xi^l) − ln(1 + exp(w0 + ∑_{i=1}^{n} wi Xi^l))

where Xi^l denotes the value of Xi for the lth training example. Note the superscript

l is not related to the log likelihood function l(W).

Unfortunately, there is no closed form solution to maximizing l(W) with re-

    spect to W . Therefore, one common approach is to use gradient ascent, in which

    we work with the gradient, which is the vector of partial derivatives. The   ith

    component of the vector gradient has the form

∂l(W)/∂wi = ∑_l Xi^l (Y^l − P̂(Y^l = 1|X^l, W))

where P̂(Y^l|X^l, W) is the Logistic Regression prediction using equations (24) and (25) and the weights W. To accommodate weight w0, we assume an illusory

X0 = 1 for all l. This expression for the derivative has an intuitive interpretation: the term inside the parentheses is simply the prediction error; that is, the difference


between the observed Y^l and its predicted probability! Note if Y^l = 1 then we wish for P̂(Y^l = 1|X^l, W) to be 1, whereas if Y^l = 0 then we prefer that P̂(Y^l = 1|X^l, W) be 0 (which makes P̂(Y^l = 0|X^l, W) equal to 1). This error term is multiplied by the value of Xi^l, which accounts for the magnitude of the wi Xi^l term in making this

prediction.

Given this formula for the derivative of each wi, we can use standard gradient

    ascent to optimize the weights  W . Beginning with initial weights of zero, we

    repeatedly update the weights in the direction of the gradient, on each iteration

    changing every weight wi according to

wi ← wi + η ∑_l Xi^l (Y^l − P̂(Y^l = 1|X^l, W))

where η is a small constant (e.g., 0.01) which determines the step size. Because the conditional log likelihood l(W) is a concave function in W, this gradient ascent

procedure will converge to a global maximum. Gradient ascent is described in greater detail, for example, in Chapter 4 of Mitchell (1997). In many cases where

    computational efficiency is important it is common to use a variant of gradient

    ascent called conjugate gradient ascent, which often converges more quickly.
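A minimal Python sketch of this training loop, using the Y coding of Eqs. (24) and (25); the four training examples and the number of iterations are hypothetical choices.

    # Sketch of gradient ascent on the conditional log likelihood, using the update rule above.
    # With the coding of Eqs. (24)-(25): P(Y=1|X) = exp(z)/(1+exp(z)), z = w0 + sum_i wi*xi.
    import math

    data = [([0.5, 1.0], 1), ([1.5, -0.5], 0), ([2.0, 2.0], 1), ([0.2, -1.0], 0)]
    w = [0.0, 0.0, 0.0]          # w[0] is w0; we assume an illusory X0 = 1
    eta = 0.01

    def p1(x):
        z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        return math.exp(z) / (1.0 + math.exp(z))

    for _ in range(1000):        # repeated full-batch updates
        grad = [0.0] * len(w)
        for x, y in data:
            err = y - p1(x)                      # prediction error
            grad[0] += err                       # X0 = 1 term
            for i, xi in enumerate(x, start=1):
                grad[i] += xi * err
        w = [wi + eta * gi for wi, gi in zip(w, grad)]

    print([round(wi, 3) for wi in w])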

    3.3 Regularization in Logistic Regression

    Overfitting the training data is a problem that can arise in Logistic Regression,

    especially when data is very high dimensional and training data is sparse. One

    approach to reducing overfitting is   regularization, in which we create a modified

    “penalized log likelihood function,” which penalizes large values of  W . One ap-

    proach is to use the penalized log likelihood function

W ← argmax_W ∑_l ln P(Y^l|X^l, W) − (λ/2) ||W||²

which adds a penalty proportional to the squared magnitude of W. Here λ is a constant that determines the strength of this penalty term.

    Modifying our objective by adding in this penalty term gives us a new objec-

    tive to maximize. It is easy to show that maximizing it corresponds to calculating

the MAP estimate for W under the assumption that the prior distribution P(W) is a Normal distribution with mean zero, and a variance related to 1/λ. Notice that

    in general, the MAP estimate for W  involves optimizing the objective

∑_l ln P(Y^l|X^l, W) + ln P(W)

and if P(W) is a zero mean Gaussian distribution, then ln P(W) yields a term proportional to ||W||².
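Because the derivative of −(λ/2)||W||² with respect to wi is −λ wi (a standard result, stated here as an assumption since the draft's own derivation is not reproduced in this copy), each gradient ascent update simply gains a shrinkage term, as in this small sketch:

    # Sketch: gradient ascent update with the L2 penalty added to the objective.
    # The unpenalized gradient component is gradient_i = sum_l Xi_l * (Y_l - P(Y_l = 1 | X_l, W));
    # the penalty contributes an extra -lam * wi.
    def regularized_update(wi, gradient_i, eta, lam):
        return wi + eta * (gradient_i - lam * wi)

    print(regularized_update(0.5, 1.2, eta=0.01, lam=0.1))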

    Given this penalized log likelihood function, it is easy to rederive the gradient

    descent rule. The derivative of this penalized log likelihood function is similar to


    4 Relationship Between Naive Bayes Classifiers and

    Logistic Regression

To summarize, Logistic Regression directly estimates the parameters of P(Y|X), whereas Naive Bayes directly estimates parameters for P(Y) and P(X|Y). We often call the former a discriminative classifier, and the latter a generative classifier.

    We showed above that the assumptions of one variant of a Gaussian Naive

Bayes classifier imply the parametric form of P(Y|X) used in Logistic Regression. Furthermore, we showed that the parameters wi in Logistic Regression can

    be expressed in terms of the Gaussian Naive Bayes parameters. In fact, if the GNB

    assumptions hold, then asymptotically (as the number of training examples grows

    toward infinity) the GNB and Logistic Regression converge toward identical clas-

    sifiers.

    The two algorithms also differ in interesting ways:

•   When the GNB modeling assumptions do not hold, Logistic Regression and GNB typically learn different classifier functions. In this case, the asymp-

    totic (as the number of training examples approach infinity) classification

    accuracy for Logistic Regression is often better than the asymptotic accu-

    racy of GNB. Although Logistic Regression is consistent with the Naive

    Bayes assumption that the input features  X i   are conditionally independent

    given  Y , it is not rigidly tied to this assumption as is Naive Bayes. Given

    data that disobeys this assumption, the conditional likelihood maximization

    algorithm for Logistic Regression will adjust its parameters to maximize the

    fit to (the conditional likelihood of) the data, even if the resulting parameters

    are inconsistent with the Naive Bayes parameter estimates.

•  GNB and Logistic Regression converge toward their asymptotic accuracies at different rates. As Ng & Jordan (2002) show, GNB parameter estimates

    converge toward their asymptotic values in order log n   examples, where  n

    is the dimension of  X . In contrast, Logistic Regression parameter estimates

    converge more slowly, requiring order n  examples. The authors also show

    that in several data sets Logistic Regression outperforms GNB when many

    training examples are available, but GNB outperforms Logistic Regression

    when training data is scarce.

    5 What You Should Know

    The main points of this chapter include:

•  We can use Bayes rule as the basis for designing learning algorithms (function approximators), as follows: Given that we wish to learn some target

function f : X → Y, or equivalently, P(Y|X), we use the training data to learn estimates of P(X|Y) and P(Y). New X examples can then be classified using these estimated probability distributions, plus Bayes rule. This


    type of classifier is called a  generative  classifier, because we can view the

distribution P(X|Y) as describing how to generate random instances X conditioned on the target attribute Y.

•  Learning Bayes classifiers typically requires an unrealistic number of training examples (i.e., more than |X| training examples where X is the instance space) unless some form of prior assumption is made about the form of

P(X|Y). The Naive Bayes classifier assumes all attributes describing X are conditionally independent given Y. This assumption dramatically re-

    duces the number of parameters that must be estimated to learn the classi-

    fier. Naive Bayes is a widely used learning algorithm, for both discrete and

    continuous X .

•   When X is a vector of discrete-valued attributes, Naive Bayes learning algorithms can be viewed as linear classifiers; that is, every such Naive Bayes

classifier corresponds to a hyperplane decision surface in X. The same statement holds for Gaussian Naive Bayes classifiers if the variance of each fea-

    ture is assumed to be independent of the class (i.e., if  σik  = σi).

•   Logistic Regression is a function approximation algorithm that uses training data to directly estimate P(Y|X), in contrast to Naive Bayes. In this sense, Logistic Regression is often referred to as a discriminative classifier because

    we can view the distribution  P(Y | X ) as directly discriminating the value of the target value Y  for any given instance  X .

•   Logistic Regression is a linear classifier over X. The linear classifiers produced by Logistic Regression and Gaussian Naive Bayes are identical in

    the limit as the number of training examples approaches infinity,   provided 

    the Naive Bayes assumptions hold. However, if these assumptions do not

    hold, the Naive Bayes bias will cause it to perform less accurately than Lo-

    gistic Regression, in the limit. Put another way, Naive Bayes is a learning

    algorithm with greater bias, but lower variance, than Logistic Regression. If 

    this bias is appropriate given the actual data, Naive Bayes will be preferred.

    Otherwise, Logistic Regression will be preferred.

•  We can view function approximation learning algorithms as statistical estimators of functions, or of conditional distributions P(Y|X). They estimate P(Y|X) from a sample of training data. As with other statistical estima-

    tors, it can be useful to characterize learning algorithms by their bias and

    expected variance, taken over different samples of training data.

    6 Further Reading

    Wasserman (2004) describes a Reweighted Least Squares method for Logistic

    Regression. Ng and Jordan (2002) provide a theoretical and experimental com-

    parison of the Naive Bayes classifier and Logistic Regression.


    EXERCISES

    1. At the beginning of the chapter we remarked that “A hundred training ex-

    amples will usually suffice to obtain an estimate of  P(Y )  that is within a

few percent of the correct value.” Describe conditions under which the 95% confidence interval for our estimate of P(Y) will be ±0.02.

2. Consider learning a function X → Y where Y is boolean, where X = X1, X2, and where X1 is a boolean variable and X2 a continuous variable. State the

    parameters that must be estimated to define a Naive Bayes classifier in this

case. Give the formula for computing P(Y|X), in terms of these parameters and the feature values X1 and X2.

3. In section 3 we showed that when Y is Boolean and X = X1 . . . Xn is a vector of continuous variables, then the assumptions of the Gaussian Naive

    Bayes classifier imply that   P(Y | X )   is given by the logistic function with

    appropriate parameters W . In particular:

P(Y = 1|X) = 1 / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))

    and

P(Y = 0|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / (1 + exp(w0 + ∑_{i=1}^{n} wi Xi))

Consider instead the case where Y is Boolean and X = X1 . . . Xn is a vector of Boolean variables. Prove for this case also that P(Y|X) follows this same form (and hence that Logistic Regression is also the discriminative

    counterpart to a Naive Bayes generative classifier over Boolean features).

     Hints:

•   Simple notation will help. Since the Xi are Boolean variables, you need only one parameter to define P(Xi|Y = yk). Define θi1 ≡ P(Xi = 1|Y = 1), in which case P(Xi = 0|Y = 1) = (1 − θi1). Similarly, use θi0 to denote P(Xi = 1|Y = 0).

•  Notice with the above notation you can represent P(Xi|Y = 1) as follows

P(Xi | Y = 1) = θi1^Xi (1 − θi1)^(1 − Xi)

Note when Xi = 1 the second term is equal to 1 because its exponent is zero. Similarly, when Xi = 0 the first term is equal to 1 because its exponent is zero.

    4. (based on a suggestion from Sandra Zilles). This question asks you to con-

    sider the relationship between the MAP hypothesis and the Bayes optimal

    hypothesis. Consider a hypothesis space  H  defined over the set of instances

     X , and containing just two hypotheses, h1 and h2 with equal prior probabil-

    ities P(h1) = P(h2) = 0.5. Suppose we are given an arbitrary set of training


data D which we use to calculate the posterior probabilities P(h1|D) and P(h2|D). Based on this we choose the MAP hypothesis, and calculate the Bayes optimal hypothesis. Suppose we find that the Bayes optimal classi-

    fier is not equal to either  h1 or to  h2, which is generally the case because

the Bayes optimal hypothesis corresponds to “averaging over” all hypotheses in H. Now we create a new hypothesis h3 which is equal to the Bayes

    optimal classifier with respect to  H , X   and  D; that is, h3 classifies each in-

    stance in  X  exactly the same as the Bayes optimal classifier for  H   and  D.

We now create a new hypothesis space H′ = {h1, h2, h3}. If we train using the same training data, D, will the MAP hypothesis from H′ be h3? Will the Bayes optimal classifier with respect to H′ be equivalent to h3? (Hint: the answer depends on the priors we assign to the hypotheses in H′. Can you give constraints on these priors that assure the answers will be yes or no?)

    7 Acknowledgements

    I very much appreciate receiving helpful comments on earlier drafts of this chapter

    from the following: Nathaniel Fairfield, Rainer Gemulla, Vineet Kumar, Andrew

    McCallum, Anand Prahlad, Wei Wang, Geoff Webb, and Sandra Zilles.

    REFERENCES

    Mitchell, T (1997).  Machine Learning, McGraw Hill.

Ng, A. Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A comparison of Logistic Regression and Naive Bayes, Neural Information Processing Systems.

    Wasserman, L. (2004). All of Statistics, Springer-Verlag.