
    Introduction to Bayesian Analysis

    Timothy Cogley, NYU, Fall 2012

    September 12, 2012


    2 Classical v. Bayesian Perspectives

    Common elements

    $y$ = the observed data

    $p(y)$ = true distribution

    $f(y \mid \theta)$ = likelihood function for a parametric model

    $\theta$ = parameters of the model


    2.1 Classical perspective

    There exists a true, fixed value of $\theta$ about which we want to learn.

    We form an estimate $\hat{\theta}$, e.g. by MLE.

    We characterize the properties of $\hat{\theta}$ by imagining how it would behave across repeated samples.

    Because $\hat{\theta}$ varies across samples, we treat it as a random variable.

    The distribution of $\hat{\theta}$ across samples summarizes what we know about $\theta$.

    Want estimators or test procedures that do well on average.

    But the concept of repeated samples is often just a convenient fiction; usually there is only one sample.

    Reliance on this concept gives rise to the term frequentist.


    2.2 Bayesian perspective

    Want to characterize subjective beliefs about $\theta$.

    Posit that $\theta$ is a random variable.

    Distributions over $\theta$ summarize subjective beliefs.

    If $\theta$ is a r.v., what is $\hat{\theta}$?

    a deterministic function of the sample

    e.g., for OLS, $\hat{\beta} = (X'X)^{-1}X'y$ is a deterministic function of the data

    Only one $\hat{\theta}$ can emerge in a given sample.

    Conditional on the sample, there is no uncertainty about $\hat{\theta}$.


    Bayes theorem tells us how optimally to update beliefs after seeing data. I.e., it is about the passage from prior to posterior beliefs.

    Before looking at data:

    $p(\theta)$ = prior beliefs about $\theta$

    After looking at data:

    $p(\theta \mid y)$ = posterior beliefs about $\theta$

    These judgments are mediated by a model of how $y$ and $\theta$ are related. A model is a probability distribution over outcomes, $f(y \mid \theta)$.


    Another way to distinguish Bayesian and classical perspectives is in terms of conditioning.

    Bayes: inference is conditional on the sample at hand.

    Classical: inference depends on what could happen in other, hypothetical samples that could be drawn from the same mechanism.

    My view: although there are interesting philosophical differences, I regard them as alternative tools. Sometimes one is more convenient, sometimes another.

    One reason why I like Bayesian methods: they connect well with Bayesian decision theory. For policy-oriented macroeconomists, this is a big advantage.


    3 Bayes Theorem

    Can factor a joint density into the product of a conditional and a marginal

    $f(y, \theta) = f(y \mid \theta)\,p(\theta) = p(\theta \mid y)\,f(y)$

    Can also marginalize a joint density by integrating with respect to the other argument

    $f(y) = \int f(y, \theta)\, d\theta$

    Bayes theorem follows from these two facts

    Start with the joint density $f(y, \theta)$

    Can factor in two ways

    $f(y, \theta) = f(y \mid \theta)\,p(\theta) = p(\theta \mid y)\,f(y)$


    Divide through by $f(y)$:

    $p(\theta \mid y) = \dfrac{f(y \mid \theta)\,p(\theta)}{f(y)}$

    The marginal $f(y)$ can also be expressed as

    $f(y) = \int f(y, \theta)\, d\theta = \int f(y \mid \theta)\,p(\theta)\, d\theta$

    Hence the posterior is

    $p(\theta \mid y) = \dfrac{f(y \mid \theta)\,p(\theta)}{\int f(y \mid \theta)\,p(\theta)\, d\theta} \propto f(y \mid \theta)\,p(\theta)$


    Terminology:

    $k(\theta, y) = f(y \mid \theta)\,p(\theta)$ is known as the posterior kernel

    $f(y) = \int f(y \mid \theta)\,p(\theta)\, d\theta$ is known as the normalizing constant or marginal likelihood

    $p(\theta \mid y)$ is known as the posterior density.


    Remarks:

    Notice that $f(y)$ does not depend on $\theta$. All the information about $\theta$ is contained in the kernel.

    The posterior kernel does not necessarily integrate to 1, so it is not a proper probability density.

    When we divide the kernel by the normalizing constant, we ensure that the result does integrate to 1.

    Important to verify that $f(y)$ exists

    it does when priors are proper, i.e. when $p(\theta)$ integrates to 1

    problems sometimes arise when $p(\theta)$ is improper
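    To make the kernel/normalizing-constant distinction concrete, here is a minimal numerical sketch (not from the lecture; the model, prior, and all names below are illustrative). It evaluates a posterior kernel for a univariate normal mean on a grid and divides by a grid approximation of the normalizing constant so that the result integrates to 1.

```python
# Minimal sketch: normalize a posterior kernel on a grid.
# Illustrative model: y_t ~ N(theta, 1) iid, prior theta ~ N(0, 4).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=20)            # fabricated data

theta = np.linspace(-4.0, 6.0, 2001)                   # grid over the parameter
log_like = -0.5 * ((y[:, None] - theta[None, :]) ** 2).sum(axis=0)
log_prior = -0.5 * theta ** 2 / 4.0
log_kernel = log_like + log_prior                      # log f(y|theta) + log p(theta), up to constants

kernel = np.exp(log_kernel - log_kernel.max())         # subtract the max before exponentiating (stability)
norm_const = np.trapz(kernel, theta)                   # grid approximation of the normalizing constant
posterior = kernel / norm_const                        # proper density on the grid

print(np.trapz(posterior, theta))                      # ~1.0: the posterior now integrates to 1
print(theta[np.argmax(posterior)])                     # posterior mode
```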


    4 Examples

    4.1 Mean of iid normal random vector with known variance

    Suppose $y_t \sim N(\mu, \Sigma)$, iid, with $\Sigma$ known but $\mu$ unknown.

    Prior: $p(\mu) = N(\mu_0, \Lambda_0)$. (Interpret the prior precision $\Lambda_0^{-1}$ for the univariate case.)

    $p(\mu \mid \mu_0, \Lambda_0) \propto \exp[-(1/2)(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0)]$

    Likelihood:

    $f(y \mid \mu, \Sigma) \propto \exp[-(1/2)\textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)]$

    Posterior Kernel: multiply the prior and likelihood


    $p(\mu \mid y, \mu_0, \Lambda_0) \propto f(y \mid \mu, \Sigma)\,p(\mu \mid \mu_0, \Lambda_0)$

    $= \exp\{-(1/2)[(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) + \textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)]\} \qquad (1)$

    Notice that this is a quadratic within an exponential, making it a Gaussian kernel.

    It follows that we can express the posterior kernel as

    $p(\mu \mid y, \mu_0, \Lambda_0) \propto \exp[-(1/2)(\mu - \mu_1)'\Lambda_1^{-1}(\mu - \mu_1)] \qquad (2)$

    To find expressions for $\mu_1, \Lambda_1$, expand (1) and equate powers with (2).

    $(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) = \mu'\Lambda_0^{-1}\mu + \mu_0'\Lambda_0^{-1}\mu_0 - \mu'\Lambda_0^{-1}\mu_0 - \mu_0'\Lambda_0^{-1}\mu$


    $\textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu) = \textstyle\sum_t [\mu'\Sigma^{-1}\mu + y_t'\Sigma^{-1}y_t - \mu'\Sigma^{-1}y_t - y_t'\Sigma^{-1}\mu]$

    Notice that

    $\textstyle\sum_t \mu'\Sigma^{-1}\mu = T\mu'\Sigma^{-1}\mu, \qquad \textstyle\sum_t \mu'\Sigma^{-1}y_t = T\mu'\Sigma^{-1}\bar{y}, \qquad \textstyle\sum_t y_t'\Sigma^{-1}\mu = T\bar{y}'\Sigma^{-1}\mu$

    where $T$ is the sample size and $\bar{y} = T^{-1}\textstyle\sum_t y_t$ is the sample mean.

    Collecting terms in $\mu$,

    $(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) + \textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)$

    $= \mu'[\Lambda_0^{-1} + T\Sigma^{-1}]\mu - \mu'[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}] - [\mu_0'\Lambda_0^{-1} + T\bar{y}'\Sigma^{-1}]\mu$

    + terms not involving $\mu$


    Next, equate powers with terms in (2)

    $\mu'\Lambda_1^{-1}\mu = \mu'[\Lambda_0^{-1} + T\Sigma^{-1}]\mu \;\Longrightarrow\; \Lambda_1 = [\Lambda_0^{-1} + T\Sigma^{-1}]^{-1}$

    $\mu'\Lambda_1^{-1}\mu_1 = \mu'[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}]$

    $\Lambda_1^{-1}\mu_1 = \Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}$

    $\mu_1 = \Lambda_1[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}] = [\Lambda_0^{-1} + T\Sigma^{-1}]^{-1}[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}]$
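    A minimal numerical sketch of these updating formulas (the function name, data, and prior settings are illustrative, not from the lecture):

```python
# Conjugate update for the mean of iid N(mu, Sigma) data with Sigma known:
# Lam1 = (Lam0^-1 + T Sigma^-1)^-1,  mu1 = Lam1 (Lam0^-1 mu0 + T Sigma^-1 ybar).
import numpy as np

def normal_mean_update(y, mu0, Lam0, Sigma):
    T = y.shape[0]
    ybar = y.mean(axis=0)
    Lam0_inv = np.linalg.inv(Lam0)
    Sig_inv = np.linalg.inv(Sigma)
    Lam1 = np.linalg.inv(Lam0_inv + T * Sig_inv)         # posterior variance
    mu1 = Lam1 @ (Lam0_inv @ mu0 + T * Sig_inv @ ybar)   # precision-weighted posterior mean
    return mu1, Lam1

# illustrative use with fabricated data
rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
y = rng.multivariate_normal([0.5, -1.0], Sigma, size=200)
mu1, Lam1 = normal_mean_update(y, mu0=np.zeros(2), Lam0=10 * np.eye(2), Sigma=Sigma)
print(mu1)   # close to the sample mean because the prior is loose
```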


    Interpretation:

    The posterior mean is a variance-weighted average of the prior mean and the sample average.

    The relative weights depend on the prior variance $\Lambda_0$, the sample size $T$, and the variance of $y_t$ (i.e. $\Sigma$).

    If the prior is diffuse ($\Lambda_0^{-1} = 0$), the posterior mean simplifies to

    $\mu_1 = [T\Sigma^{-1}]^{-1}[T\Sigma^{-1}\bar{y}] = \bar{y}$

    Otherwise the prior pulls $\mu_1$ away from $\bar{y}$ toward $\mu_0$.

    Hence the posterior mean is biased. (Is that a bad thing?)


    Similarly, the posterior precision $\Lambda_1^{-1} = \Lambda_0^{-1} + T\Sigma^{-1}$ is the sum of the prior precision and the sample precision.

    If the prior is diffuse, the posterior variance simplifies to

    $\Lambda_1 = [T\Sigma^{-1}]^{-1} = \Sigma/T,$

    the variance of the sample mean.

    If the prior adds information ($\Lambda_0^{-1} \neq 0$), it reduces the posterior variance.


    4.2 Variance of iid normal random vector with known mean

    Now suppose $y_t \sim N(\mu, \Sigma)$ with $\mu = \mu_0$ known but $\Sigma$ unknown.

    Prior: for a Gaussian likelihood, the conjugate prior is inverse Wishart

    $p(\Sigma) = IW(S_0^{-1}, \nu_0)$

    $\propto |\Sigma|^{-(\nu_0 + n + 1)/2}\exp\{-(1/2)\,\mathrm{tr}(S_0\Sigma^{-1})\}$

    where $\nu_0$ is the prior degrees of freedom, $S_0$ is the prior scale or sum-of-squares matrix, and $n = \dim(y_t)$

    Conjugate: the posterior has the same functional form as the prior

    hence, functional form for posterior is known

    consequently, you just need to update its sufficient statistics


    Wishart is a multivariate generalization of a chi-square.

    Sufficient statistics for IW:

    $\nu_1 = \nu_0 + T, \qquad S_1 = S_0 + \textstyle\sum_t (y_t - \mu_0)(y_t - \mu_0)'$

    Deriving the updating rule

    Start with the log prior

    $\log p(\Sigma) = -(1/2)(\nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(S_0\Sigma^{-1})$

    Add to it the Gaussian log likelihood

    $\log f(y \mid \mu_0, \Sigma) = -(T/2)\log|\Sigma| - (1/2)\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)$


    The result is the log posterior kernel

    $\log k(\Sigma \mid y, \mu_0) = -(1/2)(T + \nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(S_0\Sigma^{-1}) - (1/2)\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)$

    The last term on the right-hand side is

    $\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0) = \mathrm{tr}\!\left[\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)\right] = \mathrm{tr}\!\left[\Sigma^{-1}\textstyle\sum_t (y_t - \mu_0)(y_t - \mu_0)'\right] = \mathrm{tr}[\Sigma^{-1}S]$

    where

    $S \equiv \textstyle\sum_t (y_t - \mu_0)(y_t - \mu_0)'$

    Hence the log posterior kernel is

    $\log k(\Sigma \mid y, \mu_0) = -(1/2)(T + \nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(S_0\Sigma^{-1}) - (1/2)\,\mathrm{tr}(S\Sigma^{-1})$

    $= -(1/2)(T + \nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}[\Sigma^{-1}(S_0 + S)]$

    which is the kernel of an inverse Wishart with $\nu_1 = \nu_0 + T$ and $S_1 = S_0 + S$.
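    A sketch of the resulting sufficient-statistic update, with one posterior draw taken from scipy's inverse-Wishart (the data and prior settings are fabricated for illustration):

```python
# nu1 = nu0 + T,  S1 = S0 + sum_t (y_t - mu0)(y_t - mu0)'
import numpy as np
from scipy.stats import invwishart

def iw_update(y, mu0, S0, nu0):
    resid = y - mu0                          # the mean is known in this example
    S = resid.T @ resid                      # sample sum-of-squares matrix
    return S0 + S, nu0 + y.shape[0]          # (S1, nu1)

rng = np.random.default_rng(2)
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.5]], size=500)
S1, nu1 = iw_update(y, mu0=np.zeros(2), S0=np.eye(2), nu0=4)
Sigma_draw = invwishart.rvs(df=nu1, scale=S1, random_state=0)   # one draw from the IW posterior
print(Sigma_draw)
print(S1 / (nu1 - 2 - 1))   # the IW posterior mean S1/(nu1 - n - 1); close to the true covariance here
```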


    4.3 iid normal random vector with unknown mean and variance

    This can be derived analytically, leading to a matrix-variate t-distribution (e.g. see Bauwens et al.).

    But I want to describe an alternative approach that will be used later when discussing Gibbs sampling.

    The Gibbs sampler uses a model's conditional densities to simulate its joint density.

    For the joint distribution $p(\mu, \Sigma \mid y)$, the conditionals are $p(\mu \mid \Sigma, y)$ and $p(\Sigma \mid \mu, y)$.

    The two previous examples describe each of the conditionals.


    The Gibbs sampler simulates the joint posterior by drawing sequentially from the conditionals (more later).

    Thus we can decompose a complicated probability model into a collection of simple ones. We will use this idea a lot.
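    A schematic sketch of that idea (the two draw functions are assumed to be supplied by the user, e.g. built from examples 4.1 and 4.2; nothing below is from the lecture itself):

```python
# Alternate between the full conditionals to simulate the joint posterior p(mu, Sigma | y).
def gibbs(y, mu, Sigma, draw_mu, draw_Sigma, n_draws=1000):
    draws = []
    for _ in range(n_draws):
        mu = draw_mu(Sigma, y)          # a draw from p(mu | Sigma, y)    -- example 4.1
        Sigma = draw_Sigma(mu, y)       # a draw from p(Sigma | mu, y)    -- example 4.2
        draws.append((mu, Sigma))       # after a burn-in, these approximate the joint posterior
    return draws
```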


    4.4 A Regression Model with a Known Innovation Variance

    $y_t = x_t'\beta + \varepsilon_t$

    $x_t$ is strictly exogenous and $\varepsilon_t$ is $N(0, \sigma^2)$

    $\beta$ is unknown, but $\sigma^2 = \mathrm{var}(\varepsilon_t)$ is known.

    Prior: $\beta \sim N(\beta_0, V_0)$

    $p(\beta) \propto |V_0|^{-1/2}\exp[-(1/2)(\beta - \beta_0)'V_0^{-1}(\beta - \beta_0)]$

    Likelihood: $y_t \mid x_t$ is conditionally Gaussian

    $f(y \mid \beta, \sigma^2) \propto \exp\!\left[-\dfrac{\sum_t (y_t - x_t'\beta)^2}{2\sigma^2}\right]$


    Posterior: By following steps analogous to those for example (4.1), one can show

    $p(\beta \mid y, \sigma^2) = N(\beta_1, V_1)$

    where

    $V_1 = (V_0^{-1} + \sigma^{-2}\textstyle\sum_t x_t x_t')^{-1}$

    $\beta_1 = V_1(V_0^{-1}\beta_0 + \sigma^{-2}\textstyle\sum_t x_t y_t)$

    Once again, the posterior mean is a precision-weighted average of prior and sample information.

    As the prior precision $V_0^{-1} \to 0$, this converges to the usual OLS result.
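    A minimal sketch of these formulas in matrix form (prior settings, data, and names below are illustrative):

```python
# V1 = (V0^-1 + X'X / sigma^2)^-1,  b1 = V1 (V0^-1 b0 + X'y / sigma^2)
import numpy as np

def beta_posterior(X, y, b0, V0, sigma2):
    V0_inv = np.linalg.inv(V0)
    V1 = np.linalg.inv(V0_inv + X.T @ X / sigma2)     # posterior variance
    b1 = V1 @ (V0_inv @ b0 + X.T @ y / sigma2)        # precision-weighted posterior mean
    return b1, V1

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)
b1, V1 = beta_posterior(X, y, b0=np.zeros(2), V0=100 * np.eye(2), sigma2=0.25)
print(b1)   # with a loose prior this is close to the OLS estimate
```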


    4.5 A Regression Model with Known $\beta$

    Same model, but now the innovation variance is unknown and the conditional mean parameters are known.

    This collapses to estimating the variance for a univariate iid process whose mean is known to be zero.

    Likelihood is

    $f(y \mid \beta, \sigma^2) \propto (\sigma^2)^{-T/2}\exp\!\left\{-\dfrac{\sum_t (y_t - x_t'\beta)^2}{2\sigma^2}\right\}$

    Prior: inverse-gamma density is conjugate to a normal likelihood

    $p(\sigma^2) \propto (\sigma^2)^{-(\nu_0 + 2)/2}\exp\!\left(-\dfrac{\nu_0 s_0^2}{2\sigma^2}\right)$

    where $\nu_0$ represents the prior degrees of freedom and $s_0^2$ is a prior scale parameter


    To find the posterior kernel, multiply the prior and likelihood. After some algebra,

    $p(\sigma^2 \mid y) \propto (\sigma^2)^{-(\nu_0 + T + 2)/2}\exp\!\left(-\dfrac{\nu_0 s_0^2 + d}{2\sigma^2}\right)$

    where

    $d = \textstyle\sum_t (y_t - x_t'\beta)^2$

    The posterior degrees of freedom are the sum of the prior degrees of freedom and the number of observations,

    $\nu_1 = \nu_0 + T$

    The posterior sum of squares is the prior sum of squares plus the sample sum of squares,

    $\nu_1 s_1^2 = \nu_0 s_0^2 + d$

    To find the posterior, we just need to update the scale and degrees-of-freedom parameters.
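    A sketch of this update, using the fact that the posterior above is an inverse gamma with shape $\nu_1/2$ and scale $\nu_1 s_1^2/2$ (scipy's invgamma parameterization is assumed; all data and settings below are fabricated):

```python
# nu1 = nu0 + T,  nu1*s1^2 = nu0*s0^2 + sum_t (y_t - x_t'beta)^2
import numpy as np
from scipy.stats import invgamma

def sigma2_update(X, y, beta, nu0, s02):
    d = np.sum((y - X @ beta) ** 2)       # sample sum of squares
    return nu0 + len(y), nu0 * s02 + d    # (nu1, nu1 * s1^2)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
beta = np.array([1.0, 2.0])                                 # known in this example
y = X @ beta + rng.normal(scale=0.5, size=100)
nu1, nu1_s1sq = sigma2_update(X, y, beta, nu0=5, s02=1.0)
sigma2_draw = invgamma.rvs(a=nu1 / 2, scale=nu1_s1sq / 2)   # one posterior draw of sigma^2
print(sigma2_draw, nu1_s1sq / nu1)                          # a draw and a rough point estimate
```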


    4.6 A Regression Model with Unknown $\beta$ and $\sigma^2$

    Use the Gibbs idea along with the results of the previous two subsections.

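    A minimal Gibbs sketch that combines the two previous conditionals (the priors and all names are illustrative: $\beta \sim N(b_0, V_0)$ and an inverse-gamma prior for $\sigma^2$ with parameters $\nu_0, s_0^2$):

```python
import numpy as np

def gibbs_regression(X, y, b0, V0, nu0, s02, n_draws=2000, seed=0):
    rng = np.random.default_rng(seed)
    T, k = X.shape
    V0_inv = np.linalg.inv(V0)
    beta, sigma2 = np.zeros(k), 1.0                      # arbitrary starting values
    draws = np.empty((n_draws, k + 1))
    for i in range(n_draws):
        # beta | sigma2, y  -- the conditional from example 4.4
        V1 = np.linalg.inv(V0_inv + X.T @ X / sigma2)
        b1 = V1 @ (V0_inv @ b0 + X.T @ y / sigma2)
        beta = rng.multivariate_normal(b1, V1)
        # sigma2 | beta, y  -- the conditional from example 4.5 (inverse gamma drawn as 1/gamma)
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = (nu0 * s02 + rss) / (2.0 * rng.gamma((nu0 + T) / 2.0))
        draws[i] = np.append(beta, sigma2)
    return draws                                         # discard an initial burn-in before summarizing
```

    After a burn-in, the retained draws approximate the joint posterior $p(\beta, \sigma^2 \mid y)$.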


    4.7 Multivariate normal model with Jeffreys prior and known variance

    To reflect the absence of prior information about conditional mean parameters, Jeffreys proposed

    $p(\mu \mid \Sigma) \propto |\Sigma|^{-(n+1)/2}$

    where $n = \dim(y_t)$

    This can be interpreted as the limit of the conjugate normal prior as $\Lambda_0^{-1} \to 0$.

    In this case, the posterior is

    $p(\mu \mid y, \Sigma) \propto |\Sigma|^{-(n+1)/2}\exp[-(1/2)\textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)]$

    $= N(\bar{y}, \Sigma/T)$

    This can be combined with a marginal prior for $\Sigma$, using the Gibbs idea.



    4.8 VAR with Jeffreys prior (Zellner 1970)

    $y_t = Bx_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \Sigma)$

    $x_t$ = a constant plus lags of $y_t$; $B$ = constants plus VAR parameters

    Prior:

    () = (|)()

    () =

    (|) =



    Likelihood function: Gaussian

    Full conditionals:

    $p(\Sigma \mid y, B) \propto p(\Sigma)\,f(y \mid B, \Sigma)$

    This is the same as for the covariance matrix for a multivariate normal random vector.

    $p(B \mid y, \Sigma) \propto p(B \mid \Sigma)\,f(y \mid B, \Sigma)$

    $\propto |\Sigma|^{-(T+n+1)/2}\exp\{-(1/2)\textstyle\sum_t [y_t - Bx_t]'\Sigma^{-1}[y_t - Bx_t]\}$

    This is the SUR criterion.



    It follows that $\mathrm{vec}(B)$ is conditionally normal with mean and variance

    $E[\mathrm{vec}(B) \mid y, \Sigma] = [X'(\Sigma^{-1} \otimes I)X]^{-1}X'(\Sigma^{-1} \otimes I)Y$

    $V[\mathrm{vec}(B) \mid y, \Sigma] = [X'(\Sigma^{-1} \otimes I)X]^{-1}$

    where $Y$ and $X$ represent the stacked left- and right-hand variables, respectively.
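    A brute-force sketch of these formulas using an explicit Kronecker product (fine only for small systems and meant purely as illustration; all names and the fabricated data are mine):

```python
import numpy as np

def var_coef_conditional(Y, X, Sigma):
    """Y is T x n (left-hand variables), X is T x k (right-hand variables)."""
    T, n = Y.shape
    Xk = np.kron(np.eye(n), X)                        # stacked regressors, equation by equation
    yk = Y.T.reshape(-1)                              # stacked left-hand variables, same ordering
    W = np.kron(np.linalg.inv(Sigma), np.eye(T))      # Sigma^{-1} (x) I
    V = np.linalg.inv(Xk.T @ W @ Xk)                  # conditional variance of vec(B)
    mean = V @ (Xk.T @ W @ yk)                        # conditional mean (the SUR/GLS formula)
    return mean, V

rng = np.random.default_rng(9)
T, n, k = 50, 2, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
B = rng.normal(size=(k, n))
Y = X @ B + rng.multivariate_normal(np.zeros(n), np.eye(n), size=T)
mean, V = var_coef_conditional(Y, X, Sigma=np.eye(n))
print(mean.reshape(n, k))   # row i holds the coefficients of equation i; compare with B.T
```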



    4.9 Probability of a binomial random variable

    Suppose $s_t$ takes on 2 values, with probabilities $p$ and $1-p$, respectively.

    Likelihood:

    $f(s \mid p) \propto p^{n_1}(1-p)^{n_2}$

    where $n_1$ = number of draws of state 1, $n_2$ = number of draws of state 2, and $n_1 + n_2 = T$ = total number of draws.

    Prior: The conjugate prior for a binomial likelihood is a beta density,

    $p(p) \propto p^{\alpha_{10}-1}(1-p)^{\alpha_{20}-1}$

    To find the posterior kernel, multiply the likelihood and the prior,

    $p(p \mid s) \propto p^{\,n_1 + \alpha_{10} - 1}(1-p)^{\,n_2 + \alpha_{20} - 1}$
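    A short sketch of the beta-binomial update ($\alpha_{10}, \alpha_{20}$ below are illustrative prior parameters, and the data are fabricated):

```python
import numpy as np

s = np.array([1, 1, 2, 1, 2, 2, 1, 1])          # fabricated draws of a two-state variable
n1, n2 = np.sum(s == 1), np.sum(s == 2)         # state counts
alpha10, alpha20 = 2.0, 2.0                     # Beta(2, 2) prior
p_draws = np.random.default_rng(5).beta(n1 + alpha10, n2 + alpha20, size=1000)
print(p_draws.mean())                           # approximates the posterior mean (n1 + 2) / (T + 4)
```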


    4.10 Probabilities for a multinomial random variable

    What if $s_t$ is discrete but takes on more than 2 values, with probabilities $p_1, p_2, \ldots, p_k$, respectively?

    Much like the previous case, but ...

    likelihood is multinomial

    conjugate prior is a Dirichlet density.
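    A short sketch of the multinomial-Dirichlet update for this case (state labels, prior, and data below are fabricated):

```python
import numpy as np

rng = np.random.default_rng(6)
s = rng.integers(0, 3, size=200)                       # fabricated draws from 3 states labelled 0, 1, 2
counts = np.bincount(s, minlength=3)                   # n_1, ..., n_k
alpha0 = np.ones(3)                                    # Dirichlet(1, 1, 1) prior
p_draws = rng.dirichlet(alpha0 + counts, size=1000)    # posterior is Dirichlet(alpha0 + counts)
print(p_draws.mean(axis=0))                            # close to the sample frequencies
```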



    4.11 Transition Probabilities for a Markov-Switching Model

    Suppose $s_t$ takes on 2 values, with transition probability matrix

    $P = \begin{bmatrix} p_{11} & 1-p_{11} \\ 1-p_{22} & p_{22} \end{bmatrix}$

    Likelihood:

    conditional on being in state $i$ at time $t$, $s_{t+1}$ is a binomial random variable.

    By factoring the joint density for $s^T$ into the product of conditionals (and ignoring the initial condition), the likelihood can be expressed as

    $f(s \mid p_{11}, p_{22}) \propto p_{11}^{\,n_{11}}(1-p_{11})^{\,n_{12}}\,p_{22}^{\,n_{22}}(1-p_{22})^{\,n_{21}}$

    where $n_{ij}$ denotes the number of transitions from state $i$ to state $j$.

    Prior: $p_{11}$ and $p_{22}$ are independent beta random variables

    $p(p_{11}, p_{22}) = p(p_{11})\,p(p_{22})$


    $p(p_{11}) \propto p_{11}^{\,\alpha_{11,0}-1}(1-p_{11})^{\,\alpha_{12,0}-1}$

    $p(p_{22}) \propto p_{22}^{\,\alpha_{22,0}-1}(1-p_{22})^{\,\alpha_{21,0}-1}$

    To find the posterior kernel, multiply prior and likelihood,

    $p(p_{11}, p_{22} \mid s) \propto p_{11}^{\,n_{11}+\alpha_{11,0}-1}(1-p_{11})^{\,n_{12}+\alpha_{12,0}-1}\,p_{22}^{\,n_{22}+\alpha_{22,0}-1}(1-p_{22})^{\,n_{21}+\alpha_{21,0}-1}$

    $\propto p(p_{11} \mid s)\,p(p_{22} \mid s)$

    Hence the posterior is also a product of beta densities.
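    A short sketch: count the transitions and draw $(p_{11}, p_{22})$ from their beta posteriors (the state history and prior parameters are fabricated):

```python
import numpy as np

rng = np.random.default_rng(7)
s = np.array([1, 1, 2, 2, 2, 1, 1, 2, 1, 1])       # fabricated two-state history
n = np.zeros((2, 2), dtype=int)
for t in range(len(s) - 1):
    n[s[t] - 1, s[t + 1] - 1] += 1                 # n[i, j]: transitions from state i+1 to state j+1

a11, a12, a22, a21 = 2.0, 2.0, 2.0, 2.0            # illustrative beta prior parameters
p11 = rng.beta(n[0, 0] + a11, n[0, 1] + a12)       # draw from p(p11 | s)
p22 = rng.beta(n[1, 1] + a22, n[1, 0] + a21)       # draw from p(p22 | s)
print(p11, p22)
```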


    4.12 Markov-switching models with more than 2 states

    Follow the same steps as above, but use multinomial-Dirichlet probability models.
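    A short sketch for this case: each row of the transition matrix gets an independent Dirichlet posterior built from the transition counts (names, prior, and data below are fabricated):

```python
import numpy as np

def draw_transition_matrix(s, k, alpha0, rng):
    """s takes values 0..k-1; alpha0 is a k x k matrix of Dirichlet prior parameters."""
    counts = np.zeros((k, k))
    for t in range(len(s) - 1):
        counts[s[t], s[t + 1]] += 1                    # transition counts n_ij
    return np.vstack([rng.dirichlet(alpha0[i] + counts[i]) for i in range(k)])

rng = np.random.default_rng(8)
s = rng.integers(0, 3, size=300)                       # fabricated 3-state history
P_draw = draw_transition_matrix(s, k=3, alpha0=np.ones((3, 3)), rng=rng)
print(P_draw.sum(axis=1))                              # each row of the draw sums to 1
```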


    5 Examples of objects that Bayesians want to compute

    posterior mode: $\arg\max_\theta\, p(\theta \mid y)$ (analogous to MLE)

    posterior mean

    $E(\theta \mid y) = \int \theta\, p(\theta \mid y)\, d\theta$

    Posterior variance

    $V(\theta \mid y) = \int [\theta - E(\theta \mid y)][\theta - E(\theta \mid y)]'\, p(\theta \mid y)\, d\theta$

    other posterior moments, expected loss

    $E[L(\theta)] = \int L(\theta)\, p(\theta \mid y)\, d\theta$

    Credible sets (confidence intervals)


    $P(\theta \in A \mid y) = \int_A p(\theta \mid y)\, d\theta$

    (The set $A$ can be adjusted to deliver a pre-specified probability)

    normalizing constants (marginal likelihood)

    $f(y) = \int f(y \mid \theta)\,p(\theta)\, d\theta$

    Bayes factors (analogous to likelihood ratio)

    $B_{12} = \dfrac{f(y \mid M_1)}{f(y \mid M_2)} = \dfrac{\int f(y \mid \theta_1, M_1)\,p(\theta_1)\, d\theta_1}{\int f(y \mid \theta_2, M_2)\,p(\theta_2)\, d\theta_2}$

    (Remark on plug-in v. marginalization. The latter accounts better for parameter uncertainty.)

    Posterior odds ratios (model comparison)

    $\dfrac{P(M_1 \mid y)}{P(M_2 \mid y)} = \dfrac{P(M_1)}{P(M_2)} \times \dfrac{f(y \mid M_1)}{f(y \mid M_2)} = \text{prior odds} \times \text{Bayes factor}$
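    A short sketch of how several of these objects are approximated from Monte Carlo draws of $\theta \mid y$ (the draws below are fabricated stand-ins; in practice they would come from a posterior sampler):

```python
import numpy as np

rng = np.random.default_rng(10)
theta_draws = rng.multivariate_normal([1.5, -0.5], [[0.4, 0.1], [0.1, 0.3]], size=5000)

post_mean = theta_draws.mean(axis=0)                         # approximates E(theta | y)
post_var = np.cov(theta_draws, rowvar=False)                 # approximates V(theta | y)
cred_95 = np.percentile(theta_draws, [2.5, 97.5], axis=0)    # equal-tailed 95% credible intervals
exp_loss = np.mean((theta_draws - 1.0) ** 2, axis=0)         # expected loss for an illustrative L(theta) = (theta - 1)^2
print(post_mean, post_var, cred_95, exp_loss, sep="\n")
```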

    6 Bayesian computations


    6.1 Old School (Zellner 1970)

    Form conjugate prior and likelihood.

    Derive analytical expression for posterior distribution.

    Evaluate integrals analytically

    Some interesting models can be formulated this way, but this is a restrictive class.

    6.2 New Wave


    Can break out of conjugate families by resorting to asymptotic approximations and numerical methods.

    How to do that will be our main subject over the next few lectures.