
    Introduction to Bayesian Analysis

    Timothy Cogley, NYU, Fall 2012

    September 12, 2012


    2 Classical v. Bayesian Perspectives

    Common elements

    $y$ = the observed data

    $p(y)$ = true distribution

    $f(y \mid \theta)$ = likelihood function for a parametric model

    $\theta$ = parameters of the model


    2.1 Classical perspective

    There exists a true, fixed value of $\theta$ about which we want to learn.

    We form an estimate $\hat{\theta}$, e.g. by MLE.

    We characterize the properties of $\hat{\theta}$ by imagining how it would behave across repeated samples.

    Because $\hat{\theta}$ varies across samples, we treat it as a random variable.

    The distribution of $\hat{\theta}$ across samples summarizes what we know about $\theta$.

    Want estimators or test procedures that do well on average.

    But the concept of repeated samples is often just a convenient fiction; usually there is only one sample.

    Reliance on this concept gives rise to the term frequentist.


    2.2 Bayesian perspective

    Want to characterize subjective beliefs about $\theta$.

    Posit that $\theta$ is a random variable.

    Distributions over $\theta$ summarize subjective beliefs.

    If $\theta$ is a r.v., what is $\hat{\theta}$?

    a deterministic function of the sample

    e.g., for OLS, $\hat{\beta} = (X'X)^{-1}X'y$ is a deterministic function of the data

    Only one $\hat{\theta}$ can emerge in a given sample.

    Conditional on the sample, there is no uncertainty about $\hat{\theta}$.


    Bayes theorem tells us how optimally to update beliefs after seeing data. I.e., it is about the passage from prior to posterior beliefs.

    Before looking at data:

    $p(\theta)$ = prior beliefs about $\theta$

    After looking at data:

    $p(\theta \mid y)$ = posterior beliefs about $\theta$

    These judgments are mediated by a model of how $y$ and $\theta$ are related. A model is a probability distribution over outcomes, $f(y \mid \theta)$.


    Another way to distinguish Bayesian and classical perspectives is in terms of conditioning.

    Bayes: inference is conditional on the sample at hand.

    Classical: inference depends on what could happen in other, hypothetical samples that could be drawn from the same mechanism.

    My view: although there are interesting philosophical differences, I regard them as alternative tools. Sometimes one is more convenient, sometimes another.

    One reason why I like Bayesian methods: they connect well with Bayesian decision theory. For policy-oriented macroeconomists, this is a big advantage.


    3 Bayes Theorem

    Can factor a joint density into the product of a conditional and a marginal

    $f(y, \theta) = f(y \mid \theta)\,p(\theta) = p(\theta \mid y)\,f(y)$

    Can also marginalize a joint density by integrating with respect to the other argument

    $f(y) = \int f(y, \theta)\, d\theta$

    Bayes theorem follows from these two facts

    Start with the joint density $f(y, \theta)$

    Can factor in two ways

    $f(y, \theta) = f(y \mid \theta)\,p(\theta) = p(\theta \mid y)\,f(y)$


    Divide through by $f(y)$:

    $p(\theta \mid y) = \dfrac{f(y \mid \theta)\,p(\theta)}{f(y)}$

    The marginal $f(y)$ can also be expressed as

    $f(y) = \int f(y, \theta)\, d\theta = \int f(y \mid \theta)\,p(\theta)\, d\theta$

    Hence the posterior is

    $p(\theta \mid y) = \dfrac{f(y \mid \theta)\,p(\theta)}{\int f(y \mid \theta)\,p(\theta)\, d\theta} \propto f(y \mid \theta)\,p(\theta)$


    Terminology:

    $k(\theta, y) = f(y \mid \theta)\,p(\theta)$ is known as the posterior kernel

    $f(y) = \int f(y \mid \theta)\,p(\theta)\, d\theta$ is known as the normalizing constant or marginal likelihood

    $p(\theta \mid y)$ is known as the posterior density.


    Remarks:

    Notice that $f(y)$ does not depend on $\theta$. All the information about $\theta$ is contained in the kernel.

    The posterior kernel does not necessarily integrate to 1, so it is not a proper probability density.

    When we divide the kernel by the normalizing constant, we ensure that the result does integrate to 1.

    Important to verify that $f(y)$ exists

    it does when priors are proper, i.e. when $p(\theta)$ integrates to 1

    problems sometimes arise when $p(\theta)$ is improper
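    To make the kernel/normalizing-constant distinction concrete, here is a minimal numerical sketch (not from the lecture; the model, prior, and all names below are illustrative). It evaluates a posterior kernel for a univariate normal mean on a grid and divides by a grid approximation of the normalizing constant so that the result integrates to 1.

```python
# Minimal sketch: normalize a posterior kernel on a grid.
# Illustrative model: y_t ~ N(theta, 1) iid, prior theta ~ N(0, 4).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=20)            # fabricated data

theta = np.linspace(-4.0, 6.0, 2001)                   # grid over the parameter
log_like = -0.5 * ((y[:, None] - theta[None, :]) ** 2).sum(axis=0)
log_prior = -0.5 * theta ** 2 / 4.0
log_kernel = log_like + log_prior                      # log f(y|theta) + log p(theta), up to constants

kernel = np.exp(log_kernel - log_kernel.max())         # subtract the max before exponentiating (stability)
norm_const = np.trapz(kernel, theta)                   # grid approximation of the normalizing constant
posterior = kernel / norm_const                        # proper density on the grid

print(np.trapz(posterior, theta))                      # ~1.0: the posterior now integrates to 1
print(theta[np.argmax(posterior)])                     # posterior mode
```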


    4 Examples

    4.1 Mean of iid normal random vector with known variance

    Suppose $y_t \sim N(\mu, \Sigma)$, iid, with $\Sigma$ known but $\mu$ unknown.

    Prior: $p(\mu) = N(\mu_0, \Lambda_0)$. (Interpret the prior precision $\Lambda_0^{-1}$ for the univariate case.)

    $p(\mu \mid \mu_0, \Lambda_0) \propto \exp[-(1/2)(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0)]$

    Likelihood:

    $f(y \mid \mu, \Sigma) \propto \exp[-(1/2)\textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)]$

    Posterior Kernel: multiply the prior and likelihood


    $p(\mu \mid y, \mu_0, \Lambda_0) \propto f(y \mid \mu, \Sigma)\,p(\mu \mid \mu_0, \Lambda_0)$

    $= \exp\{-(1/2)[(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) + \textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)]\} \qquad (1)$

    Notice that this is a quadratic within an exponential, making it a Gaussian kernel.

    It follows that we can express the posterior kernel as

    $p(\mu \mid y, \mu_0, \Lambda_0) \propto \exp[-(1/2)(\mu - \mu_1)'\Lambda_1^{-1}(\mu - \mu_1)] \qquad (2)$

    To find expressions for $\mu_1, \Lambda_1$, expand (1) and equate powers with (2).

    $(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) = \mu'\Lambda_0^{-1}\mu + \mu_0'\Lambda_0^{-1}\mu_0 - \mu'\Lambda_0^{-1}\mu_0 - \mu_0'\Lambda_0^{-1}\mu$


    $\textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu) = \textstyle\sum_t [\mu'\Sigma^{-1}\mu + y_t'\Sigma^{-1}y_t - \mu'\Sigma^{-1}y_t - y_t'\Sigma^{-1}\mu]$

    Notice that

    $\textstyle\sum_t \mu'\Sigma^{-1}\mu = T\mu'\Sigma^{-1}\mu, \qquad \textstyle\sum_t \mu'\Sigma^{-1}y_t = T\mu'\Sigma^{-1}\bar{y}, \qquad \textstyle\sum_t y_t'\Sigma^{-1}\mu = T\bar{y}'\Sigma^{-1}\mu$

    where $T$ is the sample size and $\bar{y} = T^{-1}\textstyle\sum_t y_t$ is the sample mean.

    Collecting terms in $\mu$,

    $(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) + \textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)$

    $= \mu'[\Lambda_0^{-1} + T\Sigma^{-1}]\mu - \mu'[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}] - [\mu_0'\Lambda_0^{-1} + T\bar{y}'\Sigma^{-1}]\mu$

    + terms not involving $\mu$


    Next, equate powers with terms in (2)

    $\mu'\Lambda_1^{-1}\mu = \mu'[\Lambda_0^{-1} + T\Sigma^{-1}]\mu \;\Longrightarrow\; \Lambda_1 = [\Lambda_0^{-1} + T\Sigma^{-1}]^{-1}$

    $\mu'\Lambda_1^{-1}\mu_1 = \mu'[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}]$

    $\Lambda_1^{-1}\mu_1 = \Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}$

    $\mu_1 = \Lambda_1[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}] = [\Lambda_0^{-1} + T\Sigma^{-1}]^{-1}[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}]$
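    A minimal numerical sketch of these updating formulas (the function name, data, and prior settings are illustrative, not from the lecture):

```python
# Conjugate update for the mean of iid N(mu, Sigma) data with Sigma known:
# Lam1 = (Lam0^-1 + T Sigma^-1)^-1,  mu1 = Lam1 (Lam0^-1 mu0 + T Sigma^-1 ybar).
import numpy as np

def normal_mean_update(y, mu0, Lam0, Sigma):
    T = y.shape[0]
    ybar = y.mean(axis=0)
    Lam0_inv = np.linalg.inv(Lam0)
    Sig_inv = np.linalg.inv(Sigma)
    Lam1 = np.linalg.inv(Lam0_inv + T * Sig_inv)         # posterior variance
    mu1 = Lam1 @ (Lam0_inv @ mu0 + T * Sig_inv @ ybar)   # precision-weighted posterior mean
    return mu1, Lam1

# illustrative use with fabricated data
rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
y = rng.multivariate_normal([0.5, -1.0], Sigma, size=200)
mu1, Lam1 = normal_mean_update(y, mu0=np.zeros(2), Lam0=10 * np.eye(2), Sigma=Sigma)
print(mu1)   # close to the sample mean because the prior is loose
```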


    Interpretation:

    The posterior mean is a variance-weighted average of the prior mean and the sample average.

    The relative weights depend on the prior variance $\Lambda_0$, the sample size $T$, and the variance of $y_t$ (i.e. $\Sigma$).

    If the prior is diffuse ($\Lambda_0^{-1} = 0$), the posterior mean simplifies to

    $\mu_1 = [T\Sigma^{-1}]^{-1}[T\Sigma^{-1}\bar{y}] = \bar{y}$

    Otherwise the prior pulls $\mu_1$ away from $\bar{y}$ toward $\mu_0$.

    Hence the posterior mean is biased. (Is that a bad thing?)


    Similarly, the posterior precision $\Lambda_1^{-1} = \Lambda_0^{-1} + T\Sigma^{-1}$ is the sum of the prior precision and the sample precision.

    If the prior is diffuse, the posterior variance simplifies to

    $\Lambda_1 = [T\Sigma^{-1}]^{-1} = \Sigma/T,$

    the variance of the sample mean.

    If the prior adds information ($\Lambda_0^{-1} \neq 0$), it reduces the posterior variance.


    4.2 Variance of iid normal random vector with known mean

    Now suppose $y_t \sim N(\mu, \Sigma)$ with $\mu = \mu_0$ known but $\Sigma$ unknown.

    Prior: for a Gaussian likelihood, the conjugate prior is inverse Wishart

    $p(\Sigma) = IW(S_0^{-1}, \nu_0)$

    $\propto |\Sigma|^{-(\nu_0 + n + 1)/2}\exp\{-(1/2)\,\mathrm{tr}(S_0\Sigma^{-1})\}$

    where $\nu_0$ is the prior degrees of freedom, $S_0$ is the prior scale or sum-of-squares matrix, and $n = \dim(y_t)$

    Conjugate: the posterior has the same functional form as the prior

    hence, functional form for posterior is known

    consequently, you just need to update its sufficient statistics


    Wishart is a multivariate generalization of a chi-square.

    Sufficient statistics for IW:

    $\nu_1 = \nu_0 + T, \qquad S_1 = S_0 + \textstyle\sum_t (y_t - \mu_0)(y_t - \mu_0)'$

    Deriving the updating rule

    Start with the log prior

    $\log p(\Sigma) = -(1/2)(\nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(S_0\Sigma^{-1})$

    Add to it the Gaussian log likelihood

    $\log f(y \mid \mu_0, \Sigma) = -(T/2)\log|\Sigma| - (1/2)\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)$


    The result is the log posterior kernel

    $\log k(\Sigma \mid y, \mu_0) = -(1/2)(T + \nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(S_0\Sigma^{-1}) - (1/2)\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)$

    The last term on the right-hand side is

    $\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0) = \mathrm{tr}\!\left[\textstyle\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)\right] = \mathrm{tr}\!\left[\Sigma^{-1}\textstyle\sum_t (y_t - \mu_0)(y_t - \mu_0)'\right] = \mathrm{tr}[\Sigma^{-1}S]$

    where

    $S \equiv \textstyle\sum_t (y_t - \mu_0)(y_t - \mu_0)'$

    Hence the log posterior kernel is

    $\log k(\Sigma \mid y, \mu_0) = -(1/2)(T + \nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(S_0\Sigma^{-1}) - (1/2)\,\mathrm{tr}(S\Sigma^{-1})$

    $= -(1/2)(T + \nu_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}[\Sigma^{-1}(S_0 + S)]$

    which is the kernel of an inverse Wishart with $\nu_1 = \nu_0 + T$ and $S_1 = S_0 + S$.
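    A sketch of the resulting sufficient-statistic update, with one posterior draw taken from scipy's inverse-Wishart (the data and prior settings are fabricated for illustration):

```python
# nu1 = nu0 + T,  S1 = S0 + sum_t (y_t - mu0)(y_t - mu0)'
import numpy as np
from scipy.stats import invwishart

def iw_update(y, mu0, S0, nu0):
    resid = y - mu0                          # the mean is known in this example
    S = resid.T @ resid                      # sample sum-of-squares matrix
    return S0 + S, nu0 + y.shape[0]          # (S1, nu1)

rng = np.random.default_rng(2)
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.5]], size=500)
S1, nu1 = iw_update(y, mu0=np.zeros(2), S0=np.eye(2), nu0=4)
Sigma_draw = invwishart.rvs(df=nu1, scale=S1, random_state=0)   # one draw from the IW posterior
print(Sigma_draw)
print(S1 / (nu1 - 2 - 1))   # the IW posterior mean S1/(nu1 - n - 1); close to the true covariance here
```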


    4.3 iid normal random vector with unknown mean and variance

    This can be derived analytically, leading to a matrix-variate t-distribution (e.g. see Bauwens et al.).

    But I want to describe an alternative approach that will be used later when discussing Gibbs sampling.

    The Gibbs sampler uses a model's conditional densities to simulate its joint density.

    For the joint distribution $p(\mu, \Sigma \mid y)$, the conditionals are $p(\mu \mid \Sigma, y)$ and $p(\Sigma \mid \mu, y)$.

    The two previous examples describe each of the conditionals.


    The Gibbs sampler simulates the joint posterior by drawing sequentially from the conditionals (more later).

    Thus we can decompose a complicated probability model into a collection of simple ones. We will use this idea a lot.
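    A schematic sketch of that idea (the two draw functions are assumed to be supplied by the user, e.g. built from examples 4.1 and 4.2; nothing below is from the lecture itself):

```python
# Alternate between the full conditionals to simulate the joint posterior p(mu, Sigma | y).
def gibbs(y, mu, Sigma, draw_mu, draw_Sigma, n_draws=1000):
    draws = []
    for _ in range(n_draws):
        mu = draw_mu(Sigma, y)          # a draw from p(mu | Sigma, y)    -- example 4.1
        Sigma = draw_Sigma(mu, y)       # a draw from p(Sigma | mu, y)    -- example 4.2
        draws.append((mu, Sigma))       # after a burn-in, these approximate the joint posterior
    return draws
```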


    4.4 A Regression Model with a Known Innovation Variance

    $y_t = x_t'\beta + \varepsilon_t$

    $x_t$ is strictly exogenous and $\varepsilon_t$ is $N(0, \sigma^2)$

    $\beta$ is unknown, but $\sigma^2 = \mathrm{var}(\varepsilon_t)$ is known.

    Prior: $\beta \sim N(\beta_0, V_0)$

    $p(\beta) \propto |V_0|^{-1/2}\exp[-(1/2)(\beta - \beta_0)'V_0^{-1}(\beta - \beta_0)]$

    Likelihood: $y_t \mid x_t$ is conditionally Gaussian

    $f(y \mid \beta, \sigma^2) \propto \exp\!\left[-\dfrac{\sum_t (y_t - x_t'\beta)^2}{2\sigma^2}\right]$


    Posterior: By following steps analogous to those for example (4.1), one can show

    $p(\beta \mid y, \sigma^2) = N(\beta_1, V_1)$

    where

    $V_1 = (V_0^{-1} + \sigma^{-2}\textstyle\sum_t x_t x_t')^{-1}$

    $\beta_1 = V_1(V_0^{-1}\beta_0 + \sigma^{-2}\textstyle\sum_t x_t y_t)$

    Once again, the posterior mean is a precision-weighted average of prior and sample information.

    As the prior precision $V_0^{-1} \to 0$, this converges to the usual OLS result.
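    A minimal sketch of these formulas in matrix form (prior settings, data, and names below are illustrative):

```python
# V1 = (V0^-1 + X'X / sigma^2)^-1,  b1 = V1 (V0^-1 b0 + X'y / sigma^2)
import numpy as np

def beta_posterior(X, y, b0, V0, sigma2):
    V0_inv = np.linalg.inv(V0)
    V1 = np.linalg.inv(V0_inv + X.T @ X / sigma2)     # posterior variance
    b1 = V1 @ (V0_inv @ b0 + X.T @ y / sigma2)        # precision-weighted posterior mean
    return b1, V1

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)
b1, V1 = beta_posterior(X, y, b0=np.zeros(2), V0=100 * np.eye(2), sigma2=0.25)
print(b1)   # with a loose prior this is close to the OLS estimate
```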


    4.5 A Regression Model with Known $\beta$

    Same model, but now the innovation variance is unknown and the conditional mean parameters are known.

    This collapses to estimating the variance for a univariate iid process whose mean is known to be zero.

    Likelihood is

    $f(y \mid \beta, \sigma^2) \propto (\sigma^2)^{-T/2}\exp\!\left\{-\dfrac{\sum_t (y_t - x_t'\beta)^2}{2\sigma^2}\right\}$

    Prior: inverse-gamma density is conjugate to a normal likelihood

    $p(\sigma^2) \propto (\sigma^2)^{-(\nu_0 + 2)/2}\exp\!\left(-\dfrac{\nu_0 s_0^2}{2\sigma^2}\right)$

    where $\nu_0$ represents the prior degrees of freedom and $s_0^2$ is a prior scale parameter


    To find the posterior kernel, multiply the prior and likelihood. After some algebra,

    $p(\sigma^2 \mid y) \propto (\sigma^2)^{-(\nu_0 + T + 2)/2}\exp\!\left(-\dfrac{\nu_0 s_0^2 + d}{2\sigma^2}\right)$

    where

    $d = \textstyle\sum_t (y_t - x_t'\beta)^2$

    The posterior degrees of freedom are the sum of the prior degrees of freedom and the number of observations,

    $\nu_1 = \nu_0 + T$

    The posterior sum of squares is the prior sum of squares plus the sample sum of squares,

    $\nu_1 s_1^2 = \nu_0 s_0^2 + d$

    To find the posterior, we just need to update the scale and degrees-of-freedom parameters.
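    A sketch of this update, using the fact that the posterior above is an inverse gamma with shape $\nu_1/2$ and scale $\nu_1 s_1^2/2$ (scipy's invgamma parameterization is assumed; all data and settings below are fabricated):

```python
# nu1 = nu0 + T,  nu1*s1^2 = nu0*s0^2 + sum_t (y_t - x_t'beta)^2
import numpy as np
from scipy.stats import invgamma

def sigma2_update(X, y, beta, nu0, s02):
    d = np.sum((y - X @ beta) ** 2)       # sample sum of squares
    return nu0 + len(y), nu0 * s02 + d    # (nu1, nu1 * s1^2)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
beta = np.array([1.0, 2.0])                                 # known in this example
y = X @ beta + rng.normal(scale=0.5, size=100)
nu1, nu1_s1sq = sigma2_update(X, y, beta, nu0=5, s02=1.0)
sigma2_draw = invgamma.rvs(a=nu1 / 2, scale=nu1_s1sq / 2)   # one posterior draw of sigma^2
print(sigma2_draw, nu1_s1sq / nu1)                          # a draw and a rough point estimate
```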


    4.6 A Regression Model with Unknown $\beta$ and $\sigma^2$

    Use the Gibbs idea along with the results of the previous two subsections.

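    A minimal Gibbs sketch that combines the two previous conditionals (the priors and all names are illustrative: $\beta \sim N(b_0, V_0)$ and an inverse-gamma prior for $\sigma^2$ with parameters $\nu_0, s_0^2$):

```python
import numpy as np

def gibbs_regression(X, y, b0, V0, nu0, s02, n_draws=2000, seed=0):
    rng = np.random.default_rng(seed)
    T, k = X.shape
    V0_inv = np.linalg.inv(V0)
    beta, sigma2 = np.zeros(k), 1.0                      # arbitrary starting values
    draws = np.empty((n_draws, k + 1))
    for i in range(n_draws):
        # beta | sigma2, y  -- the conditional from example 4.4
        V1 = np.linalg.inv(V0_inv + X.T @ X / sigma2)
        b1 = V1 @ (V0_inv @ b0 + X.T @ y / sigma2)
        beta = rng.multivariate_normal(b1, V1)
        # sigma2 | beta, y  -- the conditional from example 4.5 (inverse gamma drawn as 1/gamma)
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = (nu0 * s02 + rss) / (2.0 * rng.gamma((nu0 + T) / 2.0))
        draws[i] = np.append(beta, sigma2)
    return draws                                         # discard an initial burn-in before summarizing
```

    After a burn-in, the retained draws approximate the joint posterior $p(\beta, \sigma^2 \mid y)$.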


    4.7 Multivariate normal model with Jeffreys prior and known variance

    To reflect the absence of prior information about conditional mean parameters, Jeffreys proposed

    $p(\mu \mid \Sigma) \propto |\Sigma|^{-(n+1)/2}$

    where $n = \dim(y_t)$

    This can be interpreted as the limit of the conjugate normal prior as $\Lambda_0^{-1} \to 0$.

    In this case, the posterior is

    $p(\mu \mid y, \Sigma) \propto |\Sigma|^{-(n+1)/2}\exp[-(1/2)\textstyle\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)]$

    $= N(\bar{y}, \Sigma/T)$

    This can be combined with a marginal prior for $\Sigma$, using the Gibbs idea.



    4.8 VAR with Jeffreys prior (Zellner 1970)

    $y_t = Bx_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \Sigma)$

    $x_t$ = a constant plus lags of $y_t$; $B$ = constants plus VAR parameters

    Prior:

    () = (|)()

    () =

    (|) =



    Likelihood function: Gaussian

    Full conditionals:

    $p(\Sigma \mid y, B) \propto p(\Sigma)\,f(y \mid B, \Sigma)$

    This is the same as for the covariance matrix for a multivariate normal random vector.

    $p(B \mid y, \Sigma) \propto p(B \mid \Sigma)\,f(y \mid B, \Sigma)$

    $\propto |\Sigma|^{-(T+n+1)/2}\exp\{-(1/2)\textstyle\sum_t [y_t - Bx_t]'\Sigma^{-1}[y_t - Bx_t]\}$

    This is the SUR criterion.



    It follows that $\mathrm{vec}(B)$ is conditionally normal with mean and variance

    $E[\mathrm{vec}(B) \mid y, \Sigma] = [X'(\Sigma^{-1} \otimes I)X]^{-1}X'(\Sigma^{-1} \otimes I)Y$

    $V[\mathrm{vec}(B) \mid y, \Sigma] = [X'(\Sigma^{-1} \otimes I)X]^{-1}$

    where $Y$ and $X$ represent the stacked left- and right-hand variables, respectively.
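    A brute-force sketch of these formulas using an explicit Kronecker product (fine only for small systems and meant purely as illustration; all names and the fabricated data are mine):

```python
import numpy as np

def var_coef_conditional(Y, X, Sigma):
    """Y is T x n (left-hand variables), X is T x k (right-hand variables)."""
    T, n = Y.shape
    Xk = np.kron(np.eye(n), X)                        # stacked regressors, equation by equation
    yk = Y.T.reshape(-1)                              # stacked left-hand variables, same ordering
    W = np.kron(np.linalg.inv(Sigma), np.eye(T))      # Sigma^{-1} (x) I
    V = np.linalg.inv(Xk.T @ W @ Xk)                  # conditional variance of vec(B)
    mean = V @ (Xk.T @ W @ yk)                        # conditional mean (the SUR/GLS formula)
    return mean, V

rng = np.random.default_rng(9)
T, n, k = 50, 2, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
B = rng.normal(size=(k, n))
Y = X @ B + rng.multivariate_normal(np.zeros(n), np.eye(n), size=T)
mean, V = var_coef_conditional(Y, X, Sigma=np.eye(n))
print(mean.reshape(n, k))   # row i holds the coefficients of equation i; compare with B.T
```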



    4.9 Probability of a binomial random variable

    Suppose $s_t$ takes on 2 values, with probabilities $p$ and $1-p$, respectively.

    Likelihood:

    $f(s \mid p) \propto p^{n_1}(1-p)^{n_2}$

    where $n_1$ = number of draws of state 1, $n_2$ = number of draws of state 2, and $n_1 + n_2 = T$ = total number of draws.

    Prior: The conjugate prior for a binomial likelihood is a beta density,

    $p(p) \propto p^{\alpha_{10}-1}(1-p)^{\alpha_{20}-1}$

    To find the posterior kernel, multiply the likelihood and the prior,

    $p(p \mid s) \propto p^{\,n_1 + \alpha_{10} - 1}(1-p)^{\,n_2 + \alpha_{20} - 1}$
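    A short sketch of the beta-binomial update ($\alpha_{10}, \alpha_{20}$ below are illustrative prior parameters, and the data are fabricated):

```python
import numpy as np

s = np.array([1, 1, 2, 1, 2, 2, 1, 1])          # fabricated draws of a two-state variable
n1, n2 = np.sum(s == 1), np.sum(s == 2)         # state counts
alpha10, alpha20 = 2.0, 2.0                     # Beta(2, 2) prior
p_draws = np.random.default_rng(5).beta(n1 + alpha10, n2 + alpha20, size=1000)
print(p_draws.mean())                           # approximates the posterior mean (n1 + 2) / (T + 4)
```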


    4.10 Probabilities for a multinomial random variable

    What if $s_t$ is discrete but takes on more than 2 values, with probabilities $p_1, p_2, \ldots, p_k$, respectively?

    Much like the previous case, but ...

    likelihood is multinomial

    conjugate prior is a Dirichlet density.
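    A short sketch of the multinomial-Dirichlet update for this case (state labels, prior, and data below are fabricated):

```python
import numpy as np

rng = np.random.default_rng(6)
s = rng.integers(0, 3, size=200)                       # fabricated draws from 3 states labelled 0, 1, 2
counts = np.bincount(s, minlength=3)                   # n_1, ..., n_k
alpha0 = np.ones(3)                                    # Dirichlet(1, 1, 1) prior
p_draws = rng.dirichlet(alpha0 + counts, size=1000)    # posterior is Dirichlet(alpha0 + counts)
print(p_draws.mean(axis=0))                            # close to the sample frequencies
```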



    4.11 Transition Probabilities for a Markov-Switching Model

    Suppose $s_t$ takes on 2 values, with transition probability matrix

    $P = \begin{bmatrix} p_{11} & 1-p_{11} \\ 1-p_{22} & p_{22} \end{bmatrix}$

    Likelihood:

    conditional on being in state $i$ at time $t$, $s_{t+1}$ is a binomial random variable.

    By factoring the joint density for $s^T$ into the product of conditionals (and ignoring the initial condition), the likelihood can be expressed as

    $f(s \mid p_{11}, p_{22}) \propto p_{11}^{\,n_{11}}(1-p_{11})^{\,n_{12}}\,p_{22}^{\,n_{22}}(1-p_{22})^{\,n_{21}}$

    where $n_{ij}$ denotes the number of transitions from state $i$ to state $j$.

    Prior: $p_{11}$ and $p_{22}$ are independent beta random variables

    $p(p_{11}, p_{22}) = p(p_{11})\,p(p_{22})$


    $p(p_{11}) \propto p_{11}^{\,\alpha_{11,0}-1}(1-p_{11})^{\,\alpha_{12,0}-1}$

    $p(p_{22}) \propto p_{22}^{\,\alpha_{22,0}-1}(1-p_{22})^{\,\alpha_{21,0}-1}$

    To find the posterior kernel, multiply prior and likelihood,

    $p(p_{11}, p_{22} \mid s) \propto p_{11}^{\,n_{11}+\alpha_{11,0}-1}(1-p_{11})^{\,n_{12}+\alpha_{12,0}-1}\,p_{22}^{\,n_{22}+\alpha_{22,0}-1}(1-p_{22})^{\,n_{21}+\alpha_{21,0}-1}$

    $\propto p(p_{11} \mid s)\,p(p_{22} \mid s)$

    Hence the posterior is also a product of beta densities.
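    A short sketch: count the transitions and draw $(p_{11}, p_{22})$ from their beta posteriors (the state history and prior parameters are fabricated):

```python
import numpy as np

rng = np.random.default_rng(7)
s = np.array([1, 1, 2, 2, 2, 1, 1, 2, 1, 1])       # fabricated two-state history
n = np.zeros((2, 2), dtype=int)
for t in range(len(s) - 1):
    n[s[t] - 1, s[t + 1] - 1] += 1                 # n[i, j]: transitions from state i+1 to state j+1

a11, a12, a22, a21 = 2.0, 2.0, 2.0, 2.0            # illustrative beta prior parameters
p11 = rng.beta(n[0, 0] + a11, n[0, 1] + a12)       # draw from p(p11 | s)
p22 = rng.beta(n[1, 1] + a22, n[1, 0] + a21)       # draw from p(p22 | s)
print(p11, p22)
```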


    4.12 Markov-switching models with more than 2 states

    Follow the same steps as above, but use multinomial-Dirichlet probability models.
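    A short sketch for this case: each row of the transition matrix gets an independent Dirichlet posterior built from the transition counts (names, prior, and data below are fabricated):

```python
import numpy as np

def draw_transition_matrix(s, k, alpha0, rng):
    """s takes values 0..k-1; alpha0 is a k x k matrix of Dirichlet prior parameters."""
    counts = np.zeros((k, k))
    for t in range(len(s) - 1):
        counts[s[t], s[t + 1]] += 1                    # transition counts n_ij
    return np.vstack([rng.dirichlet(alpha0[i] + counts[i]) for i in range(k)])

rng = np.random.default_rng(8)
s = rng.integers(0, 3, size=300)                       # fabricated 3-state history
P_draw = draw_transition_matrix(s, k=3, alpha0=np.ones((3, 3)), rng=rng)
print(P_draw.sum(axis=1))                              # each row of the draw sums to 1
```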


    5 Examples of objects that Bayesians want to compute

    posterior mode: $\arg\max_\theta\, p(\theta \mid y)$ (analogous to MLE)

    posterior mean

    $E(\theta \mid y) = \int \theta\, p(\theta \mid y)\, d\theta$

    Posterior variance

    $V(\theta \mid y) = \int [\theta - E(\theta \mid y)][\theta - E(\theta \mid y)]'\, p(\theta \mid y)\, d\theta$

    other posterior moments, expected loss

    $E[L(\theta)] = \int L(\theta)\, p(\theta \mid y)\, d\theta$

    Credible sets (confidence intervals)


    $P(\theta \in A \mid y) = \int_A p(\theta \mid y)\, d\theta$

    (The set $A$ can be adjusted to deliver a pre-specified probability)

    normalizing constants (marginal likelihood)

    $f(y) = \int f(y \mid \theta)\,p(\theta)\, d\theta$

    Bayes factors (analogous to likelihood ratio)

    $B_{12} = \dfrac{f(y \mid M_1)}{f(y \mid M_2)} = \dfrac{\int f(y \mid \theta_1, M_1)\,p(\theta_1)\, d\theta_1}{\int f(y \mid \theta_2, M_2)\,p(\theta_2)\, d\theta_2}$

    (Remark on plug-in v. marginalization. The latter accounts better for parameter uncertainty.)

    Posterior odds ratios (model comparison)

    $\dfrac{P(M_1 \mid y)}{P(M_2 \mid y)} = \dfrac{P(M_1)}{P(M_2)} \times \dfrac{f(y \mid M_1)}{f(y \mid M_2)} = \text{prior odds} \times \text{Bayes factor}$
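    A short sketch of how several of these objects are approximated from Monte Carlo draws of $\theta \mid y$ (the draws below are fabricated stand-ins; in practice they would come from a posterior sampler):

```python
import numpy as np

rng = np.random.default_rng(10)
theta_draws = rng.multivariate_normal([1.5, -0.5], [[0.4, 0.1], [0.1, 0.3]], size=5000)

post_mean = theta_draws.mean(axis=0)                         # approximates E(theta | y)
post_var = np.cov(theta_draws, rowvar=False)                 # approximates V(theta | y)
cred_95 = np.percentile(theta_draws, [2.5, 97.5], axis=0)    # equal-tailed 95% credible intervals
exp_loss = np.mean((theta_draws - 1.0) ** 2, axis=0)         # expected loss for an illustrative L(theta) = (theta - 1)^2
print(post_mean, post_var, cred_95, exp_loss, sep="\n")
```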

    6 Bayesian computations


    6.1 Old School (Zellner 1970)

    Form conjugate prior and likelihood.

    Derive analytical expression for posterior distribution.

    Evaluate integrals analytically

    Some interesting models can be formulated this way, but this is a restrictive class.

    6.2 New Wave


    Can break out of conjugate families by resorting to asymptotic approximations and numerical methods.

    How to do that will be our main subject over the next few lectures.