Introduction to Bayesian Analysis
Timothy Cogley, NYU, Fall 2012
September 12, 2012
2 Classical v. Bayesian Perspectives
Common elements
y = data
f(y) = true distribution
f(y|θ) = likelihood function for a parametric model
θ = parameters of the model
2.1 Classical perspective
There exists a true, fixed value of θ about which we want to learn.
We form an estimate θ̂, e.g. by MLE.
We characterize the properties of θ̂ by imagining how it would behave across repeated samples.
Because θ̂ varies across samples, we treat it as a random variable.
The distribution of θ̂ across samples summarizes what we know about θ.
Want estimators or test procedures that do well on average.
But the concept of repeated samples is often just a convenient fiction; usually there is only one sample.
Reliance on this concept gives rise to the term frequentist.
2.2 Bayesian perspective
Want to characterize subjective beliefs about θ.
Posit that θ is a random variable.
Distributions over θ summarize subjective beliefs.
If θ is a r.v., what is θ̂?
θ̂ is a deterministic function of the sample.
e.g., for OLS, β̂ = (X′X)⁻¹X′y is a deterministic function of the data.
Only one θ̂ can emerge in a given sample.
Conditional on the sample, there is no uncertainty about θ̂.
Bayes theorem tells us how optimally to update beliefs after seeing data. I.e., it is
about the passage from prior to posterior beliefs.
Before looking at data: p(θ) = prior beliefs about θ.
After looking at data: p(θ|y) = posterior beliefs about θ.
These judgments are mediated by a model of how y and θ are related. A model
is a probability distribution over outcomes, f(y|θ).
Another way to distinguish Bayesian and classical perspectives is in terms of conditioning.
Bayes: inference is conditional on the sample at hand.
Classical: inference depends on what could happen in other, hypothetical samples that could be drawn from the same mechanism.
My view: although there are interesting philosophical differences, I regard them as alternative tools. Sometimes one is more convenient, sometimes another.
One reason why I like Bayesian methods: they connect well with Bayesian decision theory. For policy-oriented macroeconomists, this is a big advantage.
3 Bayes Theorem
Can factor a joint density into the product of a conditional and a marginal:
f(y, θ) = f(y|θ)p(θ) = p(θ|y)f(y)
Can also marginalize a joint density by integrating wrt the other argument:
f(y) = ∫ f(y, θ) dθ
Bayes theorem follows from these two facts.
Start with the joint density f(y, θ).
Can factor it in two ways:
f(y, θ) = f(y|θ)p(θ) = p(θ|y)f(y)
Divide through by f(y):
p(θ|y) = f(y|θ)p(θ) / f(y)
The marginal f(y) can also be expressed as
f(y) = ∫ f(y, θ) dθ = ∫ f(y|θ)p(θ) dθ
Hence the posterior is
p(θ|y) = f(y|θ)p(θ) / ∫ f(y|θ)p(θ) dθ ∝ f(y|θ)p(θ)
Terminology:
f(y, θ) = f(y|θ)p(θ) is known as the posterior kernel.
f(y) = ∫ f(y|θ)p(θ) dθ is known as the normalizing constant or marginal likelihood.
p(θ|y) is known as the posterior density.
Remarks:
Notice that f(y) does not depend on θ. All the information about θ is contained in the kernel.
The posterior kernel does not necessarily integrate to 1, so it is not a proper probability density.
When we divide the kernel by the normalizing constant, we ensure that the result does integrate to 1.
Important to verify that f(y) exists:
it does when priors are proper, i.e. when p(θ) integrates to 1;
problems sometimes arise when p(θ) is improper.
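To make the kernel vs. normalizing-constant distinction concrete, here is a minimal numerical sketch (not from the original notes; the data values, the N(θ, 1) likelihood, and the N(0, 4) prior are illustrative assumptions). It evaluates the posterior kernel f(y|θ)p(θ) on a grid and divides by a quadrature approximation of f(y):

```python
import numpy as np
from scipy import stats

# Illustrative sample and prior (assumed values, not from the notes)
y = np.array([1.2, 0.4, 2.1, 1.7, 0.9])      # observed data
prior = stats.norm(loc=0.0, scale=2.0)        # p(theta)

theta = np.linspace(-4.0, 6.0, 2001)          # grid over theta
# log f(y | theta) for an iid N(theta, 1) model, evaluated at each grid point
loglike = np.array([stats.norm(m, 1.0).logpdf(y).sum() for m in theta])
kernel = np.exp(loglike) * prior.pdf(theta)   # posterior kernel f(y|theta) p(theta)

# Normalizing constant f(y): integrate the kernel over theta (trapezoid rule)
marginal_lik = np.trapz(kernel, theta)
posterior = kernel / marginal_lik             # proper density, integrates to 1

print("marginal likelihood f(y):", marginal_lik)
print("posterior integrates to:", np.trapz(posterior, theta))
```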
4 Examples
4.1 Mean of iid normal random vector with known variance
Suppose yₜ is iid N(μ, Σ) with Σ known but μ unknown.
Prior: p(μ) ~ N(μ₀, Λ₀). (For the univariate case, interpret Λ₀⁻¹ as the prior precision.)
p(μ|μ₀, Λ₀) ∝ exp[−(1/2)(μ − μ₀)′Λ₀⁻¹(μ − μ₀)]
Likelihood:
f(y|μ, Σ) ∝ exp[−(1/2) ∑ₜ (yₜ − μ)′Σ⁻¹(yₜ − μ)]
Posterior kernel: multiply the prior and likelihood.
p(μ|y, μ₀, Λ₀, Σ) ∝ f(y|μ, Σ) p(μ|μ₀, Λ₀)
= exp{−(1/2)[(μ − μ₀)′Λ₀⁻¹(μ − μ₀) + ∑ₜ(yₜ − μ)′Σ⁻¹(yₜ − μ)]}   (1)
Notice that this is a quadratic within an exponential, making it a Gaussian kernel.
It follows that we can express the posterior kernel as
p(μ|y, μ₀, Λ₀, Σ) ∝ exp[−(1/2)(μ − μ₁)′Λ₁⁻¹(μ − μ₁)]   (2)
To find expressions for μ₁, Λ₁, expand (1) and equate powers with (2).
(μ − μ₀)′Λ₀⁻¹(μ − μ₀) = μ′Λ₀⁻¹μ + μ₀′Λ₀⁻¹μ₀ − μ′Λ₀⁻¹μ₀ − μ₀′Λ₀⁻¹μ
∑ₜ(yₜ − μ)′Σ⁻¹(yₜ − μ) = ∑ₜ[yₜ′Σ⁻¹yₜ + μ′Σ⁻¹μ − μ′Σ⁻¹yₜ − yₜ′Σ⁻¹μ]
Notice that
∑ₜ μ′Σ⁻¹μ = Tμ′Σ⁻¹μ
∑ₜ μ′Σ⁻¹yₜ = Tμ′Σ⁻¹ȳ
∑ₜ yₜ′Σ⁻¹μ = Tȳ′Σ⁻¹μ
where ȳ is the sample mean. Collecting terms in μ,
(μ − μ₀)′Λ₀⁻¹(μ − μ₀) + ∑ₜ(yₜ − μ)′Σ⁻¹(yₜ − μ)
= μ′[Λ₀⁻¹ + TΣ⁻¹]μ − μ′[Λ₀⁻¹μ₀ + TΣ⁻¹ȳ] − [μ₀′Λ₀⁻¹ + Tȳ′Σ⁻¹]μ
+ terms not involving μ
Next, equate powers with terms in (2):
μ′Λ₁⁻¹μ = μ′[Λ₀⁻¹ + TΣ⁻¹]μ  ⟹  Λ₁ = [Λ₀⁻¹ + TΣ⁻¹]⁻¹
μ′Λ₁⁻¹μ₁ = μ′[Λ₀⁻¹μ₀ + TΣ⁻¹ȳ]
Λ₁⁻¹μ₁ = Λ₀⁻¹μ₀ + TΣ⁻¹ȳ
μ₁ = Λ₁[Λ₀⁻¹μ₀ + TΣ⁻¹ȳ]
= [Λ₀⁻¹ + TΣ⁻¹]⁻¹[Λ₀⁻¹μ₀ + TΣ⁻¹ȳ]
Interpretation:
The posterior mean is a variance-weighted average of the prior mean and the sample average.
The relative weights depend on the prior variance Λ₀, the sample size T, and the variance of yₜ (i.e. Σ).
If the prior is diffuse (Λ₀⁻¹ = 0), the posterior mean simplifies to
μ₁ = [TΣ⁻¹]⁻¹[TΣ⁻¹ȳ] = ȳ
Otherwise the prior pulls μ₁ away from ȳ toward μ₀.
Hence the posterior mean is biased. (Is that a bad thing?)
Similarly, the posterior precision Λ₁⁻¹ is the sum of the prior and sample precisions.
If the prior is diffuse, the posterior variance simplifies to
Λ₁ = [TΣ⁻¹]⁻¹ = T⁻¹Σ,
the variance of the sample mean.
If the prior adds information (Λ₀⁻¹ ≠ 0), it reduces the posterior variance.
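As a quick numerical check of the updating formulas above, here is a small numpy sketch (the covariance, sample mean, and prior values are illustrative assumptions). It computes Λ₁ and μ₁ and shows that a nearly diffuse prior pushes μ₁ toward ȳ, while a tight prior pulls it toward μ₀:

```python
import numpy as np

def posterior_mean_cov(ybar, T, Sigma, mu0, Lambda0):
    """Posterior N(mu1, Lambda1) for the mean of iid N(mu, Sigma) data."""
    Lambda1_inv = np.linalg.inv(Lambda0) + T * np.linalg.inv(Sigma)
    Lambda1 = np.linalg.inv(Lambda1_inv)
    mu1 = Lambda1 @ (np.linalg.inv(Lambda0) @ mu0 + T * np.linalg.inv(Sigma) @ ybar)
    return mu1, Lambda1

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # known data covariance (illustrative)
ybar = np.array([0.8, -0.5])                 # sample mean of T observations
T = 50
mu0 = np.zeros(2)                            # prior mean

mu1_tight, _ = posterior_mean_cov(ybar, T, Sigma, mu0, Lambda0=0.01 * np.eye(2))
mu1_diffuse, _ = posterior_mean_cov(ybar, T, Sigma, mu0, Lambda0=1e6 * np.eye(2))
print("tight prior pulls toward mu0:   ", mu1_tight)
print("diffuse prior is close to ybar: ", mu1_diffuse)
```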
4.2 Variance of iid normal random vector with known mean
Now suppose yₜ is iid N(μ, Σ) with μ = μ₀ known but Σ unknown.
Prior: for a Gaussian likelihood, the conjugate prior is inverse Wishart,
p(Σ) = IW(S₀⁻¹, ν₀)
∝ |Σ|^(−(ν₀+n+1)/2) exp{−(1/2)tr(S₀Σ⁻¹)}
where ν₀ is the prior degrees of freedom, S₀ is the prior scale or sum-of-squares matrix, and n = dim(yₜ).
Conjugate: the posterior has the same functional form as the prior;
hence, the functional form of the posterior is known;
consequently, you just need to update its sufficient statistics.
The Wishart is a multivariate generalization of a chi-square.
Sufficient statistics for the IW posterior:
ν₁ = ν₀ + T,   S₁ = S₀ + ∑ₜ(yₜ − μ₀)(yₜ − μ₀)′
Deriving the updating rule:
Start with the log prior,
log p(Σ) = −(1/2)(ν₀ + n + 1) log|Σ| − (1/2)tr(S₀Σ⁻¹)
Add to it the Gaussian log likelihood,
log f(y|μ₀, Σ) = −(T/2) log|Σ| − (1/2)∑ₜ(yₜ − μ₀)′Σ⁻¹(yₜ − μ₀)
The result is the log posterior kernel,
log p(Σ|y, μ₀) = −(1/2)(T + ν₀ + n + 1) log|Σ| − (1/2)tr(S₀Σ⁻¹) − (1/2)∑ₜ(yₜ − μ₀)′Σ⁻¹(yₜ − μ₀)
The last term on the right-hand side is
∑ₜ(yₜ − μ₀)′Σ⁻¹(yₜ − μ₀) = tr[∑ₜ(yₜ − μ₀)′Σ⁻¹(yₜ − μ₀)]
= tr[Σ⁻¹ ∑ₜ(yₜ − μ₀)(yₜ − μ₀)′]
= tr[Σ⁻¹S]
where
S ≡ ∑ₜ(yₜ − μ₀)(yₜ − μ₀)′
Hence the log posterior kernel is
log p(Σ|y, μ₀) = −(1/2)(T + ν₀ + n + 1) log|Σ| − (1/2)tr(S₀Σ⁻¹) − (1/2)tr(SΣ⁻¹)
= −(1/2)(T + ν₀ + n + 1) log|Σ| − (1/2)tr[Σ⁻¹(S₀ + S)]
which is the kernel of an inverse Wishart with scale S₀ + S and degrees of freedom ν₀ + T.
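The conjugate update for Σ amounts to two lines of bookkeeping. Below is an illustrative sketch (simulated data, assumed prior values) that forms ν₁ and S₁ and draws Σ from the posterior using scipy's inverse-Wishart:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

# Simulated data with a known mean mu0 (illustrative values)
n, T = 2, 200
mu0 = np.zeros(n)
Sigma_true = np.array([[1.0, 0.5], [0.5, 2.0]])
y = rng.multivariate_normal(mu0, Sigma_true, size=T)

# Prior sufficient statistics (assumed)
nu0 = n + 2                     # prior degrees of freedom
S0 = np.eye(n)                  # prior scale / sum-of-squares matrix

# Posterior sufficient statistics: nu1 = nu0 + T, S1 = S0 + sum_t (y_t - mu0)(y_t - mu0)'
resid = y - mu0
S1 = S0 + resid.T @ resid
nu1 = nu0 + T

Sigma_draw = invwishart(df=nu1, scale=S1).rvs(random_state=rng)
print("posterior mean of Sigma:\n", S1 / (nu1 - n - 1))
print("one posterior draw:\n", Sigma_draw)
```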
4.3 iid normal random vector with unknown mean and variance
This can be derived analytically, leading to a matrix-variate t distribution (e.g. see Bauwens et al.).
But I want to describe an alternative approach that will be used later when discussing Gibbs sampling.
The Gibbs sampler uses a model's conditional densities to simulate its joint density.
For the joint distribution p(μ, Σ|y), the conditionals are p(μ|Σ, y) and p(Σ|μ, y).
The two previous examples describe each of the conditionals.
The Gibbs sampler simulates the joint posterior by drawing sequentially from
the conditionals (more later).
Thus we can decompose a complicated probability model into a collection of simple
ones. We will use this idea a lot.
4.4 A Regression Model with a Known Innovation Variance
yₜ = xₜ′β + εₜ
xₜ is strictly exogenous and εₜ is N(0, σ²).
β is unknown, but var(εₜ) = σ² is known.
Prior: β ~ N(β₀, V₀),
p(β) ∝ |V₀|^(−1/2) exp[−(1/2)(β − β₀)′V₀⁻¹(β − β₀)]
Likelihood: yₜ|xₜ is conditionally Gaussian,
f(y|β, σ²) ∝ exp[−(1/2)∑ₜ(yₜ − xₜ′β)²/σ²]
Posterior: By following steps analogous to those in example 4.1, one can show
p(β|y, σ²) = N(β₁, V₁)
where
V₁ = (σ⁻²X′X + V₀⁻¹)⁻¹
β₁ = V₁(σ⁻²X′y + V₀⁻¹β₀)
Once again, the posterior mean is a precision-weighted average of prior and sample information.
As the prior precision V₀⁻¹ → 0, this converges to the usual OLS result.
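A small numpy sketch of this posterior (simulated data and an assumed N(0, I) prior, so the names beta0 and V0 are illustrative): it forms V₁ and β₁ and compares the posterior mean with OLS, the diffuse-prior limit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative regression data with a known error variance sigma2
T, k = 100, 3
X = rng.normal(size=(T, k))
beta_true = np.array([1.0, -0.5, 0.25])
sigma2 = 0.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=T)

# Assumed prior: beta ~ N(beta0, V0)
beta0, V0 = np.zeros(k), np.eye(k)

# Posterior N(beta1, V1): precision-weighted combination of prior and data
V1 = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(V0))
beta1 = V1 @ (X.T @ y / sigma2 + np.linalg.inv(V0) @ beta0)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print("posterior mean beta1:", beta1)
print("OLS (diffuse-prior limit):", beta_ols)
```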
4.5 A Regression Model with Known β
Same model, but now the innovation variance is unknown and the conditional mean
parameters are known.
This collapses to estimating the variance for a univariate iid process whose mean is known to be zero.
The likelihood is
f(y|β, σ²) ∝ (σ²)^(−T/2) exp{−∑ₜ(yₜ − xₜ′β)²/(2σ²)}
Prior: the inverse-gamma density is conjugate to a normal likelihood,
p(σ²) ∝ (σ²)^(−(δ₀+2)/2) exp(−δ₀s₀²/(2σ²))
where δ₀ represents the prior degrees of freedom and s₀² is a prior scale parameter.
To find the posterior kernel, multiply the prior and likelihood. After some algebra,
p(σ²|y, β) ∝ (σ²)^(−(δ₀+T+2)/2) exp{−(δ₀s₀² + SSE)/(2σ²)}
where
SSE = ∑ₜ(yₜ − xₜ′β)².
The posterior degrees of freedom are the sum of the prior degrees of freedom and the number of observations,
δ₁ = δ₀ + T.
The posterior sum of squares is the prior sum of squares plus the sample sum of squares,
δ₁s₁² = δ₀s₀² + SSE.
To find the posterior, we just need to update the scale and degrees-of-freedom parameters.
4.6 A Regression Model with Unknown β and σ²
Use the Gibbs idea along with the results of the previous two subsections.
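Here is a minimal Gibbs-sampler sketch for this case (simulated data; the prior values are illustrative assumptions). It alternates between the normal conditional for β from section 4.4 and the inverse-gamma conditional for σ² from section 4.5:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data
T, k = 200, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.8, size=T)

# Assumed priors: beta ~ N(beta0, V0); sigma2 inverse gamma with dof delta0 and scale s0sq
beta0, V0 = np.zeros(k), 10.0 * np.eye(k)
delta0, s0sq = 4.0, 1.0
V0inv = np.linalg.inv(V0)

n_draws = 2000
draws_beta = np.zeros((n_draws, k))
draws_sig2 = np.zeros(n_draws)
sigma2 = 1.0                                         # initial value

for i in range(n_draws):
    # beta | sigma2, y ~ N(beta1, V1): the conditional from section 4.4
    V1 = np.linalg.inv(X.T @ X / sigma2 + V0inv)
    beta1 = V1 @ (X.T @ y / sigma2 + V0inv @ beta0)
    beta = rng.multivariate_normal(beta1, V1)

    # sigma2 | beta, y: inverse gamma with dof delta0 + T and scale delta0*s0sq + SSE (section 4.5)
    sse = np.sum((y - X @ beta) ** 2)
    shape = (delta0 + T) / 2.0
    ig_scale = (delta0 * s0sq + sse) / 2.0
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / ig_scale)  # inverse gamma via reciprocal of a gamma

    draws_beta[i], draws_sig2[i] = beta, sigma2

# Discard a burn-in and summarize
print("posterior means:", draws_beta[500:].mean(axis=0), draws_sig2[500:].mean())
```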
4.7 Multivariate normal model with Jeffreys prior and known variance
To reflect the absence of prior information about the conditional mean parameters, Jeffreys proposed
p(μ|Σ) ∝ |Σ|^(−(n+1)/2)
where n = dim(yₜ).
This can be interpreted as the limit of the conjugate normal prior as Λ₀⁻¹ → 0.
In this case, the posterior is
p(μ|y, Σ) ∝ |Σ|^(−(n+1)/2) exp[−(1/2)∑ₜ(yₜ − μ)′Σ⁻¹(yₜ − μ)]
= N(μ₁, Λ₁) with μ₁ = ȳ and Λ₁ = T⁻¹Σ.
This can be combined with a marginal prior for Σ, using the Gibbs idea.
4.8 VAR with Jeffreys prior (Zellner 1970)
yₜ = Xₜθ + εₜ
εₜ ~ N(0, Σ)
θ = constants plus VAR parameters
Prior:
p(θ, Σ) = p(θ|Σ)p(Σ)
p(Σ) = a marginal prior for Σ
p(θ|Σ) ∝ |Σ|^(−(n+1)/2), as in the previous subsection
Likelihood function: Gaussian.
Full conditionals:
p(Σ|y, θ) ∝ p(Σ)f(y|θ, Σ)
This is the same as for the covariance matrix of a multivariate normal random vector.
p(θ|y, Σ) ∝ p(θ|Σ)f(y|θ, Σ)
= |Σ|^(−(T+n+1)/2) exp{−(1/2)∑ₜ[yₜ − Xₜθ]′Σ⁻¹[yₜ − Xₜθ]}
This is the SUR criterion.
It follows that θ is conditionally normal with mean and variance
E(θ|y, Σ) = [X′(Σ⁻¹ ⊗ I)X]⁻¹ X′(Σ⁻¹ ⊗ I)y
V(θ|y, Σ) = [X′(Σ⁻¹ ⊗ I)X]⁻¹
where y and X represent the (stacked) left- and right-hand variables, respectively.
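The conditional mean and variance above can be evaluated directly with Kronecker products. The sketch below uses placeholder data, and the stacking convention y = vec(Y), X = Iₙ ⊗ Z is an assumption about the notation; it also checks that, with identical regressors in every equation, the GLS formula collapses to equation-by-equation OLS:

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder two-variable system estimated as a VAR(1): Y = Z B + E, one column per equation
T, n = 100, 2
Y_full = rng.normal(size=(T + 1, n))
Y, Ylag = Y_full[1:], Y_full[:-1]
Z = np.column_stack([np.ones(T), Ylag])          # constants plus lagged variables
Sigma = np.cov(Y, rowvar=False)                  # conditioning value for Sigma

# Stack the system: y = vec(Y), X = I_n kron Z, so theta = vec(B)
y = Y.flatten(order="F")
X = np.kron(np.eye(n), Z)
W = np.kron(np.linalg.inv(Sigma), np.eye(T))     # Sigma^{-1} kron I_T

V_theta = np.linalg.inv(X.T @ W @ X)             # V(theta | y, Sigma)
theta_mean = V_theta @ X.T @ W @ y               # E(theta | y, Sigma)

# With identical regressors in every equation, this equals equation-by-equation OLS
B_ols = np.linalg.solve(Z.T @ Z, Z.T @ Y)
print(np.allclose(theta_mean, B_ols.flatten(order="F")))
```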
4.9 Probability of a binomial random variable
Suppose yₜ takes on 2 values, with probability p and 1 − p, respectively.
Likelihood:
f(y|p) ∝ p^T₁ (1 − p)^T₂
where T₁ = number of draws of state 1, T₂ = number of draws of state 2, and T₁ + T₂ = T = total number of draws.
Prior: The conjugate prior for a binomial likelihood is a beta density,
p(p) ∝ p^(α₁₀ − 1) (1 − p)^(α₂₀ − 1)
To find the posterior kernel, multiply the likelihood and the prior,
p(p|y) ∝ p^(T₁ + α₁₀ − 1) (1 − p)^(T₂ + α₂₀ − 1)
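As a quick illustration of the conjugate update (the counts and beta prior parameters are assumed values for the example):

```python
from scipy.stats import beta

# Assumed counts and prior parameters (illustrative only)
T1, T2 = 37, 13          # draws of state 1 and state 2
a0, b0 = 2.0, 2.0        # beta prior parameters (alpha_10, alpha_20)

posterior = beta(a0 + T1, b0 + T2)   # conjugate update
print("posterior mean of p:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```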
4.10 Probabilities for a multinomial random variable
What if yₜ is discrete but takes on more than 2 values, with probabilities p₁, p₂, ..., pₙ, respectively?
Much like the previous case, but ...
likelihood is multinomial
conjugate prior is a Dirichlet density.
4.11 Transition Probabilities for a Markov-Switching Model
Suppose sₜ takes on 2 values, with transition probability matrix
P = [ p₁₁       1 − p₁₁ ]
    [ 1 − p₂₂    p₂₂    ]
Likelihood:
conditional on being in state i at time t, sₜ₊₁ is a binomial random variable.
By factoring the joint density for the state history into the product of conditionals (and ignoring the initial condition), the likelihood can be expressed as
f(s|p₁₁, p₂₂) ∝ p₁₁^T₁₁ (1 − p₁₁)^T₁₂ p₂₂^T₂₂ (1 − p₂₂)^T₂₁
where Tᵢⱼ is the number of transitions from state i to state j.
Prior: p₁₁ and p₂₂ are independent beta random variables,
p(p₁₁, p₂₂) = p(p₁₁)p(p₂₂)
p(p₁₁) ∝ p₁₁^(α₁₁₀ − 1) (1 − p₁₁)^(α₁₂₀ − 1)
p(p₂₂) ∝ p₂₂^(α₂₂₀ − 1) (1 − p₂₂)^(α₂₁₀ − 1)
To find the posterior kernel, multiply prior and likelihood,
p(p₁₁, p₂₂|s) ∝ p₁₁^(T₁₁ + α₁₁₀ − 1) (1 − p₁₁)^(T₁₂ + α₁₂₀ − 1) p₂₂^(T₂₂ + α₂₂₀ − 1) (1 − p₂₂)^(T₂₁ + α₂₁₀ − 1)
∝ p(p₁₁|s)p(p₂₂|s)
Hence the posterior is also a product of independent beta densities.
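A short sketch of the counting-and-updating step (the state sequence and prior parameters are placeholders, not from the notes): it tallies the transition counts Tᵢⱼ and draws the staying probabilities from their independent beta posteriors:

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder two-state sequence (states coded 0 and 1); in practice s comes from the model
s = (rng.random(500) > 0.5).astype(int)

# Count transitions T_ij from state i to state j
counts = np.zeros((2, 2))
for prev, nxt in zip(s[:-1], s[1:]):
    counts[prev, nxt] += 1

# Assumed independent beta priors on the staying probabilities p11 and p22
a11, b11 = 2.0, 1.0
a22, b22 = 2.0, 1.0

# Posterior draws: beta(T_ii + a, T_ij + b) for each row of the transition matrix
p11 = rng.beta(counts[0, 0] + a11, counts[0, 1] + b11)
p22 = rng.beta(counts[1, 1] + a22, counts[1, 0] + b22)
print("draws of the staying probabilities:", p11, p22)
```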
4.12 Markov-switching models with more than 2 states
Follow the same steps as above, but use multinomial-Dirichlet probability models.
5 Examples of objects that Bayesians want to compute
posterior mode: argmax_θ p(θ|y) (analogous to MLE)
posterior mean:
E(θ|y) = ∫ θ p(θ|y) dθ
posterior variance:
V(θ|y) = ∫ [θ − E(θ|y)][θ − E(θ|y)]′ p(θ|y) dθ
other posterior moments, expected loss:
E[g(θ)|y] = ∫ g(θ) p(θ|y) dθ
Credible sets (confidence intervals)
P(θ ∈ S | y) = ∫_S p(θ|y) dθ
(The set S can be adjusted to deliver a pre-specified probability.)
normalizing constants (marginal likelihood):
f(y) = ∫ f(y|θ)p(θ) dθ
Bayes factors (analogous to a likelihood ratio):
B₁₂ = f(y|M₁) / f(y|M₂) = ∫ f(y|θ₁, M₁)p(θ₁) dθ₁ / ∫ f(y|θ₂, M₂)p(θ₂) dθ₂
(Remark on plug-in v. marginalization: the latter accounts better for parameter uncertainty.)
posterior odds ratios (model comparison):
P(M₁|y) / P(M₂|y) = [P(M₁) / P(M₂)] × [f(y|M₁) / f(y|M₂)] = prior odds × Bayes factor
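In practice most of these objects are approximated from posterior draws rather than by analytical integration. Here is a minimal sketch with artificial draws (the N(1.3, 0.4²) "posterior" is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Pretend these are draws from some posterior p(theta | y) (illustrative N(1.3, 0.4^2) draws)
draws = rng.normal(loc=1.3, scale=0.4, size=10_000)

post_mean = draws.mean()                          # E(theta | y)
post_var = draws.var()                            # V(theta | y)
expected_loss = np.mean((draws - 1.0) ** 2)       # E[g(theta) | y] for g = squared loss around 1
cred_set = np.percentile(draws, [2.5, 97.5])      # equal-tailed 95% credible set

print(post_mean, post_var, expected_loss, cred_set)
```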
6 Bayesian computations
6.1 Old School (Zellner 1970)
Form conjugate prior and likelihood.
Derive analytical expression for posterior distribution.
Evaluate integrals analytically.
Some interesting models can be formulated this way, but this is a restrictive class.
6.2 New Wave
Can break out of conjugate families by resorting to asymptotic approximations and numerical methods.
How to do that will be our main subject over the next few lectures.