
Bayes Factors, posterior predictives, short intro to RJMCMC

© Dave Campbell 2016

Thermodynamic Integration

Bayesian Statistical Inference

P(θ∣Y )∝ P(Y∣θ)π (θ)

Once you have posterior samples you can compute the posterior predictive distribution of future observations:

P(Ynew∣Yold) = ∫ P(Ynew∣θ) P(θ∣Yold) dθ

To do this you sample a θ* from P(θ∣Y)

(Sample 1 value from your collection of posterior samples)

Generate simulated data from the likelihood: P(Ynew∣θ*)

Repeat for a large sample of θ* from P(θ∣Y) to get at the posterior predictive distribution
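The sampling recipe above can be sketched in a few lines. This is a minimal illustration, assuming a normal likelihood Ynew ~ N(θ, σ²) and a list of posterior draws already produced by some MCMC run; the function name and the stand-in posterior draws are hypothetical and are not taken from the course code.

```python
import random

def posterior_predictive_draws(posterior_samples, n_draws, sigma=1.0, seed=0):
    """Draw Ynew by pushing posterior samples through the likelihood.

    Assumes, for illustration only, the likelihood Ynew ~ N(theta, sigma^2).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        theta_star = rng.choice(posterior_samples)  # 1 value from the posterior samples
        draws.append(rng.gauss(theta_star, sigma))  # simulate data from the likelihood
    return draws

# Stand-in for MCMC output: pretend these draws came from P(theta | Y).
_rng = random.Random(42)
theta_samples = [_rng.gauss(2.0, 0.3) for _ in range(5000)]

y_new = posterior_predictive_draws(theta_samples, n_draws=10000)
pred_mean = sum(y_new) / len(y_new)
```

The spread of `y_new` reflects both the likelihood noise and the posterior uncertainty in θ, whatever shape the posterior has.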

Posterior predictive distribution:

No need to use asymptotic normal assumptions or a single point and variance estimate for P(θ∣Y)

Any shaped distribution on θ* naturally feeds its entire distribution through to the data generating process!

Obtaining P(Ynew∣Yold) is related to obtaining a set of fake data samples for parametric bootstrap, except that the distribution assumption on the parameters doesn't require asymptotic arguments.

Uses:

Another diagnostic tool: obtain a sample from P(Ynew∣Yold) and see if it is similar to the data.

Use the posterior predictive distribution for sequential experimental design: choose the new covariate points that optimize some criterion.

Hypothesis testing; model comparison

Ultimately we want inference on

$$P(M \mid Y) = \frac{\int_\Theta P(Y \mid \theta, M)\,\pi(\theta)\,\pi(M)\,d\theta}{P(Y)}$$

But computing the marginal likelihood is difficult.

Usually Bayesians make model decisions through Bayes Factors

$$B_{12}(y) = \frac{w_1(y)}{w_2(y)}, \qquad w(y) = \int_\Theta \pi(\theta)\, f(y \mid \theta)\, d\theta$$

Bayes Factor interpretation


The odds ratio for two models:

posterior odds = Bayes Factor × prior odds

$$\frac{P(M_1 \mid y)}{P(M_2 \mid y)} = B_{12}(y) \times \frac{\pi(M_1)}{\pi(M_2)}$$

Uniform prior odds across models implies that

posterior odds = Bayes Factor

So the Bayes factor is the amount of evidence for one model compared to another.

Bf = the change in odds when moving from the prior to the posterior
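As a small numerical illustration of "posterior odds = Bayes Factor × prior odds", the sketch below turns log marginal likelihoods log w_k(y) into posterior model probabilities. The helper names and the two log marginal likelihood values are made up for the example; working on the log scale is a choice made here for numerical stability.

```python
import math

def posterior_model_probs(log_marginals, prior_probs=None):
    """Posterior model probabilities from log marginal likelihoods log w_k(y)."""
    k = len(log_marginals)
    if prior_probs is None:
        prior_probs = [1.0 / k] * k  # uniform prior odds across models
    log_post = [lw + math.log(p) for lw, p in zip(log_marginals, prior_probs)]
    shift = max(log_post)  # subtract the max before exponentiating
    weights = [math.exp(lp - shift) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up log marginal likelihoods, purely for illustration.
log_w = [-100.0, -103.0]
b12 = math.exp(log_w[0] - log_w[1])  # Bayes factor B12 = w1 / w2

probs_uniform = posterior_model_probs(log_w)
odds_uniform = probs_uniform[0] / probs_uniform[1]  # equals B12

probs_skewed = posterior_model_probs(log_w, prior_probs=[0.25, 0.75])
odds_skewed = probs_skewed[0] / probs_skewed[1]  # equals B12 * (0.25 / 0.75)
```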

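Recall:

$$P(\theta \mid Y) = \frac{P(Y \mid \theta)\,P(\theta)}{P(Y)}, \qquad P(Y) = \int P(Y \mid \theta)\,P(\theta)\,d\theta$$

Newton & Raftery (1994) rearranged this:

$$P(Y)\,\frac{P(\theta \mid Y)}{P(Y \mid \theta)} = P(\theta)$$

$$P(Y) \int \frac{P(\theta \mid Y)}{P(Y \mid \theta)}\,d\theta = \int P(\theta)\,d\theta = 1$$

$$E_{P(\theta \mid Y)}\left[\frac{1}{P(Y \mid \theta)}\right] = \frac{1}{P(Y)}$$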

And estimated P(Y) by

$$\hat{P}(Y) = \left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(Y \mid \theta_i)}\right]^{-1}$$

Compute this by calculating the likelihood for each value of θi that was obtained from the posterior sampling step.


The harmonic mean estimator is extremely sensitive to outliers with extremely small values of P(Y∣θ)

But it is asymptotically unbiased

Calderhead and Girolami (2009) showed that the harmonic mean estimator can be massively biased for finite samples
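A minimal sketch of the harmonic mean estimator, computed from log-likelihood values evaluated at posterior draws. The function names are hypothetical, and the log-sum-exp trick is a numerical-stability choice made here rather than something the slides prescribe. The second call shows how a single draw with a tiny likelihood takes over the estimate.

```python
import math

def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_harmonic_mean_estimate(log_liks):
    """log of [ (1/N) * sum_i 1 / P(y | theta_i) ]^(-1), from log-likelihoods."""
    n = len(log_liks)
    return math.log(n) - log_sum_exp([-ll for ll in log_liks])

# Toy likelihood values at two posterior draws (illustrative numbers only).
log_liks = [math.log(0.5), math.log(0.25)]
est = math.exp(log_harmonic_mean_estimate(log_liks))  # harmonic mean of 0.5 and 0.25

# Adding one draw with a tiny likelihood drags the estimate down by orders of magnitude.
with_outlier = log_liks + [math.log(1e-12)]
est_outlier = math.exp(log_harmonic_mean_estimate(with_outlier))
```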

Thermodynamic Integration

Friel, N., and Pettitt, A. "Marginal likelihood estimation via power posteriors." Journal of the Royal Statistical Society: Series B 70 (3) (2008)

Calderhead, B., and Girolami, M. "Estimating Bayes Factors Via Thermodynamic Integration and Population MCMC." Computational Statistics and Data Analysis 53 (2009)

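In Parallel Tempering we sample from

$$P_m(\theta \mid Y) = \frac{P(Y \mid \theta)^{\beta_m}\,P(\theta)}{P_m(Y)}$$

But we can get the marginal likelihood via:

$$\log p(Y) = \int_0^1 \int \log\left[p(Y \mid \theta)\right] P_\beta(\theta \mid Y)\,d\theta\,d\beta = \int_0^1 E_{P_\beta(\theta \mid Y)}\left\{\log p(Y \mid \theta)\right\}\,d\beta$$

where $P_\beta$ is the power posterior at temperature β (it equals $P_m$ when β = βm).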

Compute via 1-dimensional quadrature over the temperature!

$$\log p(Y) \approx \frac{1}{2}\sum_m \left(\beta_m - \beta_{m-1}\right)\left[E_m + E_{m-1}\right], \qquad E_m = E_{P_m(\theta \mid Y)}\left\{\log p(Y \mid \theta)\right\}$$

To compute log(marginal likelihoods) all we need is to define a good grid for temperatures

Calderhead and Girolami (2009) suggest

β = (seq(from = 1, to = N) / N)^5
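The trapezoid rule over the temperature grid can be sketched as follows. This is an illustration under stated assumptions, not the course code: the model is a toy conjugate one (y_i ~ N(θ, 1) with prior θ ~ N(0, 1)) chosen because its power posterior is normal and can be sampled exactly, standing in for the tempered chains. The grid also includes β = 0 (the prior) so that the quadrature covers the whole interval from 0 to 1; that inclusion is an assumption made here. All function names are hypothetical.

```python
import math
import random

def trapezoid_log_marginal(betas, e_log_lik):
    """Trapezoid rule: sum_m (beta_m - beta_{m-1}) * (E_m + E_{m-1}) / 2."""
    return sum(0.5 * (betas[m] - betas[m - 1]) * (e_log_lik[m] + e_log_lik[m - 1])
               for m in range(1, len(betas)))

def log_lik(y, theta):
    """log p(y | theta) for y_i ~ N(theta, 1)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)

def expected_log_lik(y, beta, n_samples, rng):
    """Monte Carlo estimate of E_m under the power posterior at temperature beta.

    With prior theta ~ N(0, 1) the power posterior is normal, so it is sampled
    exactly here; in general these draws would come from the tempered chains.
    """
    n, s = len(y), sum(y)
    var = 1.0 / (1.0 + beta * n)
    mean = beta * s * var
    draws = (rng.gauss(mean, math.sqrt(var)) for _ in range(n_samples))
    return sum(log_lik(y, th) for th in draws) / n_samples

rng = random.Random(1)
y = [0.5, 1.2, -0.3, 0.8, 1.0]  # toy data
N = 30
betas = [(i / N) ** 5 for i in range(N + 1)]  # assumption: beta_0 = 0 is included
E = [expected_log_lik(y, b, 2000, rng) for b in betas]
log_py_ti = trapezoid_log_marginal(betas, E)

# Closed-form log p(y) for this conjugate toy model, for comparison.
n, s, ss = len(y), sum(y), sum(yi ** 2 for yi in y)
log_py_exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n)
                - 0.5 * (ss - s ** 2 / (1 + n)))
```

The power-5 grid places most temperatures near β = 0, where the expected log-likelihood changes fastest.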

Parallel Tempering To the Extreme!

RStudio plots for the Galaxy data set (3 groups): density of one of the mean parameters vs temperature

Parallel Tempering densities

That dip just before temperature β = 1 is real. It is caused by the introduction of new modes.

Compare the 3 group Galaxy model to the 6 group Galaxy model.

Show plots of mean density vs temperature

25,000 iterations with 30 parallel chains

$$B_{12}(y) = \frac{w_1(y)}{w_2(y)}$$

Now, back to RStudio to compare the Galaxy data with k=3 groups vs k=6 groups.

(the result: there is decisive evidence that the k=3 groups model is better)

Alternative to Bayes Factors: RJMCMC

MODEL POSTERIOR PROBABILITY

Likelihood: P(Y∣θj, Mj)

Parameter Prior: P(θj∣Mj)

Model Prior: P(Mj∣Ω) for Mj ∈ Ω

The marginal posterior probability of a model is helpful when the answer is not clear

$$P(M_j \mid Y, \Omega) = \frac{\int P(Y \mid \theta_j, M_j, \Omega)\,P(\theta_j \mid M_j, \Omega)\,P(M_j \mid \Omega)\,d\theta_j}{P(Y)} = \int P(\theta_j, M_j \mid Y, \Omega)\,d\theta_j$$

Our goal is to get

$$P(M_j \mid Y, \Omega) = \int P(\theta_j, M_j \mid Y, \Omega)\,d\theta_j$$

in a single MCMC chain, even if Ω contains a lot of models.

We need simulation methods that sample across models.

REVERSIBLE JUMP MCMC

Green (1995), Biometrika, 82, 4, pp. 711-32

We can avoid extensive MCMC for each model and instead sample from P(Mj∣Y,Ω) directly!

We just adjust MCMC so at each iteration we:

1. Sample j, i.e. choose a model Mnew

2. Then propose a θnew from Mnew

3. Keep Mnew and θnew with probability

$$\alpha = \min\left(\frac{P(Y \mid \theta_{new}, M_{new})\,P(\theta_{new}, M_{new})\,P_{new}(v_{new})}{P(Y \mid \theta_{old}, M_{old})\,P(\theta_{old}, M_{old})\,P_{old}(v_{old})}\,J_{old,new},\ 1\right)$$

We use auxiliary variables v to augment the dimension space so that dim(Mold) = dim(Mnew), i.e. dim(θold, vold) = dim(θnew, vnew)

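JACOBIAN

We need the Jacobian for the transformation

$$J = \begin{bmatrix}
\frac{\partial \theta_{old,1}}{\partial \theta_{new,1}} & \cdots & \frac{\partial \theta_{old,1}}{\partial \theta_{new,p_{new}}} & \cdots & \frac{\partial \theta_{old,1}}{\partial v_{new}} \\
\vdots & \ddots & \vdots & & \vdots \\
\frac{\partial \theta_{old,p_{old}}}{\partial \theta_{new,1}} & \cdots & \frac{\partial \theta_{old,p_{old}}}{\partial \theta_{new,p_{new}}} & \cdots & \frac{\partial \theta_{old,p_{old}}}{\partial v_{new}} \\
\vdots & & \vdots & \ddots & \vdots \\
\frac{\partial v_{old}}{\partial \theta_{new,1}} & \cdots & \cdots & \cdots & \frac{\partial v_{old}}{\partial v_{new}}
\end{bmatrix}$$

And the proposed θnew values need to allow the possibility of being accepted.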
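To make the three steps and the auxiliary-variable idea concrete, here is a minimal reversible jump sketch on a deliberately tiny problem. Everything about the setup is an assumption for illustration: M1 is y_i ~ N(0, 1) with no free parameter, M2 is y_i ~ N(θ, 1) with prior θ ~ N(0, 1), the model priors are equal, and the jump M1 → M2 draws an auxiliary v from a normal proposal and sets θ = v, so the Jacobian is 1. The function names and tuning values are hypothetical. The toy is chosen because P(M2∣Y) has a closed form to check against.

```python
import math
import random

def log_norm_pdf(x, mean, sd):
    return -math.log(sd) - 0.5 * math.log(2 * math.pi) - 0.5 * ((x - mean) / sd) ** 2

def log_lik(y, theta):
    return sum(log_norm_pdf(yi, theta, 1.0) for yi in y)

def accept(log_alpha, rng):
    return rng.random() < math.exp(min(0.0, log_alpha))

def rjmcmc_prob_m2(y, n_iter=40000, aux_mean=0.5, aux_sd=1.0, seed=1):
    """Fraction of iterations spent in M2, an estimate of P(M2 | Y)."""
    rng = random.Random(seed)
    model, theta = 1, None
    in_m2 = 0
    for _ in range(n_iter):
        if rng.random() < 0.5:  # between-model move (always to the other model)
            if model == 1:
                v = rng.gauss(aux_mean, aux_sd)  # auxiliary variable; theta_new = v
                log_alpha = (log_lik(y, v) + log_norm_pdf(v, 0.0, 1.0)
                             - log_lik(y, 0.0) - log_norm_pdf(v, aux_mean, aux_sd))
                if accept(log_alpha, rng):
                    model, theta = 2, v
            else:
                # Reverse move: theta plays the role of the auxiliary variable in M1.
                log_alpha = (log_lik(y, 0.0) + log_norm_pdf(theta, aux_mean, aux_sd)
                             - log_lik(y, theta) - log_norm_pdf(theta, 0.0, 1.0))
                if accept(log_alpha, rng):
                    model, theta = 1, None
        elif model == 2:  # within-model random-walk update of theta
            prop = theta + rng.gauss(0.0, 0.5)
            log_alpha = (log_lik(y, prop) + log_norm_pdf(prop, 0.0, 1.0)
                         - log_lik(y, theta) - log_norm_pdf(theta, 0.0, 1.0))
            if accept(log_alpha, rng):
                theta = prop
        in_m2 += (model == 2)
    return in_m2 / n_iter

y = [0.5, 1.2, -0.3, 0.8, 1.0]  # toy data
p_m2_est = rjmcmc_prob_m2(y)

# Closed-form check for this toy problem (equal model priors).
n, s = len(y), sum(y)
log_bf21 = -0.5 * math.log(1 + n) + 0.5 * s ** 2 / (1 + n)
p_m2_exact = 1.0 / (1.0 + math.exp(-log_bf21))
```

In realistic problems the map between parameter spaces is not the identity, and the Jacobian term and the auxiliary proposal need much more care.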

POTENTIAL PROBLEMS

M1 and M2 have different parameter dimensions

Often model parameters don't have an obvious transformation allowing an intuitive transition

The last accepted value might be from a different model and may require a large jump in the parameter space.

M1: $Y \sim N(\beta_{1,0} + \beta_{1,1}X,\ \sigma_1^2)$

M2: $Y \sim N(\beta_{2,0} + \beta_{2,1}X + \beta_{2,2}X^2,\ \sigma_2^2)$

Moving from M1 to M2 will require moving β1,0 quite far to get to a reasonable location for β2,0

M1: Galaxy with 3 Gaussians

M2: Galaxy with 4 Gaussians

Moving from M1 to M2 can be done by dividing one of the current Gaussians. Moving from M2 to M1 can be done by merging 2 components

RJMCMC: Beautiful in principle, nasty in practice

Needs: transition function between parameters in multiple model spaces.

Efficiency depends completely on this functional choice and the distribution for the auxiliary variables.

Works well when we can use a birth/death process (change-point analysis).
