Bayes Factors, posterior predictives, short intro to RJMCMC
© Dave Campbell 2016
Thermodynamic Integration
Bayesian Statistical Inference
$P(\theta \mid Y) \propto P(Y \mid \theta)\,\pi(\theta)$
Once you have posterior samples you can compute the predictive distribution of future observations:
$P(Y_{new} \mid \theta, Y_{old})$
To do this you:
Sample a $\theta^*$ from $P(\theta \mid Y)$ (sample 1 value from your collection of posterior samples).
Generate simulated data from the likelihood: $P(Y_{new} \mid \theta^*)$
Repeat for a large sample of $\theta^*$ from $P(\theta \mid Y)$ to get at the posterior predictive distribution.
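The sampling recipe above can be sketched in Python. This is a minimal illustration, not the author's code: the function name `posterior_predictive_draws`, the Normal likelihood with unit variance, and the stand-in posterior samples are all assumptions made for the example. In practice `theta_samples` would be the output of your MCMC run.

```python
import numpy as np

def posterior_predictive_draws(theta_samples, simulate, n_draws, rng):
    """Draw theta* from the stored posterior samples, then simulate Y_new | theta*."""
    theta_samples = np.asarray(theta_samples)
    # Pick one stored posterior sample per predictive draw.
    idx = rng.integers(0, len(theta_samples), size=n_draws)
    return np.array([simulate(theta_samples[i], rng) for i in idx])

def simulate_normal(theta, rng):
    # Assumed likelihood for illustration: Y_new | theta ~ N(theta, 1).
    return rng.normal(theta, 1.0)

rng = np.random.default_rng(1)
# Stand-in for MCMC output (assumption): posterior samples of theta.
theta_samples = rng.normal(2.0, 0.5, size=5000)

y_new = posterior_predictive_draws(theta_samples, simulate_normal, 20000, rng)
```

The spread of `y_new` combines the posterior uncertainty in theta with the sampling noise of the likelihood, which is the point of feeding the whole posterior through the data generating process.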
Posterior predictive distribution:
No need to use asymptotic normal assumptions or a single point and variance estimate for $\theta^*$.
A distribution $P(\theta \mid Y)$ of any shape naturally feeds its entire distribution through to the data generating process!
Obtaining $P(Y_{new} \mid \theta, Y_{old})$ is related to obtaining a set of fake data samples for a parametric bootstrap, except that the distribution assumption on the parameters doesn't require asymptotic arguments.
Uses:
Another diagnostic tool: obtain a sample from $P(Y_{new} \mid \theta, Y_{old})$ and see if it is similar to the data.
Use the posterior predictive distribution for sequential experimental design: choose the new covariate points that optimize some criterion.
Hypothesis testing; model comparison
Ultimately we want inference on $P(M \mid Y)$.
But computing the marginal likelihood is difficult.
$P(M \mid Y) = \frac{\int_\Theta P(Y \mid \theta_M)\,\pi(\theta)\,\pi(M)\,d\theta}{P(Y)}$
Usually Bayesians make model decisions through Bayes Factors
$B_{12}(y) = \frac{w_1(y)}{w_2(y)}$
$w(y) = \int_\Theta \pi(\theta)\, f(y \mid \theta)\, d\theta$
Bayes Factor interpretation
The odds ratio for two models:
posterior odds = Bayes Factor × prior odds
Uniform prior odds across models implies that
posterior odds = Bayes Factor
So the Bayes factor is the amount of evidence for one model compared to another.
Bayes Factor = the change in odds when moving from the prior to the posterior
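A small worked example may help make the odds relationship concrete. The sketch below assumes a hypothetical coin-flip data set (k = 5 heads in n = 10 flips) and two illustrative models that are not from the slides: M1 fixes θ = 0.5, and M2 places a Uniform(0, 1) prior on θ. The marginal likelihood of M2 is computed with a simple midpoint rule; all function names are made up for the illustration.

```python
from math import comb

def marginal_likelihood_fixed(k, n, theta=0.5):
    """w1(y): the model has no free parameter, so no integral is needed."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def marginal_likelihood_uniform(k, n, grid=10000):
    """w2(y): integrate likelihood x Uniform(0, 1) prior with a midpoint rule."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += comb(n, k) * t**k * (1 - t) ** (n - k)
    return total * h

def posterior_odds(bayes_factor, prior_odds=1.0):
    """posterior odds = Bayes factor x prior odds."""
    return bayes_factor * prior_odds

k, n = 5, 10
w1 = marginal_likelihood_fixed(k, n)
w2 = marginal_likelihood_uniform(k, n)
b12 = w1 / w2
odds = posterior_odds(b12)        # uniform prior odds: equals the Bayes factor
prob_m1 = odds / (1.0 + odds)     # posterior probability of M1
```

With uniform prior odds, `odds` and `b12` coincide, which is the special case stated on the slide.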
Recall: $P(\theta \mid Y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$
$P(Y) = \int P(y \mid \theta)\,P(\theta)\,d\theta$
Newton & Raftery (1994)
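$P(\theta \mid Y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)}$
$P(Y)\,\frac{P(\theta \mid Y)}{P(y \mid \theta)} = P(\theta)$
$\int P(Y)\,\frac{P(\theta \mid Y)}{P(y \mid \theta)}\,d\theta = \int P(\theta)\,d\theta = 1$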
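$\int P(Y)\,\frac{P(\theta \mid Y)}{P(y \mid \theta)}\,d\theta = 1$
$E_{P(\theta \mid Y)}\left[\frac{1}{P(y \mid \theta)}\right] = \frac{1}{P(Y)}$
And estimate $P(Y)$ by
$\hat{P}(Y) = \left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(y \mid \theta_i)}\right]^{-1}$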
Compute this by calculating the likelihood for each value of $\theta_i$ that was obtained from the posterior sampling step.
The harmonic mean estimator is extremely sensitive to outliers with very small values of $P(y \mid \theta)$.
But it is asymptotically unbiased.
Calderhead and Girolami (2009) showed that the harmonic mean estimator can be massively biased for finite samples.
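As a rough sketch of the harmonic mean estimator, the function below takes the log-likelihood evaluated at each posterior sample and returns the log of the estimate. Working on the log scale with a max-shift is an implementation choice made here to avoid overflow, not something specified in the slides; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def log_harmonic_mean_evidence(log_liks):
    """Return log P_hat(Y) = -log( (1/N) * sum_i 1 / P(y | theta_i) )."""
    neg = -np.asarray(log_liks, dtype=float)  # log of 1 / P(y | theta_i)
    shift = neg.max()                         # max-shift for numerical stability
    log_mean_inverse = shift + np.log(np.mean(np.exp(neg - shift)))
    return -log_mean_inverse

# Two samples with likelihoods 0.5 and 0.25: the harmonic mean is 1/3.
simple = log_harmonic_mean_evidence(np.log([0.5, 0.25]))

# Sensitivity illustration (made-up numbers): one sample with a tiny
# likelihood among 999 ordinary ones dominates the whole estimate.
log_liks = np.full(1000, -10.0)
log_liks[-1] = -50.0
dominated = log_harmonic_mean_evidence(log_liks)
```

In the second case a single draw out of 1000 moves the estimate far from the value of about -10 that the other 999 samples would give, which illustrates the outlier sensitivity described above.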
Thermodynamic Integration
Friel, N., Pettitt, A. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B 70 (3).
Calderhead, B., Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis 53.
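In Parallel Tempering we sample from
$P_m(\theta \mid Y) = \frac{P(y \mid \theta)^{\beta_m}\,P(\theta)}{P_m(y)}$
But we can get the marginal likelihood via:
$\log(p(Y)) = \int_0^1 \int \log[p(Y \mid \theta)]\,P_m(\theta \mid Y)\,d\theta\,d\beta$
$\log(p(Y)) = \int_0^1 E_{P_m(\theta \mid Y)}\{\log[p(Y \mid \theta)]\}\,d\beta$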
Compute via 1-dimensional quadrature over the temperature!
$\log(p(Y)) \approx \frac{1}{2}\sum_m (\beta_m - \beta_{m-1})\,[E_m + E_{m-1}]$
$E_m = E_{P_m(\theta \mid Y)}\{\log[p(Y \mid \theta)]\}$
To compute log(marginal likelihoods) all we need is to define a good grid for temperatures
Calderhead and Girolami (2009) suggest
$\beta = \left(\frac{\mathrm{seq}(\mathrm{from} = 1,\ \mathrm{to} = N)}{N}\right)^5$
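The quadrature step can be sketched as follows. This is an illustration under stated assumptions: the grid here also includes β = 0 (the prior) so that the trapezoid rule covers the whole interval [0, 1], which goes beyond the grid as written on the slide, and the function names are hypothetical. In a real run each `E_m` would be the average of log p(Y | θ) over the samples from the chain at temperature β_m.

```python
import numpy as np

def temperature_grid(n_temps, power=5):
    """Power-law grid beta_m = (m / N)^power for m = 0, ..., N.

    Including m = 0 is an assumption made so the integral starts at beta = 0.
    """
    return (np.arange(n_temps + 1) / n_temps) ** power

def thermodynamic_log_evidence(betas, expected_log_liks):
    """Trapezoid rule: 0.5 * sum_m (beta_m - beta_{m-1}) * (E_m + E_{m-1})."""
    betas = np.asarray(betas, dtype=float)
    e = np.asarray(expected_log_liks, dtype=float)
    order = np.argsort(betas)  # make sure temperatures are increasing
    betas, e = betas[order], e[order]
    return 0.5 * np.sum(np.diff(betas) * (e[1:] + e[:-1]))

betas = temperature_grid(30)
# Made-up expectations that are linear in beta, so the exact integral over
# [0, 1] is known (it equals -2) and the trapezoid rule reproduces it.
fake_expectations = 2.0 * betas - 3.0
log_evidence = thermodynamic_log_evidence(betas, fake_expectations)
```

The fifth-power grid places most temperatures near β = 0, where the expected log-likelihood typically changes fastest.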
Parallel Tempering To the Extreme!
R Studio plots for the Galaxy data set (3 groups, density of one of the mean parameters vs temperature)
Parallel Tempering densities
That dip just before temperature β = 1 is real. It is caused by the introduction of new modes.
Compare the 3 group Galaxy to the 6 group galaxy.
Show plots of mean density vs temperature
25,000 iterations with 30 parallel chains
$B_{12}(y) = \frac{w_1(y)}{w_2(y)}$
Now, back to RStudio to compare the Galaxy data with k=3 groups vs k=6 groups.
(the result: there is decisive evidence that the k=3 groups model is better)
Alternative to Bayes Factors: RJMCMC
MODEL POSTERIOR PROBABILITY
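Likelihood: $P(Y \mid \theta_j, M_j)$
Parameter Prior: $P(\theta_j \mid M_j)$
Model Prior: $P(M_j \mid \Omega)$ for $M_j \in \Omega$
The marginal posterior probability of a model is helpful when the answer is not clear
$P(M_j \mid Y, \Omega) = \frac{\int P(Y \mid \theta_j, M_j, \Omega)\,P(\theta_j \mid M_j, \Omega)\,P(M_j \mid \Omega)\,d\theta_j}{P(Y)}$
$P(M_j \mid Y, \Omega) = \int P(\theta_j, M_j \mid Y, \Omega)\,d\theta_j$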
Our goal is to get $P(M_j \mid Y, \Omega) = \int P(\theta_j, M_j \mid Y, \Omega)\,d\theta_j$ in a single MCMC chain, even if $\Omega$ contains a lot of models.
We need simulation methods that sample across models.
REVERSIBLE JUMP MCMC
We can avoid extensive MCMC for each model and instead sample from $P(M_j \mid Y, \Omega)$ directly!
We just adjust MCMC so at each iteration we:
1. Sample j, i.e. choose a model $M_{new}$
2. Then propose a $\theta_{new}$ from $M_{new}$
3. Keep $M_{new}$ and $\theta_{new}$ with probability
$\alpha = \min\left(\frac{P(Y \mid \theta_{new}, M_{new})\,P(\theta_{new}, M_{new})\,P_{new}(v_{new})}{P(Y \mid \theta_{old}, M_{old})\,P(\theta_{old}, M_{old})\,P_{old}(v_{old})}\,J_{old,new},\ 1\right)$
Biometrika (1995), 82, 4, pp. 711-32
We use auxiliary variables $v$ to augment the dimension space so that dim($M_{old}$) = dim($M_{new}$)
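JACOBIAN
We need the Jacobian for the transformation
$$
J = \begin{bmatrix}
\frac{\partial \theta_{old,1}}{\partial \theta_{new,1}} & \cdots & \frac{\partial \theta_{old,1}}{\partial \theta_{new,p_{new}}} & \cdots & \frac{\partial \theta_{old,1}}{\partial v_{new}} \\
\vdots & \ddots & \vdots & & \vdots \\
\frac{\partial \theta_{old,p_{old}}}{\partial \theta_{new,1}} & \cdots & \frac{\partial \theta_{old,p_{old}}}{\partial \theta_{new,p_{new}}} & \cdots & \frac{\partial \theta_{old,p_{old}}}{\partial v_{new}} \\
\vdots & & \vdots & \ddots & \vdots \\
\frac{\partial v_{old}}{\partial \theta_{new,1}} & \cdots & \cdots & \cdots & \frac{\partial v_{old}}{\partial v_{new}}
\end{bmatrix}
$$
And the proposed $\theta_{new}$ values need to allow the possibility of being accepted.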
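To make the pieces (model indicator, auxiliary variable, Jacobian) concrete, here is a toy reversible jump sampler. Everything about the setup is an assumption chosen for illustration and is not from the slides: M1 says Y ~ N(0, 1) with no free parameter, M2 says Y ~ N(μ, 1) with prior μ ~ N(0, τ²), the model prior is uniform, and the auxiliary variable v is drawn from N(mean(y), 1/n) and mapped to μ by the identity, so the Jacobian is 1. The data values are made up.

```python
import numpy as np

def log_norm_pdf(x, mean, var):
    """Log density of N(mean, var) at x (works elementwise on arrays)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def rjmcmc_toy(y, n_iter=20000, tau2=1.0, seed=0):
    """Fraction of iterations spent in M2, an estimate of P(M2 | Y)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    q_mean, q_var = y.mean(), 1.0 / len(y)  # assumed proposal for the auxiliary v

    def log_joint(model, mu):
        # log[ P(Y | theta, M) P(theta | M) P(M) ] with P(M1) = P(M2) = 1/2
        if model == 1:
            return log_norm_pdf(y, 0.0, 1.0).sum() + np.log(0.5)
        return (log_norm_pdf(y, mu, 1.0).sum()
                + log_norm_pdf(mu, 0.0, tau2) + np.log(0.5))

    model, mu, visits_m2 = 1, 0.0, 0
    for _ in range(n_iter):
        if rng.uniform() < 0.5:
            # Between-model move. The map mu = v is the identity, so |J| = 1.
            if model == 1:
                v = rng.normal(q_mean, np.sqrt(q_var))
                log_alpha = (log_joint(2, v) - log_joint(1, 0.0)
                             - log_norm_pdf(v, q_mean, q_var))
                if np.log(rng.uniform()) < log_alpha:
                    model, mu = 2, v
            else:
                # Reverse move: the current mu plays the role of the auxiliary v.
                log_alpha = (log_joint(1, 0.0) + log_norm_pdf(mu, q_mean, q_var)
                             - log_joint(2, mu))
                if np.log(rng.uniform()) < log_alpha:
                    model, mu = 1, 0.0
        elif model == 2:
            # Within-model random-walk Metropolis update of mu.
            prop = mu + rng.normal(0.0, 0.5)
            if np.log(rng.uniform()) < log_joint(2, prop) - log_joint(2, mu):
                mu = prop
        visits_m2 += (model == 2)
    return visits_m2 / n_iter

y = np.array([0.2, 1.1, 0.4, 0.9, 0.4])  # made-up data
p_m2_hat = rjmcmc_toy(y)

# Closed-form check, available only because this toy model is conjugate.
n, s = len(y), y.sum()
bf21 = np.exp(-0.5 * np.log(1.0 + n) + 0.5 * s**2 / (1.0 + n))
p_m2_exact = bf21 / (1.0 + bf21)
```

The share of iterations the chain spends in each model estimates the model posterior probabilities from a single chain. The auxiliary proposal was chosen here to sit near the likelihood peak so that jumps are accepted often; a poorly matched proposal would leave the chain stuck in one model.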
POTENTIAL PROBLEMS
M1 and M2 have different parameter dimensions
Often model parameters don't have an obvious transformation allowing an intuitive transition
The last accepted value might be from a different model and may require a large jump in the parameter space.
M1: $Y \sim N(\beta_{1,0} + \beta_{1,1}X,\ \sigma_1^2)$
M2: $Y \sim N(\beta_{2,0} + \beta_{2,1}X + \beta_{2,2}X^2,\ \sigma_2^2)$
Moving from M1 to M2 will require moving $\beta_{1,0}$ quite far to get to a reasonable location for $\beta_{2,0}$
M1: Galaxy with 3 Gaussians
M2: Galaxy with 4 Gaussians
Moving from M1 to M2 can be done by dividing one of the current Gaussians. Moving from M2 to M1 can be done by merging 2 components.
RJMCMC: Beautiful in principle, nasty in practice
Needs: transition function between parameters in multiple model spaces.
Efficiency depends completely on this functional choice and the distribution for the auxiliary variables.
Works well when we can use birth / death process (change-point analysis).