Simulation - Lecture 8 - MCMC: Gibbs
Lecture version: Monday 9th March, 2020, 12:22
Robert Davies
Part A Simulation and Statistical Programming
Hilary Term 2020
Part A Simulation. HT 2020. R. Davies. 1 / 28
Outline
Metropolis Hastings
Gibbs sampler
Example: Gaussian mixture model

- Let p(x) = ∑_{k=1}^K πk pk(x), where π is a vector of mixture proportions, and pk is a normal density with mean µk and variance σk²
- Here, suppose π1 = π2 = 0.5 with µ1 = 3, µ2 = 5 and σ1 = σ2 = 1
- MH algorithm with target pdf p and proposal transition pdf

  q(y|x) = 1 for y ∈ [x − 1/2, x + 1/2], and 0 otherwise

- Acceptance probability

  α(y|x) = min(1, p(y)q(x|y) / (p(x)q(y|x))) = min(1, p(y)/p(x))
Code
set.seed(9110)
p <- function(x) {
  0.5 * dnorm(x, mean = 3) + 0.5 * dnorm(x, mean = 5)
}
n <- 10000
x <- numeric(n)
x[1] <- 4
for (t in 1:(n - 1)) {
  yt <- x[t] + (runif(1) - 0.5)
  if (runif(1) < (p(yt) / p(x[t]))) {
    x[t + 1] <- yt
  } else {
    x[t + 1] <- x[t]
  }
}
[Figure: trace plot of x[t] against t, and a histogram of x[t] overlaid with the joint density and the two component densities]

Figure: µ1 = 3, µ2 = 5
[Figure: trace plot of x[t] against t, and a histogram of x[t] overlaid with the joint density and the two component densities]

Figure: µ1 = 3, µ2 = 7
[Figure: trace plot of x[t] against t, and a histogram of x[t] overlaid with the joint density and the two component densities]

Figure: µ1 = 3, µ2 = 13
Bivariate normal example

- Consider a Metropolis algorithm for simulating two-dimensional samples Zt = (Xt, Yt) from a bivariate normal distribution with mean µ = (0, 0) and covariance

  Σ = ( 1       0.7√2 )
      ( 0.7√2   2     )

- Use a proposal distribution that updates one coordinate i ∈ {1, 2} at a time, centred on its current value, e.g. for the first coordinate

  Xt+1 | Xt = x, Yt ~ U(x − w, x + w)

  and similarly for the second
- The performance of the algorithm can be controlled by setting w
- If w is small then we propose small moves and the chain will move slowly
- If w is large then we propose larger moves but may accept only a few of them
- There is an 'art' to implementing MCMC that involves the choice of good proposal distributions.
Example
N = 4, w = 0.1, T = 100

[Figure: sampled points from the N = 4 chains; both axes span −4 to 4]
Example
N = 4, w = 0.1, T = 1000

[Figure: sampled points from the N = 4 chains; both axes span −4 to 4]
Example
N = 4, w = 0.1, T = 10000

[Figure: sampled points from the N = 4 chains; both axes span −4 to 4]
Example
N = 4, w = 0.01, T = 10000

[Figure: sampled points from the N = 4 chains; both axes span −4 to 4]
Example
N = 4, w = 0.01, T = 100000

[Figure: sampled points from the N = 4 chains; both axes span −4 to 4]
Example
N = 4, w = 10, T = 1000

[Figure: sampled points from the N = 4 chains; both axes span −4 to 4]
MCMC: Practical aspects
- The MCMC chain does not start from the stationary distribution, so E[φ(Xt)] ≠ E[φ(X)] and the difference can be significant for small t
- Xt converges in distribution to p as t → ∞
- Common practice is to discard the first b values of the Markov chain, X0, ..., Xb−1, where we assume that Xb is approximately distributed according to p
- We use the estimator

  (1 / (n − b)) ∑_{t=b}^{n−1} φ(Xt)

- The initial X0, ..., Xb−1 is called the burn-in period of the MCMC chain.
Burn in example

- Consider the first example of this lecture (GMM) and θ = E[X], with µ1 = 3, µ2 = 5
- Let X0 = −5. This is an unlikely starting point: p(−5) ≈ 2 × 10⁻¹⁵
- Note that in the real world, choosing good starting points is non-trivial
- Below, estimates of θ as a function of the burn-in length, across chains, for fixed n = 2000

[Figure: per-chain estimates of θ and their average, plotted against burn-in length from 0 to 500; vertical axis from 3.0 to 5.0]
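The burn-in estimator above can be sketched as follows (in Python rather than R, for illustration), using the lecture's GMM target with µ1 = 3, µ2 = 5, so that θ = E[X] = 4; the function names and the number of chains are our own choices:

```python
import numpy as np

def mh_chain(n, x0, seed):
    """Random-walk MH chain targeting 0.5 N(3, 1) + 0.5 N(5, 1),
    with U(-1/2, 1/2) proposal increments."""
    rng = np.random.default_rng(seed)
    # unnormalised mixture density; the 1/sqrt(2*pi) factor cancels in the ratio
    p = lambda v: 0.5 * np.exp(-0.5 * (v - 3) ** 2) + 0.5 * np.exp(-0.5 * (v - 5) ** 2)
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        y = x[t] + rng.uniform(-0.5, 0.5)
        x[t + 1] = y if rng.uniform() < p(y) / p(x[t]) else x[t]
    return x

def burned_in_mean(x, b):
    """Estimator (1 / (n - b)) * sum_{t=b}^{n-1} X_t."""
    return x[b:].mean()
```

Averaging the estimator over several chains started at X0 = −5 shows the downward bias of the no-burn-in estimate caused by the initial transient.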
MCMC: Gibbs Sampler
- The Gibbs sampling algorithm is a Markov chain Monte Carlo (MCMC) algorithm to simulate a Markov chain with a given stationary distribution
- Unlike Metropolis-Hastings, where we use a proposal and acceptance / rejection, in Gibbs sampling we make use of conditional distributions
- In the proof that follows we will assume a discrete distribution, but the result also holds for continuous distributions (unproved)
- Note we will use the notation x−j = (x1, ..., xj−1, xj+1, ..., xd) to indicate the entries of x minus the jth entry
- Further note we will use notation like xtj to refer to the jth entry of xt, where xt is the (t+1)st realization / sample of the Markov chain
Gibbs Sampler algorithm
1. Let p be the target pmf, with conditionals p(xj | x−j), for x with d dimensions
2. Either set X0 = x0, or draw X0 from some initial distribution
3. For t = 1, 2, ..., n − 1:
   3.1 Assume Xt−1 = xt−1, and set xt = xt−1
   3.2 Choose a permutation S of 1, ..., d
   3.3 For j ∈ S:
       3.3.1 Sample the jth entry xtj from p(xj | xt−j)
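As a concrete illustration of steps 3.1-3.3 (not from the lecture), here is a sketch in Python for an invented toy pmf on {0, 1} × {0, 1}; the pmf and names are our own:

```python
import numpy as np

# Invented target pmf on {0,1} x {0,1}; P[a, b] = p(x1 = a, x2 = b)
P = np.array([[0.1, 0.2],
              [0.3, 0.4]])

def gibbs(n_sweeps, seed=0):
    """Gibbs sampler for P, returning empirical frequencies of the visited states."""
    rng = np.random.default_rng(seed)
    x = np.array([0, 0])                     # step 2: fixed initial state
    counts = np.zeros_like(P)
    for _ in range(n_sweeps):
        for j in rng.permutation(2):         # step 3.2: random permutation of coordinates
            if j == 0:
                cond = P[:, x[1]] / P[:, x[1]].sum()   # p(x1 | x2)
            else:
                cond = P[x[0], :] / P[x[0], :].sum()   # p(x2 | x1)
            x[j] = rng.choice(2, p=cond)     # step 3.3.1: sample from the conditional
        counts[x[0], x[1]] += 1
    return counts / n_sweeps
```

After many sweeps the empirical state frequencies approach the target pmf, consistent with p being the stationary distribution.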
Gibbs Sampler: transition matrix

- The Gibbs sampler defines a Markov chain with transition matrix P such that, for x, y ∈ Ω with xi = yi for all i ≠ j, and with xj = a, yj = b,

  Px,y = p(xj = b | x−j) = p(x1, ..., xj = b, ...) / ∑_{a*} p(x1, ..., xj = a*, ...)
Gibbs Sampler theorem and proof
Theorem
The transition matrix P of the Markov chain generated by the Gibbs sampler is reversible with respect to p and therefore admits p as a stationary distribution.

- Proof: We check detailed balance. For x ≠ y ∈ Ω with xi = yi for all i ≠ j, and with xj = a, yj = b:

  p(x)Px,y = p(x) p(xj = b | x−j)
           = p(x1, ..., xj = a, ...) · p(x1, ..., xj = b, ...) / ∑_{b*} p(x1, ..., xj = b*, ...)
           = p(x1, ..., xj = b, ...) · p(x1, ..., xj = a, ...) / ∑_{a*} p(x1, ..., xj = a*, ...)
           = p(y) p(xj = a | x−j)
           = p(y)Py,x

  (the two sums are equal, as both run over all possible values of the jth entry)
Gibbs sampling and Metropolis-Hastings

- Recall that in MH we have an acceptance probability given by

  α(y|x) = min(1, p(y)q(x|y) / (p(x)q(y|x)))

- Gibbs sampling can be viewed as a special case of the MH algorithm with proposal distributions given by the conditionals p(xj = b | x−j). We then get, for xj = a and yj = b with xi = yi otherwise, that

  p(y)q(x|y) / (p(x)q(y|x)) = p(y) p(xj = a | x−j) / (p(x) p(xj = b | x−j))
                            = [p(..., xj = b, ...) · p(..., xj = a, ...) / ∑_{a*} p(..., xj = a*, ...)] / [p(..., xj = a, ...) · p(..., xj = b, ...) / ∑_{b*} p(..., xj = b*, ...)]
                            = 1

- That is, every proposed move is accepted
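The cancellation above can be verified numerically; a sketch in Python with an invented pmf on {0, 1} × {0, 1} (the pmf and names are our own):

```python
import numpy as np

# Invented target pmf; P[a, b] = p(x1 = a, x2 = b)
P = np.array([[0.1, 0.2],
              [0.3, 0.4]])

def mh_ratio_for_gibbs_proposal(x, j, b):
    """MH ratio p(y)q(x|y) / (p(x)q(y|x)) when the proposal for
    coordinate j is the full conditional p(x_j | x_-j)."""
    y = list(x)
    y[j] = b
    if j == 0:
        cond = P[:, x[1]] / P[:, x[1]].sum()   # p(x1 = . | x2)
    else:
        cond = P[x[0], :] / P[x[0], :].sum()   # p(x2 = . | x1)
    q_yx = cond[b]       # forward move proposes b for coordinate j
    q_xy = cond[x[j]]    # reverse move proposes a = x[j]; same conditional since x_-j = y_-j
    return (P[y[0], y[1]] * q_xy) / (P[x[0], x[1]] * q_yx)
```

The ratio equals 1 for every state, coordinate, and proposed value, so every move is accepted.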
GMM example, revisited

- Return to the Gaussian mixture model example from the beginning of the lecture
- Let p(x) = ∑_{k=1}^K πk pk(x), where π is a vector of mixture proportions, and pk is a normal density with mean µk and variance σk² = 1
- Consider a latent (hidden) variable Z indicating whether we are in the first or the second component, so that P(Z = k) = πk and

  X | Z = k ~ N(µk, σk²)

  P(Z = k | X = x) = fk(x | µk, σk) πk / ∑_{j=1}^K fj(x | µj, σj) πj

- We can jointly sample from the distribution of (X, Z), and then keep just the X samples that we want
[Figure: trace plot of x[t] against t, and a histogram of x[t] overlaid with the joint density and the two component densities]

Figure: µ1 = 3, µ2 = 7
Code
set.seed(9110)
n <- 10000
pi <- c(0.5, 0.5); mus <- c(3, 7)
x <- numeric(n); z <- numeric(n)
x[1] <- 4; z[1] <- 1
for (t in 1:(n - 1)) {
  x[t + 1] <- x[t]; z[t + 1] <- z[t]
  for (s in sample(1:2)) {
    if (s == 1) { ## sample z
      u <- dnorm(x[t + 1], mean = mus[1]) * pi[1]
      l <- dnorm(x[t + 1], mean = mus[2]) * pi[2]
      choose1 <- runif(1) < (u / (u + l))
      z[t + 1] <- c(2, 1)[as.integer(choose1) + 1]
    } else { ## sample x
      x[t + 1] <- rnorm(1, mean = mus[z[t + 1]], sd = 1)
    }
  }
}
Bivariate normal example revisited

- Consider, from earlier this lecture, the bivariate normal example: generate samples Zt = (Xt, Yt) from a bivariate normal distribution with mean µ = (0, 0) and covariance

  Σ = ( 1       0.7√2 )
      ( 0.7√2   2     )

- In the general case of a bivariate normal with means µx, µy, variances σx², σy² and correlation ρ, one can derive (not shown here) the conditional distribution of X given Y = y:

  X | Y = y ~ N(µx + ρ (σx / σy)(y − µy), σx²(1 − ρ²))

- Which in our case is

  Xt+1 | Yt = y ~ N(0.7 y / √2, 1 − 0.7²)

- In other words, we initialize (X0, Y0) at some value, and then alternately sample X and Y from their full conditional distributions
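The alternating scheme can be sketched as follows (in Python rather than the lecture's R, for illustration; the function name is our own):

```python
import numpy as np

def gibbs_bivariate(n, seed=0):
    """Gibbs sampler for the bivariate normal with mean (0, 0),
    variances (1, 2) and correlation rho = 0.7."""
    rng = np.random.default_rng(seed)
    rho, sx, sy = 0.7, 1.0, np.sqrt(2.0)
    x, y = 0.0, 0.0                      # initialise (X0, Y0)
    out = np.empty((n, 2))
    for t in range(n):
        # X | Y = y ~ N(rho*(sx/sy)*y, sx^2*(1 - rho^2))
        x = rng.normal(rho * (sx / sy) * y, sx * np.sqrt(1 - rho ** 2))
        # Y | X = x ~ N(rho*(sy/sx)*x, sy^2*(1 - rho^2))
        y = rng.normal(rho * (sy / sx) * x, sy * np.sqrt(1 - rho ** 2))
        out[t] = (x, y)
    return out
```

Every draw is accepted, and the empirical moments of a long run match the target's mean, variances and correlation.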
Bivariate normal example revisited, example
N = 1, T = 1000

[Figure: scatter of the Gibbs samples; both axes span −4 to 4]
Recap
- When using MCMC in practice, one must carefully choose tuning parameters and the proposal distribution to avoid insufficient mixing / exploration of the space, and account for burn-in
- Gibbs sampling is a particular form of MCMC in which we sample from conditional distributions instead of using a proposal
- It is akin to Metropolis-Hastings with an acceptance probability of 1