
Simulation - Lectures 8 - MCMC: Gibbs
Lecture version: Monday 9th March, 2020, 12:22

Robert Davies

Part A Simulation and Statistical Programming

Hilary Term 2020

Part A Simulation. HT 2020. R. Davies. 1 / 28


Outline

Metropolis-Hastings

Gibbs sampler

Part A Simulation. HT 2020. R. Davies. 2 / 28


Example: Gaussian mixture model

- Let $p(x) = \sum_{k=1}^{K} \pi_k p_k(x)$, where $\pi$ is a vector of mixture proportions, and $p_k$ is a normal density with mean $\mu_k$ and variance $\sigma_k^2$

- Here, suppose $\pi_1 = 0.5$ with $\mu_1 = 3$, $\mu_2 = 5$ and $\sigma_1 = \sigma_2 = 1$

- MH algorithm with target pdf $p$ and proposal transition pdf
  $$q(y \mid x) = \begin{cases} 1 & \text{for } y \in [x - 1/2,\, x + 1/2] \\ 0 & \text{otherwise} \end{cases}$$

- Acceptance probability
  $$\alpha(y \mid x) = \min\left(1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)}\right) = \min\left(1,\, p(y)/p(x)\right)$$

Part A Simulation. HT 2020. R. Davies. 3 / 28


Code

set.seed(9110)
p <- function(x) {
  0.5 * dnorm(x, mean = 3) + 0.5 * dnorm(x, mean = 5)
}
n <- 10000
x <- numeric(n)
x[1] <- 4
for (t in 1:(n - 1)) {
  yt <- x[t] + (runif(1) - 0.5)      ## propose y ~ U(x[t] - 1/2, x[t] + 1/2)
  if (runif(1) < (p(yt) / p(x[t]))) {
    x[t + 1] <- yt                   ## accept
  } else {
    x[t + 1] <- x[t]                 ## reject: stay at the current value
  }
}

Part A Simulation. HT 2020. R. Davies. 4 / 28


[Trace plot of x[t] against t, and histogram of x[t] with the joint density and the two component densities overlaid.]

Figure: µ1 = 3, µ2 = 5

Part A Simulation. HT 2020. R. Davies. 5 / 28


[Trace plot of x[t] against t, and histogram of x[t] with the joint density and the two component densities overlaid.]

Figure: µ1 = 3, µ2 = 7

Part A Simulation. HT 2020. R. Davies. 6 / 28


[Trace plot of x[t] against t, and histogram of x[t] with the joint density and the two component densities overlaid.]

Figure: µ1 = 3, µ2 = 13

Part A Simulation. HT 2020. R. Davies. 7 / 28


Bivariate normal example

- Consider a Metropolis algorithm for simulating two-dimensional samples $Z_t = (X_t, Y_t)$ from a bivariate Normal distribution with mean $\mu = (0, 0)$ and covariance
  $$\Sigma = \begin{pmatrix} 1 & 0.7\sqrt{2} \\ 0.7\sqrt{2} & 2 \end{pmatrix}$$

- Use a proposal distribution that considers only the current value for one dimension at a time, $i \in \{1, 2\}$; for example
  $$X_{t+1} \mid X_t = x, Y_t \sim U(x - w,\, x + w)$$
  and similarly for $Y$

- The performance of the algorithm can be controlled by setting $w$
- If $w$ is small then we propose smaller moves and the chain will move slowly
- If $w$ is large then we propose larger moves and may accept only a few moves
- There is an 'art' to implementing MCMC that involves the choice of good proposal distributions (a sketch of this sampler follows below)

Part A Simulation. HT 2020. R. Davies. 8 / 28
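
A minimal sketch of this coordinate-wise random-walk Metropolis sampler (our code, not from the slides; the unnormalized density helper dtarget is our own):

set.seed(9110)
rho <- 0.7                                ## correlation implied by Sigma above
dtarget <- function(z) {
  ## unnormalized bivariate normal density with sd_x = 1, sd_y = sqrt(2)
  q <- (z[1]^2 - 2 * rho * z[1] * z[2] / sqrt(2) + z[2]^2 / 2) / (1 - rho^2)
  exp(-q / 2)
}
w <- 0.1; T <- 1000
Z <- matrix(0, nrow = T, ncol = 2)        ## chain starts at (0, 0)
for (t in 1:(T - 1)) {
  Z[t + 1, ] <- Z[t, ]
  for (i in 1:2) {                        ## propose each coordinate in turn
    prop <- Z[t + 1, ]
    prop[i] <- prop[i] + runif(1, -w, w)  ## uniform window of half-width w
    if (runif(1) < dtarget(prop) / dtarget(Z[t + 1, ])) {
      Z[t + 1, ] <- prop                  ## accept the coordinate move
    }
  }
}
plot(Z[, 1], Z[, 2], xlim = c(-4, 4), ylim = c(-4, 4))

With small $w$ most proposals are accepted but the chain explores slowly, consistent with the figures that follow.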


Example

N = 4, w = 0.1, T = 100

[Scatter plot of the sampled points; both axes from −4 to 4.]

Part A Simulation. HT 2020. R. Davies. 9 / 28


Example

N = 4, w = 0.1, T = 1000

[Scatter plot of the sampled points; both axes from −4 to 4.]

Part A Simulation. HT 2020. R. Davies. 10 / 28


Example

N = 4, w = 0.1, T = 10000

[Scatter plot of the sampled points; both axes from −4 to 4.]

Part A Simulation. HT 2020. R. Davies. 11 / 28


Example

N = 4, w = 0.01, T = 10000

[Scatter plot of the sampled points; both axes from −4 to 4.]

Part A Simulation. HT 2020. R. Davies. 12 / 28


Example

N = 4, w = 0.01, T = 100000

[Scatter plot of the sampled points; both axes from −4 to 4.]

Part A Simulation. HT 2020. R. Davies. 13 / 28


Example

N = 4, w = 10, T = 1000

[Scatter plot of the sampled points; both axes from −4 to 4.]

Part A Simulation. HT 2020. R. Davies. 14 / 28


MCMC: Practical aspects

- The MCMC chain does not start from the stationary distribution, so $E[\phi(X_t)] \neq E[\phi(X)]$ and the difference can be significant for small $t$

- $X_t$ converges in distribution to $p$ as $t \to \infty$

- Common practice is to discard the first $b$ values of the Markov chain, $X_0, \ldots, X_{b-1}$, where we assume that $X_b$ is approximately distributed according to $p$

- We then use the estimator (see the snippet after this slide)
  $$\frac{1}{n - b} \sum_{t=b}^{n-1} \phi(X_t)$$

- The initial $X_0, \ldots, X_{b-1}$ is called the burn-in period of the MCMC chain.

Part A Simulation. HT 2020. R. Davies. 15 / 28
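
A minimal sketch of this estimator (ours, not from the slides), reusing the chain x and length n from the GMM code earlier, with $\phi$ the identity:

## x[1], ..., x[n] hold X_0, ..., X_{n-1} from the GMM Metropolis code above
b <- 500                          ## discard X_0, ..., X_{b-1} as burn-in
theta_hat <- mean(x[(b + 1):n])   ## (1 / (n - b)) * sum_{t = b}^{n - 1} phi(X_t)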


Burn-in example

- Consider the first example of this lecture (GMM) with $\theta = E[X]$, for $\mu_1 = 3$, $\mu_2 = 5$.

- Let $X_0 = -5$. This is an unlikely starting point: $p(-5) \approx 2 \times 10^{-15}$

- Note that in the real world, choosing good starting points is non-trivial
- Below, estimates of $\theta$ as a function of burn-in length across chains, given fixed $n = 2000$

[Estimates of θ against burn-in length, for fixed n = 2000: per-chain estimates and their average.]

Part A Simulation. HT 2020. R. Davies. 16 / 28


Outline

Metropolis-Hastings

Gibbs sampler

Part A Simulation. HT 2020. R. Davies. 17 / 28


MCMC: Gibbs Sampler

- The Gibbs sampling algorithm is a Markov chain Monte Carlo (MCMC) algorithm to simulate a Markov chain with a given stationary distribution

- Unlike Metropolis-Hastings, where we use a proposal and acceptance / rejection, in Gibbs sampling we make use of conditional distributions

- In the proof that follows we will assume a discrete distribution, but the result also holds for continuous distributions (not proved here)

- Note we will use the notation $x_{-j} = (x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d)$ to indicate the entries of $x$ minus the $j$th entry

- Further, we will use notation like $x_j^t$ to refer to the $j$th entry of $x^t$, where $x^t$ is the $(t+1)$st realization / sample of the Markov chain

Part A Simulation. HT 2020. R. Davies. 18 / 28


Gibbs Sampler algorithm

1. Let $p$ be the target pmf with conditionals $p(x_j \mid x_{-j})$ for $x$ with $d$ dimensions

2. Either set $X^0 = x^0$, or draw $X^0$ from some initial distribution

3. For $t = 1, 2, \ldots, n - 1$:

   3.1 Assume $X^{t-1} = x^{t-1}$, and set $x^t = x^{t-1}$
   3.2 Choose a permutation $S$ of $1, \ldots, d$
   3.3 For $j \in S$:
       3.3.1 Sample $X_j^t$ from $p(x_j^t \mid x_{-j}^t)$

(A generic sketch of this algorithm in R follows after this slide.)

Part A Simulation. HT 2020. R. Davies. 19 / 28
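
A minimal generic sketch (ours, not from the slides), where sample_conditional(j, x) is an assumed user-supplied function that draws from $p(x_j \mid x_{-j})$:

gibbs <- function(n, x0, sample_conditional) {
  d <- length(x0)
  X <- matrix(0, nrow = n, ncol = d)
  X[1, ] <- x0                       ## step 2: fixed starting point
  for (t in 1:(n - 1)) {
    X[t + 1, ] <- X[t, ]             ## step 3.1
    for (j in sample(1:d)) {         ## steps 3.2-3.3: random-scan order
      X[t + 1, j] <- sample_conditional(j, X[t + 1, ])  ## step 3.3.1
    }
  }
  X
}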


About the Gibbs Sampler

- The Gibbs sampler defines a Markov chain with transition matrix $P$ such that, for $x, y \in \Omega$ with $x_i = y_i\ \forall i \neq j$, $x_j = a$ and $y_j = b$,
  $$P_{x,y} = p(x_j = b \mid x_{-j}) = \frac{p(x_1, \ldots, x_j = b, \ldots)}{\sum_{a^*} p(x_1, \ldots, x_j = a^*, \ldots)}$$

Part A Simulation. HT 2020. R. Davies. 20 / 28


Gibbs Sampler theorem and proof

Theorem

The transition matrix $P$ of the Markov chain generated by the Gibbs sampler is reversible with respect to $p$ and therefore admits $p$ as a stationary distribution.

- Proof: We check detailed balance. For $x \neq y \in \Omega$ with $x_i = y_i\ \forall i \neq j$, $x_j = a$ and $y_j = b$:
  $$\begin{aligned}
  p(x) P_{x,y} &= p(x)\, p(x_j = b \mid x_{-j}) \\
  &= p(x_1, \ldots, x_j = a, \ldots)\, \frac{p(x_1, \ldots, x_j = b, \ldots)}{\sum_{a^*} p(x_1, \ldots, x_j = a^*, \ldots)} \\
  &= p(x_1, \ldots, x_j = b, \ldots)\, \frac{p(x_1, \ldots, x_j = a, \ldots)}{\sum_{a^*} p(x_1, \ldots, x_j = a^*, \ldots)} \\
  &= p(y)\, p(x_j = a \mid x_{-j}) \\
  &= p(y) P_{y,x}
  \end{aligned}$$
  where the second and third lines are equal because the expression is symmetric in $a$ and $b$ (the denominator sums over the $j$th coordinate, so it is the same in both). A small numerical check follows after this slide.

Part A Simulation. HT 2020. R. Davies. 21 / 28
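
As a quick numerical illustration (ours, not from the slides), detailed balance can be checked for a toy pmf on $\{0, 1\}^2$ and a single-coordinate Gibbs move:

## Toy pmf on {0,1}^2, states labelled "x1x2"
p <- c("00" = 0.1, "01" = 0.2, "10" = 0.3, "11" = 0.4)
## Gibbs move in coordinate j = 1: x and y agree in coordinate 2, so
## P(x -> y) = p(x1 = y1 | x2) = p(y) / (p(x) + p(y))
P <- function(x, y) p[[y]] / (p[[x]] + p[[y]])
x <- "00"; y <- "10"                                   ## differ only in coordinate 1
isTRUE(all.equal(p[[x]] * P(x, y), p[[y]] * P(y, x)))  ## TRUE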


Gibbs sampling and Metropolis-Hastings

- Recall that in MH we have an acceptance probability given by
  $$\alpha(y \mid x) = \min\left(1, \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)}\right)$$

- Gibbs sampling can be viewed as a special case of the MH algorithm with proposal distributions given by the conditional $p(x_j = b \mid x_{-j})$. We then get, for $x_j = a$ and $y_j = b$ with $x_i = y_i$ otherwise, that
  $$\frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)} = \frac{p(y)\, p(x_j = a \mid x_{-j})}{p(x)\, p(x_j = b \mid x_{-j})} = \frac{p(\ldots, x_j = b, \ldots)\, p(\ldots, x_j = a, \ldots) / \sum_{a^*} p(\ldots, x_j = a^*, \ldots)}{p(\ldots, x_j = a, \ldots)\, p(\ldots, x_j = b, \ldots) / \sum_{b^*} p(\ldots, x_j = b^*, \ldots)} = 1$$

- That is, every proposed move is accepted

Part A Simulation. HT 2020. R. Davies. 22 / 28


GMM example, revisited

- Return to the Gaussian Mixture Model example from the beginning of the lecture

- Let $p(x) = \sum_{k=1}^{K} \pi_k p_k(x)$, where $\pi$ is a vector of mixture proportions, and $p_k$ is a normal density with mean $\mu_k$ and variance $\sigma_k^2 = 1$

- Consider a latent (hidden) variable $Z$ indicating whether we are in the first or second distribution, so that $P(Z = k) = \pi_k$ and
  $$X \mid Z = k \sim N(\mu_k, \sigma_k^2)$$
  $$P(Z = k \mid X = x) = \frac{f_k(x \mid \mu_k, \sigma_k)\, \pi_k}{\sum_{j=1}^{K} f_j(x \mid \mu_j, \sigma_j)\, \pi_j}$$

- We can jointly sample from the distribution of $(X, Z)$, and then just keep the $X$ samples that we want

Part A Simulation. HT 2020. R. Davies. 23 / 28


[Trace plot of x[t] against t, and histogram of x[t] with the joint density and the two component densities overlaid.]

Figure: µ1 = 3, µ2 = 7

Part A Simulation. HT 2020. R. Davies. 24 / 28


Code

set.seed(9110)
n <- 10000
pi <- c(0.5, 0.5); mus <- c(3, 7)
x <- numeric(n); z <- numeric(n)
x[1] <- 4; z[1] <- 1
for (t in 1:(n - 1)) {
  x[t + 1] <- x[t]; z[t + 1] <- z[t]
  for (s in sample(1:2)) {   ## random scan over the two coordinates
    if (s == 1) {            ## sample z given x
      u <- dnorm(x[t + 1], mean = mus[1]) * pi[1]
      l <- dnorm(x[t + 1], mean = mus[2]) * pi[2]
      choose1 <- runif(1) < (u / (u + l))
      z[t + 1] <- c(2, 1)[as.integer(choose1) + 1]  ## 1 if choose1, else 2
    } else {                 ## sample x given z
      x[t + 1] <- rnorm(1, mean = mus[z[t + 1]], sd = 1)
    }
  }
}

Part A Simulation. HT 2020. R. Davies. 25 / 28


Bivariate normal example revisited

- Consider, from earlier this lecture, the bivariate normal: generate samples $Z_t = (X_t, Y_t)$ from a bivariate Normal distribution with mean $\mu = (0, 0)$ and covariance
  $$\Sigma = \begin{pmatrix} 1 & 0.7\sqrt{2} \\ 0.7\sqrt{2} & 2 \end{pmatrix}$$

- In the general case for a bivariate normal with means $\mu_x, \mu_y$, variances $\sigma_x^2, \sigma_y^2$ and correlation $\rho$, one can derive (not shown here) the conditional distribution $f(x \mid y)$ as
  $$X_{t+1} \mid Y_t = y \sim N\left(\mu_x + \rho \frac{\sigma_x}{\sigma_y}(y - \mu_y),\; \sigma_x^2 (1 - \rho^2)\right)$$

- Which in our case is
  $$X_{t+1} \mid Y_t = y \sim N\left(0.7 \frac{1}{\sqrt{2}}\, y,\; 1 - 0.7^2\right)$$

- In other words, we initialize $(X_0, Y_0)$ at some value, and alternately sample $X$ and $Y$ from their conditional distributions (a sketch follows below)

Part A Simulation. HT 2020. R. Davies. 26 / 28
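
A minimal sketch of this two-variable Gibbs sampler (ours, not from the slides; the conditional for $Y$ given $X$ follows by symmetry and is our addition):

set.seed(9110)
rho <- 0.7; T <- 1000
X <- numeric(T); Y <- numeric(T)   ## initialize (X0, Y0) = (0, 0)
for (t in 1:(T - 1)) {
  ## X | Y = y ~ N(rho * (sigma_x / sigma_y) * y, sigma_x^2 * (1 - rho^2))
  X[t + 1] <- rnorm(1, mean = rho / sqrt(2) * Y[t], sd = sqrt(1 - rho^2))
  ## Y | X = x ~ N(rho * (sigma_y / sigma_x) * x, sigma_y^2 * (1 - rho^2))
  Y[t + 1] <- rnorm(1, mean = rho * sqrt(2) * X[t + 1], sd = sqrt(2 * (1 - rho^2)))
}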


Bivariate normal example revisited, example

N = 1, T = 1000

[Scatter plot of the Gibbs samples; both axes from −4 to 4.]

Part A Simulation. HT 2020. R. Davies. 27 / 28


Recap

- When using MCMC in practice, one must carefully consider the tuning parameters and the proposal distribution to avoid insufficient mixing / exploration of the space, as well as burn-in

- Gibbs sampling is a particular form of MCMC where we use conditional probabilities instead of a proposal

- It is akin to Metropolis-Hastings with an acceptance probability of 1

Part A Simulation. HT 2020. R. Davies. 28 / 28