

Page 1: talk MCMC & SMC 2004

Some recent advances in Markov chain and sequential Monte Carlo methods

Stéphane Sénécal

The Institute of Statistical Mathematics,

Research Organization of Information and Systems

15/12/2004

thanks to the Japan Society for the Promotion of Science

1

Page 2: talk MCMC & SMC 2004

Estimation

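[Diagram: system S = F(·, θ), with labels x and b]

Information on (x, θ): probability distribution

$$p(x, \theta \mid y, F, \text{prior}) \propto p(y \mid x, \theta, F, \text{prior}) \times p(x, \theta \mid \text{prior})$$

⇒ Estimates $(\hat{x}, \hat{\theta})$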

2

Page 3: talk MCMC & SMC 2004

Estimates

• Maximum a posteriori (MAP)

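$$(\hat{x}, \hat{\theta}) = \arg\max_{x,\theta}\, p(x, \theta \mid y, \text{prior})$$

• Expectation: posterior mean $E\{x, \theta \mid y, \text{prior}\}$

$$E_{p(\cdot \mid y, \text{prior})}\{f(x, \theta)\} = \int f(x, \theta)\, p(x, \theta \mid y, \text{prior})\, d(x, \theta)$$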

Computation : asymptotic, numerical, stochastic methods

⇒ Monte Carlo simulation methods

3

Page 4: talk MCMC & SMC 2004

Monte Carlo Estimates

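$$x_1, \ldots, x_N \sim \pi \;\Rightarrow\; \pi_N = \frac{1}{N} \sum_{n=1}^{N} \delta_{x_n}$$

$$S_N(f) = \frac{1}{N} \sum_{n=1}^{N} f(x_n) \longrightarrow \int f(x)\, \pi(x)\, dx = E_\pi\{f\}$$

$\hat{x}_{\max} = \arg\max_{x_n} \pi_N$ approximates $x_{\max} = \arg\max_x \pi(x)$

⇒ generate samples $x_\ell \sim \pi$ ?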

→ Markov chain and sequential Monte Carlo

4

Page 5: talk MCMC & SMC 2004

Overview

• Introduction to Markov chain Monte Carlo (MCMC)

Space alternating techniques

Estimation of Gaussian mixture models

• Introduction to Sequential Monte Carlo (SMC)

Fixed-lag sampling techniques

Recursive estimation of time series models

5

Page 6: talk MCMC & SMC 2004

Simulation Techniques

• Classical distributions: cumulative distribution function

→ transformation of a uniform random variable

• Non-standard distributions on $\mathbb{R}^n$, known up to a normalizing

constant → use of an instrumental distribution:

Accept-reject, importance sampling → sequential/recursive

⇒ SMC aka particle filtering, condensation algorithm

⇒ MCMC : distribution = fixed point of an operator

π = Kπ

→ simulation schemes with Markov chain: Hastings-Metropolis,

Gibbs sampling
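The first bullet above (a classical distribution simulated by transforming a uniform variable through its inverse cumulative distribution function) can be sketched as follows. The exponential distribution, the function name `sample_exponential` and the seed are illustrative assumptions, not taken from the talk.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one Exp(rate) variate by inverting its CDF F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                    # u ~ U[0, 1)
    return -math.log(1.0 - u) / rate    # F^{-1}(u); 1 - u lies in (0, 1], so the log is finite

rng = random.Random(0)                  # fixed seed, for reproducibility only
draws = [sample_exponential(2.0, rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)          # should be close to 1 / rate = 0.5
```

The same recipe applies to any distribution whose inverse CDF is available in closed form; the other techniques on this slide address the cases where it is not.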

6

Page 7: talk MCMC & SMC 2004

Markov Chain

Definition:

$$X_n \mid X_{n-1}, X_{n-2}, \ldots, X_0 \overset{d}{=} X_n \mid X_{n-1}$$

homogeneity: $X_n \mid X_{n-1}$ independent of $n$

Realization:

$X_0 \sim \pi_0(x_0)$

p.d.f. of $X_n \mid X_{n-1}$ = transition kernel $K(x_n \mid x_{n-1})$

7

Page 8: talk MCMC & SMC 2004

Simulation of Markov chain

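Convergence: $X_n \sim \pi$ asymptotically ?

π-invariance: $\pi(\cdot) = K\pi(\cdot)$

$$\int_A \pi(x)\, dx = \int_{y \in A} \int K(y \mid x)\, \pi(x)\, dx\, dy$$

⇐ π-reversibility: $\Pr(A \to B) = \Pr(B \to A)$

$$\int_{y \in B} \int_{x \in A} K(y \mid x)\, \pi(x)\, dx\, dy = \int_{y \in A} \int_{x \in B} K(y \mid x)\, \pi(x)\, dx\, dy$$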

Construct kernels K(.|.) such that the chain is π-invariant

• Hastings-Metropolis algorithm

• Gibbs sampling

8

Page 9: talk MCMC & SMC 2004

Hastings-Metropolis

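Draw x from π(·)

1. initialize $x_0 \sim \pi_0(x)$

2. Iteration ℓ

• propose candidate $x^\star$ for $x_{\ell+1}$ → $x^\star \sim q(x \mid x_\ell)$

• accept it with prob $\alpha = \min\{1, r\}$

3. ℓ ← ℓ + 1 and go to (2)

$$r = \frac{\pi(x^\star)\, q(x_\ell \mid x^\star)}{q(x^\star \mid x_\ell)\, \pi(x_\ell)}$$

→ $\pi(x) K(y \mid x) = \pi(y) K(x \mid y)$

$$\pi(x)\, q(y \mid x) \min\left\{1, \frac{\pi(y)\, q(x \mid y)}{q(y \mid x)\, \pi(x)}\right\} = \min\{\pi(x)\, q(y \mid x),\ \pi(y)\, q(x \mid y)\}$$

Proposals: $q(x^\star \mid x_\ell) = q(x^\star)$ or $q(x^\star \mid x_\ell) = q(|x^\star - x_\ell|)$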
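A minimal sketch of the iteration above with a symmetric random-walk proposal, for which the ratio r reduces to π(x*)/π(x_ℓ). The target 1/(1 + x²) and the step 0.1 are borrowed from the example slide that follows; the function names, starting point and seed are assumptions made for illustration.

```python
import math
import random

def log_target(x):
    """Unnormalized log-density of the target pi(x) ∝ 1 / (1 + x^2)."""
    return -math.log1p(x * x)

def random_walk_metropolis(n_iter, step, x0, rng):
    """Random-walk Hastings-Metropolis; returns the chain and the acceptance rate."""
    x, chain, accepted = x0, [], 0
    for _ in range(n_iter):
        candidate = rng.gauss(x, step)             # x* ~ q(.|x_l) = N(x_l, step^2)
        # q is symmetric, so r = pi(x*) / pi(x_l); work on the log scale
        log_r = log_target(candidate) - log_target(x)
        if math.log(1.0 - rng.random()) <= log_r:  # accept with probability min{1, r}
            x = candidate
            accepted += 1
        chain.append(x)                            # on rejection the chain stays at x_l
    return chain, accepted / n_iter

chain, acc_rate = random_walk_metropolis(20_000, 0.1, 0.0, random.Random(1))
```

With such a small step almost every candidate is accepted, but the chain moves slowly; a high acceptance rate alone does not indicate good exploration of the target.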

9

Page 10: talk MCMC & SMC 2004

Example

sample $x \sim p(x) \propto \frac{1}{1+x^2}$, 20,000 iterations

• $x^\star \sim \mathcal{N}(x_\ell, 0.1^2)$: acc. rate = 97%

• $x^\star \sim \mathcal{U}[a,b]$: acc. rate = 26%

[Figures: for each proposal, trace of the chain over the 2 × 10⁴ iterations and histogram of the samples]

10

Page 11: talk MCMC & SMC 2004

Gibbs sampling algorithm

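Sample $x = (x_1, \ldots, x_p) \sim \pi(x_1, \ldots, x_p)$

1. initialize $x^{(0)} \sim \pi_0(x)$, ℓ = 0

2. iteration ℓ: Sample

$$x_1^{(\ell+1)} \sim \pi_1(x_1 \mid x_2^{(\ell)}, \ldots, x_p^{(\ell)})$$

$$x_2^{(\ell+1)} \sim \pi_2(x_2 \mid x_1^{(\ell+1)}, x_3^{(\ell)}, \ldots, x_p^{(\ell)})$$

$$\vdots$$

$$x_p^{(\ell+1)} \sim \pi_p(x_p \mid x_1^{(\ell+1)}, \ldots, x_{p-1}^{(\ell+1)})$$

3. ℓ ← ℓ + 1 and go to (2)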

→ no rejection, reversible kernel

11

Page 12: talk MCMC & SMC 2004

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$$

$$x_1^{(\ell+1)} \mid x_2^{(\ell)} \sim \mathcal{N}\left(\rho x_2^{(\ell)},\ 1 - \rho^2\right)$$

$$x_2^{(\ell+1)} \mid x_1^{(\ell+1)} \sim \mathcal{N}\left(\rho x_1^{(\ell+1)},\ 1 - \rho^2\right)$$

[Figure: scatter plot of 5,000 samples in the (x1, x2) plane, ρ = 0.5, and histograms of $(x_1^{\ell}, x_2^{\ell})$]
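The two conditional draws of this bivariate Gaussian example can be sketched directly. The function name, the starting point (0, 0) and the seed are assumptions; ρ = 0.5 and the 5,000 samples follow the slide.

```python
import math
import random

def gibbs_bivariate_gaussian(n_samples, rho, rng):
    """Alternate draws from the two full conditionals of N(0, [[1, rho], [rho, 1]])."""
    sd = math.sqrt(1.0 - rho * rho)     # conditional standard deviation
    x1, x2 = 0.0, 0.0                   # arbitrary starting point (assumption)
    samples = []
    for _ in range(n_samples):
        x1 = rng.gauss(rho * x2, sd)    # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.gauss(rho * x1, sd)    # x2 | x1 ~ N(rho * x1, 1 - rho^2)
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_gaussian(5_000, 0.5, random.Random(2))

# Empirical correlation of the draws, to compare with rho
n = len(samples)
m1 = sum(a for a, _ in samples) / n
m2 = sum(b for _, b in samples) / n
cov = sum((a - m1) * (b - m2) for a, b in samples)
var1 = sum((a - m1) ** 2 for a, _ in samples)
var2 = sum((b - m2) ** 2 for _, b in samples)
empirical_rho = cov / math.sqrt(var1 * var2)
```

Every draw is kept since there is no accept/reject step; as ρ approaches 1 the conditional variance 1 − ρ² shrinks and successive samples become strongly correlated.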

12

Page 13: talk MCMC & SMC 2004

How to obtain fast converging simulation scheme ?

→ Missing Data, Data Augmentation, Latent Variables

Idea: extend the sampling space x → (x, z) and the distribution

π(x) → π(x, z) with constraint ∫ π(x, z) dz = π(x)

such that the Markov chain $(x^{(i)}, z^{(i)})$ converges to π(x, z) faster

• Optimization : Expectation-Maximization (EM) algorithm

• Simulation : Data Augmentation, Gibbs sampling

13

Page 14: talk MCMC & SMC 2004

Efficient Data Augmentation Schemes

Idea: construct the missing data space to be as uninformative as possible

[Figure: density π(x) with x ∼ π(x), and extended space (x, z) with π̃(x, z) = constant and (x, z) ∼ π̃(x, z)]

Information introduced in missing data ↘ : convergence ↗

14

Page 15: talk MCMC & SMC 2004

Efficient Data Augmentation Schemes

EM algorithm → Space Alternating Generalized EM

SAGE algorithm, Hero and Fessler 1994:

• update parameter components by subblocks

• specific missing data space associated with each subblock

• complete data spaces less informative → convergence rate ↗

15

Page 16: talk MCMC & SMC 2004

Efficient Data Augmentation Sampling Schemes

SAGE Idea → MCMC algorithm:

• sample parameter components by subblocks

• each subblock of parameters is sampled conditionally on a specific

missing data set

⇒ Space Alternating Data Augmentation (SADA)

A. Doucet, T. Matsui, S. Senecal 2004

• Optimization : EM algorithm → SAGE algorithm

• Simulation : DA, Gibbs sampling → SADA

16

Page 17: talk MCMC & SMC 2004

Overview - Space alternating techniques

• → Introduction to EM and SAGE algorithms

• Introduction to Data Augmentation and SADA algorithms

• Application to Finite Mixture of Gaussians

17

Page 18: talk MCMC & SMC 2004

EM and SAGE Algorithms

Bayesian framework: obtaining MAP estimate of random variable X

given realization of Y = y

$$x_{\mathrm{MAP}} = \arg\max_x\, p(x \mid y)$$

where

$$p(x \mid y) \propto p(y \mid x)\, p(x)$$

X is a random vector whose components are partitioned into n subsets

$$X = X_{1:n} = (X_1, \ldots, X_n)$$

Notation: $X_{-k} = X_{1:n} \setminus \{X_k\} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_n)$ and $Z_{k:j} = (Z_k, Z_{k+1}, \ldots, Z_j)$

18

Page 19: talk MCMC & SMC 2004

Expectation-Maximization (EM) algorithm

→ Maximize p (x|y)

⇒ introduce missing data Z with conditional distribution p (z|y, x)

EM, iteration i:

E-step: compute

$$Q(x, x^{(i-1)}) = \int \log\big(p(x, z \mid y)\big)\, p(z \mid y, x^{(i-1)})\, dz$$

M-step: set

$$x^{(i)} = \arg\max_x Q(x, x^{(i-1)})$$

19

Page 20: talk MCMC & SMC 2004

Space Alternating EM (SAGE) algorithm

→ Maximize p (x|y)

⇒ introduce n missing data sets $Z_{1:n}$, where each random variable/vector $Z_k$ is given a conditional distribution $p(z_k \mid y, x_{1:n})$ satisfying

$$p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$$

→ $z_k$ independent of $x_k$ conditionally on $x_{-k}$ and y

→ non-informative missing data space

20

Page 21: talk MCMC & SMC 2004

Space Alternating EM (SAGE) algorithm

SAGE, iteration i:

• select index k ∈ {1, …, n}, e.g. components updated cyclically: k = (i mod n) + 1

• EM step for computing $x_k^{(i)}$: set

$$x_k^{(i)} = \arg\max_{x_k} \int \log\big(p(x_{-k}^{(i-1)}, x_k, z_k \mid y)\big)\, p(z_k \mid y, x^{(i-1)})\, dz_k$$

and set $x_{-k}^{(i)} = x_{-k}^{(i-1)}$

21

Page 22: talk MCMC & SMC 2004

DA and SADA Algorithms

Bayesian framework: objective not only to maximize p(x|y) but to obtain random samples $\{X^{(i)}\}$ distributed according to p(x|y)

Based on samples $\{X^{(i)}\}$, approximation of MMSE estimate:

$$\hat{x}_{\mathrm{MMSE}} = \frac{1}{N} \sum_{i=1}^{N} X^{(i)} \;\to\; x_{\mathrm{MMSE}} = \int x\, p(x \mid y)\, dx$$

Also possible to compute posterior variances, confidence intervals or

predictive distributions.

Construction of efficient MCMC algorithms typically difficult

→ introduction of missing data

22

Page 23: talk MCMC & SMC 2004

Data Augmentation, Gibbs sampling

→ Sample p (x|y)

⇒ introduce missing data Z with joint posterior distribution

p (x, z|y) = p (x|y) p (z|y, x)

Data Augmentation algorithm, iteration i given $X^{(i-1)}$:

• Sample $Z^{(i)} \sim p(\cdot \mid y, X^{(i-1)})$

• Sample $X^{(i)} \sim p(\cdot \mid y, Z^{(i)})$

23

Page 24: talk MCMC & SMC 2004

Convergence of DA/Gibbs sampling algorithm

• Transition kernel associated to $\{X^{(i)}, Z^{(i)}\}$ admits p(x, z|y) as invariant distribution

• Under weak additional assumptions (irreducibility and aperiodicity), instantaneous distribution of $(X^{(i)}, Z^{(i)})$ converges towards p(x, z|y) as i → +∞

24

Page 25: talk MCMC & SMC 2004

Space Alternating Data Augmentation

→ Sample p (x|y)

⇒ introduce n missing data sets $Z_{1:n}$, where each random variable $Z_k$ is given a conditional distribution $p(z_k \mid y, x_{1:n})$ such that

$$p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$$

→ $z_k$ independent of $x_k$ conditionally on $x_{-k}$ and y

→ non-informative missing data space

Sampling of joint posterior distribution:

$$p(x_{1:n}, z_{1:n} \mid y) = p(x_{1:n} \mid y) \prod_{k=1}^{n} p(z_k \mid y, x_{1:n})$$

25

Page 26: talk MCMC & SMC 2004

Space Alternating Data Augmentation

SADA algorithm, iteration i

given $X_{1:n}^{(i-1)}$ and component index k:

• Sample $Z_k^{(i)} \sim p(\cdot \mid y, X^{(i-1)})$

• Sample $X_k^{(i)} \sim p(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)})$

• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$

Components updated cyclically k = (i mod n) + 1

26

Page 27: talk MCMC & SMC 2004

Validity of SADA sampling algorithm

Generation of Markov chain $\{X_{1:n}^{(i)}, Z_{1:n}^{(i)}\}$ with invariant distribution $p(x_{1:n}, z_{1:n} \mid y)$

Idea: SADA equivalent to

• Sample $Z_k^{(i)}, Z_{-k} \sim p(\cdot \mid y, X_{1:n}^{(i-1)})$

• Sample $X_k^{(i)}, Z_{-k} \sim p(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)})$

• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$

27

Page 28: talk MCMC & SMC 2004

Validity of SADA sampling algorithm

SADA → sample Zk and Xk but also Z−k at each iteration

sampling according to full conditional distributions p (z1:n|y, x1:n)

and p (x1:n|y, z1:n)

⇒ ad hoc invariant distribution p (x1:n, z1:n|y)

sampling of Z−k not necessary → discarded

28

Page 29: talk MCMC & SMC 2004

Overview - Space alternating techniques

• Introduction to EM and SAGE algorithms

• Introduction to Data Augmentation and SADA algorithms

• ⇒ Application to Finite Mixture of Gaussians

29

Page 30: talk MCMC & SMC 2004

Finite Mixture of Gaussians

EM/DA algorithms routinely used to perform ML/MAP parameter

estimation/to sample the posterior distribution

Straightforward extensions to hidden Markov chains with Gaussian

observations

T i.i.d. observations $Y_{1:T}$ in $\mathbb{R}^d$, distributed according to a finite mixture of s Gaussians

$$Y_t \sim \sum_{j=1}^{s} \pi_j\, \mathcal{N}(\mu_j; \Sigma_j)$$

30

Page 31: talk MCMC & SMC 2004

Bayesian Estimation

Parameters

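$X = \{(\mu_j, \Sigma_j, \pi_j);\ j = 1, \ldots, s\}$

unknown, random, distributed from conjugate prior distributions

$$\mu_j \mid \Sigma_j \sim \mathcal{N}(\alpha_j, \Sigma_j / \lambda_j)$$

$$\Sigma_j^{-1} \sim \mathcal{W}(r_j, C_j)$$

$$(\pi_1, \ldots, \pi_s) \sim \mathcal{D}(\zeta_1, \ldots, \zeta_s)$$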

31

Page 32: talk MCMC & SMC 2004

Bayesian Estimation

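$\Sigma^{-1} \sim \mathcal{W}(r, C)$: Wishart distribution, p.d.f. proportional to

$$|\Sigma^{-1}|^{\frac{1}{2}(r-d-1)} \exp\left(-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} C^{-1}\right)\right)$$

$(\pi_1, \ldots, \pi_s) \sim \mathcal{D}(\zeta_1, \ldots, \zeta_s)$: Dirichlet distribution restricted to the simplex, p.d.f. proportional to $\prod_{k=1}^{s} \pi_k^{\zeta_k - 1}$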

Hyperparameters {(αj , λj , rj , Cj , ζj) ; j = 1, . . . , s} assumed fixed but

could be estimated from data in a hierarchical Bayes model

32

Page 33: talk MCMC & SMC 2004

Missing Data for Finite Mixture of Gaussians

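EM/DA introduce the i.i.d. missing data $Z_t \in \{1, \ldots, s\}$ such that

$$Y_t \mid Z_t = j \sim \mathcal{N}(\mu_j; \Sigma_j)$$

$$\Pr(Z_t = j) = \pi_j$$

Gibbs sampling algorithm, iteration i:

• sample discrete latent variables $Z_t^{(i)} \sim p(\cdot \mid y_t, X^{(i-1)})$

• compute sufficient statistics

$$n_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}, \quad n_j^{(i)} \bar{y}_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t \quad \text{and} \quad S_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t y_t^{\mathsf{T}}$$

• sample parameters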
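The first two bullets of this iteration (allocation draw, then sufficient statistics) can be sketched for scalar observations. This is a simplified illustration under the assumption d = 1, so variances replace covariance matrices; components are indexed from 0 rather than 1, and all function names, data and seeds are hypothetical. The parameter sampling step is not covered here.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y (scalar case, an assumption of this sketch)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sample_allocations(ys, weights, means, variances, rng):
    """Draw Z_t with Pr(Z_t = j | y_t) ∝ pi_j N(y_t; mu_j, var_j), j = 0..s-1."""
    labels = range(len(weights))
    z = []
    for y in ys:
        post = [w * normal_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
        z.append(rng.choices(labels, weights=post)[0])   # unnormalized weights are fine
    return z

def sufficient_statistics(ys, z, s):
    """Per component j: count n_j, sum of y_t and sum of y_t^2 over allocated points."""
    n, sum_y, sum_yy = [0] * s, [0.0] * s, [0.0] * s
    for y, j in zip(ys, z):
        n[j] += 1
        sum_y[j] += y
        sum_yy[j] += y * y
    return n, sum_y, sum_yy

rng = random.Random(3)
# Toy data: 50 points around -5 and 50 points around +5
ys = [rng.gauss(-5.0, 1.0) for _ in range(50)] + [rng.gauss(5.0, 1.0) for _ in range(50)]
z = sample_allocations(ys, [0.5, 0.5], [-5.0, 5.0], [1.0, 1.0], rng)
n, sum_y, sum_yy = sufficient_statistics(ys, z, 2)
```

Here `sum_y[j]` plays the role of $n_j \bar{y}_j$ and `sum_yy[j]` the role of $S_j$ in the scalar case.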

33

Page 34: talk MCMC & SMC 2004

Gibbs sampling for Finite Mixture of Gaussians

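sampling parameters, iteration i:

$$\Sigma_j^{-1(i)} \sim \mathcal{W}\left(r_j + n_j^{(i)},\ \bar{\Sigma}_j^{-1(i)}\right)$$

$$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\left(m_j^{(i)},\ \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\right)$$

$$\left(\pi_1^{(i)}, \ldots, \pi_s^{(i)}\right) \sim \mathcal{D}\left(n_1^{(i)} + \zeta_1, \ldots, n_s^{(i)} + \zeta_s\right)$$

with

$$m_j^{(i)} = \frac{\lambda_j \alpha_j + n_j^{(i)} \bar{y}_j^{(i)}}{\lambda_j + n_j^{(i)}}$$

$$\bar{\Sigma}_j^{(i)} = C_j^{-1} + \lambda_j \alpha_j \alpha_j^{\mathsf{T}} + S_j^{(i)} - \left(\lambda_j + n_j^{(i)}\right) m_j^{(i)} m_j^{(i)\mathsf{T}}$$

(the bar distinguishes the Wishart scale matrix from the sampled covariance $\Sigma_j^{(i)}$)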

34

Page 35: talk MCMC & SMC 2004

Less Informative Missing Data

update only $(\mu_j, \tau_j^2)$, with $(\mu_{-j}, \tau_{-j}^2)$ fixed

→ binary missing data $Z_{t,j} \in \{0, j\}$ such that $\Pr(Z_{t,j} = j) = \pi_j$

variable $Z_{t,j}$ = “observation coming from component j or not”, less informative than knowing “from which particular component observation is derived”

constraint $\sum_{j=1}^{s} \pi_j = 1$ ⇒ cannot update $\pi_j$, use of standard EM approach for sampling the weights

35

Page 36: talk MCMC & SMC 2004

Less Informative Missing Data

→ updating jointly the parameters of two components j and k

(A. Doucet, T. Matsui and S. Senecal, 2004)

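→ missing data $Z_{t,j,k} \in \{0, j, k\}$ such that

$$\Pr(Z_{t,j,k} = j) = \pi_j, \quad \Pr(Z_{t,j,k} = k) = \pi_k$$

and

$$Y_t \mid Z_{t,j,k} = j \sim \mathcal{N}(\mu_j; \Sigma_j)$$

$$Y_t \mid Z_{t,j,k} = k \sim \mathcal{N}(\mu_k; \Sigma_k)$$

$$Y_t \mid Z_{t,j,k} = 0 \sim \frac{\sum_{l \neq j, l \neq k} \pi_l\, \mathcal{N}(\mu_l; \Sigma_l)}{\sum_{l \neq j, l \neq k} \pi_l}$$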
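Combining the prior probabilities and conditional densities above through Bayes' rule gives the posterior of $Z_{t,j,k}$: the states j and k keep their usual responsibilities, and state 0 collects the mass of all remaining components. A small sketch for scalar observations follows; d = 1, the 0-based component indices, the function names and the numerical values are assumptions made for illustration.

```python
import math

def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y (scalar case, an assumption of this sketch)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def pair_allocation_probs(y, j, k, weights, means, variances):
    """Return (Pr(Z=0|y), Pr(Z=j|y), Pr(Z=k|y)) for the three-state missing data."""
    dens = [w * normal_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    # State 0 pools every component other than j and k
    rest = sum(d for l, d in enumerate(dens) if l not in (j, k))
    return rest / total, dens[j] / total, dens[k] / total

# Three components; the pair (j, k) = (0, 1) is being updated
probs = pair_allocation_probs(0.1, 0, 1, [0.2, 0.3, 0.5], [-4.0, 0.0, 4.0], [1.0, 1.0, 1.0])
```

Knowing only that an observation falls in state 0 says nothing about which of the other components produced it, which is the sense in which this missing data is less informative than the full allocation.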

36

Page 37: talk MCMC & SMC 2004

SAGE algorithm for Finite Mixture of Gaussians

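update for $(\mu_j, \tau_j^2)$, iteration i:

$$\mu_j^{(i)} = \frac{\lambda_j \alpha_j + \sum_{t=1}^{T} y_t\, p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}{\lambda_j + \sum_{t=1}^{T} p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}$$

$$\Sigma_j^{(i)} = \frac{C_j^{-1} + \lambda_j \left(\mu_j^{(i)} - \alpha_j\right)\left(\mu_j^{(i)} - \alpha_j\right)^{\mathsf{T}} + \sum_{t=1}^{T} \left(y_t - \mu_j^{(i)}\right)\left(y_t - \mu_j^{(i)}\right)^{\mathsf{T}} p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}{r_j - d - 1 + \lambda_j + \sum_{t=1}^{T} p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}$$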

37

Page 38: talk MCMC & SMC 2004

SAGE algorithm for Finite Mixture of Gaussians

update for πj , iteration i:

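$$\pi_j^{(i)} = \frac{1 - \sum_{l \neq j, l \neq k} \pi_l^{(i-1)}}{1 + \dfrac{\sum_{t=1}^{T} p(Z_{t,j,k} = k \mid y_t, X^{(i-1)}) + (\zeta_k - 1)}{\sum_{t=1}^{T} p(Z_{t,j,k} = j \mid y_t, X^{(i-1)}) + (\zeta_j - 1)}}$$

$$\pi_k^{(i)} = 1 - \pi_j^{(i)} - \sum_{l \neq j, l \neq k} \pi_l^{(i-1)}$$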

38

Page 39: talk MCMC & SMC 2004

SADA algorithm for Finite Mixture of Gaussians

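SADA algorithm, iteration i, sample $(\mu_j, \Sigma_j, \pi_j)$

• sample discrete latent variables $Z_{t,j,k}^{(i)} \sim p(\cdot \mid y_t, X^{(i-1)})$

• compute sufficient statistics

$$n_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}, \quad n_j^{(i)} \bar{y}_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t, \quad S_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t y_t^{\mathsf{T}}$$

• sample parameters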

39

Page 40: talk MCMC & SMC 2004

SADA algorithm for Finite Mixture of Gaussians

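sampling parameters, iteration i:

$$\Sigma_j^{-1(i)} \sim \mathcal{W}\left(r_j + n_j^{(i)},\ \bar{\Sigma}_j^{-1(i)}\right)$$

$$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\left(m_j^{(i)},\ \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\right)$$

$$\left(\pi_j^{(i)}, \pi_k^{(i)}\right) \sim \left(1 - \sum_{l \neq j, l \neq k} \pi_l^{(i-1)}\right) \mathcal{D}\left(n_j^{(i)} + \zeta_j,\ n_k^{(i)} + \zeta_k\right)$$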
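The weight update above rescales a two-component Dirichlet draw by the probability mass currently held by the pair (j, k). A possible sketch uses the standard construction of a Dirichlet variate from normalized Gamma draws; the function name, the 0-based indices and the counts used in the example are hypothetical.

```python
import random

def sample_pair_weights(weights, j, k, n_j, n_k, zeta_j, zeta_k, rng):
    """Redraw (pi_j, pi_k), keeping the other weights and the total mass fixed."""
    # Mass available to the pair: 1 minus the sum of the untouched weights
    mass = 1.0 - sum(w for l, w in enumerate(weights) if l not in (j, k))
    g_j = rng.gammavariate(n_j + zeta_j, 1.0)
    g_k = rng.gammavariate(n_k + zeta_k, 1.0)
    share = g_j / (g_j + g_k)   # (share, 1 - share) ~ Dirichlet(n_j + zeta_j, n_k + zeta_k)
    new = list(weights)
    new[j] = mass * share
    new[k] = mass * (1.0 - share)
    return new

weights = [0.2, 0.3, 0.5]
new_weights = sample_pair_weights(weights, 0, 1, n_j=12, n_k=30,
                                  zeta_j=1.0, zeta_k=1.0, rng=random.Random(6))
```

The untouched weights are returned unchanged, so the constraint that all weights sum to one is preserved at every iteration.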

40

Page 41: talk MCMC & SMC 2004

Numerical experiments

Mixture of s = 8 Gaussians in dimension d = 10

T = 100 samples

Parameters of components sampled from prior with hyperparameters

ζj = 1, αj = 0, λj = 0.01, rj = d + 1 and Cj = 0.01 I

100 iterations of EM and SAGE algorithms

41

Page 42: talk MCMC & SMC 2004

Numerical experiments - s = 8 d = 10

[Figure: log of posterior p.d.f. values vs. iterations (solid line: EM, dotted line: SAGE)]

42

Page 43: talk MCMC & SMC 2004

Numerical experiments - s = 5 d = 25

[Figure: log of posterior p.d.f. values vs. iterations (solid line: EM, dotted line: SAGE)]

43

Page 44: talk MCMC & SMC 2004

Simulations

Mixture of s = 5 Gaussians in dimension d = 10, T = 100 samples, parameters of components sampled from the prior with hyperparameters ζj = 1, αj = 0, λj = 0.01, rj = d + 1 and Cj = 0.01 I

200 iterations of EM and SAGE, repeated 50 times

5000 iterations of DA and SADA, repeated 10 times

Results:

• EM/SAGE: mean of log-posterior values at final iteration

• DA/SADA: mean of average log-posterior values over the last 1000 iterations

44

Page 45: talk MCMC & SMC 2004

Simulations Results

s     EM       SAGE     DA       SADA
5     -915.8   -671.5   -873.7   -886.0
6     -929.6   -603.2   -877.3   -886.7
7     -941.4   -576.5   -893.9   -906.9
8     -965.7   -559.2   -904.9   -875.0
9     -968.9   -503.0   -898.8   -882.5
10    -983.2   -478.1   -924.0   -906.6

Log-posterior values at the final iteration for EM/SAGE and average log-posterior values for DA/SADA

45

Page 46: talk MCMC & SMC 2004

Conclusion - Perspectives

• Sampling complex distributions: MCMC → Hastings-Metropolis,

Gibbs sampler

• Speed-up convergence of optimisation/simulation algorithms:

missing data, data augmentation, latent/extended variable

→ space alternating techniques, non-informative data spaces

• Applications in modeling/estimation: speech processing,

tomography, digital communication, . . .

46

Page 47: talk MCMC & SMC 2004

References - EM/SAGE/MCMC

• G. J. McLachlan and T. Krishnan, The EM Algorithm and

Extensions, Wiley Series in Probability and Statistics, 1997

• J. A. Fessler and A. O. Hero, Space-alternating generalized

expectation-maximization algorithm, IEEE Trans. Sig. Proc.,

42:2664–2677, 1994

• C. P. Robert and G. Casella, Monte Carlo Statistical Methods,

Springer-Verlag, 1999

• A. Doucet, T. Matsui and S. Senecal, Space Alternating Data

Augmentation, ICASSP’05, 2005

47

Page 48: talk MCMC & SMC 2004

Overview - MCMC and SMC methods

• Introduction to Markov chain Monte Carlo (MCMC)

Space alternating techniques

Estimation of Gaussian mixture models

• Introduction to Sequential Monte Carlo (SMC)

Fixed-lag sampling techniques

Recursive estimation of time series models

48

Page 49: talk MCMC & SMC 2004

Estimation of state space models

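$$x_t = f_t(x_{t-1}, u_t) \qquad y_t = g_t(x_t, v_t)$$

$$p(x_{0:t} \mid y_{1:t}) \;\to\; p(x_t \mid y_{1:t}) = \int p(x_{0:t} \mid y_{1:t})\, dx_{0:t-1}$$

distribution of $x_{0:t}$ ⇒ computation of estimate $\hat{x}_{0:t}$:

$$\hat{x}_{0:t} = \int x_{0:t}\, p(x_{0:t} \mid y_{1:t})\, dx_{0:t} \;\to\; E_{p(\cdot \mid y_{1:t})}\{f(x_{0:t})\}$$

$$\hat{x}_{0:t} = \arg\max_{x_{0:t}} p(x_{0:t} \mid y_{1:t})$$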

49

Page 50: talk MCMC & SMC 2004

Computation of the estimates

$p(x_{0:t} \mid y_{1:t})$ ⇒ multidimensional, non-standard distributions:

→ analytical, numerical approximations

→ integration, optimisation methods

⇒ Monte Carlo techniques

50

Page 51: talk MCMC & SMC 2004

Monte Carlo approach

compute estimates for distribution π(.) → samples x1, . . . , xN ∼ π

[Figure: density π(x) with samples x_1, …, x_N on the x axis]

⇒ distribution $\pi_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}$ approximates $\pi(\cdot)$

51

Page 52: talk MCMC & SMC 2004

Monte Carlo estimates

$$S_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \longrightarrow \int f(x)\, \pi(x)\, dx = E_\pi\{f(x)\}$$

$\arg\max_{(x_i)_{1 \le i \le N}} \pi_N(x_i)$ approximates $\arg\max_x \pi(x)$

⇒ sampling xi ∼ π difficult

→ importance sampling techniques

52

Page 53: talk MCMC & SMC 2004

Simulation Techniques

• Classical distributions: cumulative distribution function

→ transformation of a uniform random variable

• Non-standard distributions on $\mathbb{R}^n$, known up to a normalizing

constant → use of an instrumental distribution:

Accept-reject, importance sampling → sequential/recursive

⇒ SMC aka particle filtering, condensation algorithm

⇒ MCMC : distribution = fixed point of an operator, Markov

chain → simulation schemes: Hastings-Metropolis, Gibbs

sampling

53

Page 54: talk MCMC & SMC 2004

Importance Sampling

xi ∼ π → candidate/proposal distribution xi ∼ g

[Figure: target density π(x) and proposal density g(x), with samples x_1, …, x_N]

54

Page 55: talk MCMC & SMC 2004

Importance Sampling

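$x_i \sim g \neq \pi$ → $(x_i, w_i)$ weighted sample

⇒ weight $w_i = \dfrac{\pi(x_i)}{g(x_i)}$

[Figure: target density π(x) and proposal density g(x), with weighted samples x_1, …, x_N]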

55

Page 56: talk MCMC & SMC 2004

Estimation

importance sampling → computation of Monte Carlo estimates

e.g. expectations $E_\pi\{f(x)\}$:

$$\int f(x)\, \frac{\pi(x)}{g(x)}\, g(x)\, dx = \int f(x)\, \pi(x)\, dx$$

$$\sum_{i=1}^{N} w_i f(x_i) \to \int f(x)\, \pi(x)\, dx = E_\pi\{f(x)\}$$

dynamic model $(x_t, y_t)$ ⇒ recursive estimation $x_{0:t-1} \to x_{0:t}$

Monte Carlo techniques ⇒ sampling sequences $x_{0:t-1}^{(i)} \to x_{0:t}^{(i)}$
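The weighted estimate above can be sketched for a target known only up to a normalizing constant, in which case the weights are normalized to sum to one (self-normalized importance sampling). The target N(0, 1), the proposal N(0, 2²), the test function x² and the function name are assumptions chosen for illustration.

```python
import math
import random

def importance_estimate(f, log_target, sample_g, log_g, n, rng):
    """Self-normalized importance sampling estimate of E_pi{f(x)}."""
    xs = [sample_g(rng) for _ in range(n)]
    log_w = [log_target(x) - log_g(x) for x in xs]   # w_i ∝ pi(x_i) / g(x_i)
    shift = max(log_w)                               # stabilizes the exponentials
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    # Dividing by the total normalizes the weights so that they sum to one
    return sum(wi * f(x) for wi, x in zip(w, xs)) / total

# Target pi = N(0, 1) and proposal g = N(0, 2^2), both known up to a constant
estimate = importance_estimate(
    f=lambda x: x * x,
    log_target=lambda x: -0.5 * x * x,
    sample_g=lambda rng: rng.gauss(0.0, 2.0),
    log_g=lambda x: -0.125 * x * x,
    n=20_000,
    rng=random.Random(4),
)
```

The proposal is deliberately wider than the target, which keeps the weights bounded; a proposal with lighter tails than the target can produce weights with very large variance.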

56

Page 57: talk MCMC & SMC 2004

Sequential simulation

sampling sequences $x_{0:t}^{(i)} \sim \pi_t(x_{0:t})$ recursively:

[Figure: target distribution p(x, t) over state variable x and time t, with densities p(x, t1), p(x, t2) and samples x_t1 ∼ p(x_t1), x_t2 ∼ p(x_t2)]

57

Page 58: talk MCMC & SMC 2004

Sequential simulation: importance sampling

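samples $x_{0:t}^{(i)} \sim \pi_t(x_{0:t})$ approximated by weighted particles $(x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$

[Figure: target distribution p(x, t) over state x and time t, with densities p(x, t1) and p(x, t2) approximated by weighted particles]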

58

Page 59: talk MCMC & SMC 2004

Sequential importance sampling

diffusing particles $x_{0:t_1}^{(i)} \to x_{0:t_2}^{(i)}$

[Figure: target distribution p(x, t) over state x and time t, with particles moved from p(x, t1) to p(x, t2)]

⇒ sampling scheme $x_{0:t-1}^{(i)} \to x_{0:t}^{(i)}$

59

Page 60: talk MCMC & SMC 2004

Sequential importance sampling

updating weights $w_{t_1}^{(i)} \to w_{t_2}^{(i)}$

[Figure: target distribution p(x, t) over state x and time t, with particle weights updated from p(x, t1) to p(x, t2)]

⇒ updating rule $w_{t-1}^{(i)} \to w_t^{(i)}$

60

Page 61: talk MCMC & SMC 2004

Sequential Importance Sampling

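$x_{0:t} \sim \pi_t(x_{0:t}) \Rightarrow (x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$

Simulation scheme t − 1 → t:

• Sampling step $x_t^{(i)} \sim q_t(x_t \mid x_{0:t-1}^{(i)})$

• Updating weights

$$w_t^{(i)} \propto w_{t-1}^{(i)} \times \underbrace{\frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{(i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{(i)} \mid x_{0:t-1}^{(i)})}}_{\text{incremental weight (iw)}}$$

normalizing $\sum_{i=1}^{N} w_t^{(i)} = 1$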

61

Page 62: talk MCMC & SMC 2004

Sequential Importance Sampling

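$x_{0:t} \sim \pi_t(x_{0:t}) \Rightarrow (x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$

proposal + reweighting →

[Figure: weighted particles approximating π(x_t) along the x_t axis]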

62

Page 63: talk MCMC & SMC 2004

Sequential Importance Sampling

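proposal + reweighting → $\operatorname{var}\{(w_t^{(i)})_{1 \le i \le N}\}$ ↗ with t

[Figure: weighted particles under π(x_t), with the weight concentrated on a single particle]

→ $w_t^{(i)} \approx 0$ for all i except one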

63

Page 64: talk MCMC & SMC 2004

⇒ Resampling

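[Figure: weighted particles x_t^(1), …, x_t^(N) under π(x_t), with the number of copies (0, 1, 2, 3, …) drawn for each particle]

→ draw N particle paths from the set $(x_{0:t}^{(i)})_{1 \le i \le N}$ with probabilities $(w_t^{(i)})_{1 \le i \le N}$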

64

Page 65: talk MCMC & SMC 2004

Sequential Importance Sampling/Resampling

Simulation scheme t− 1 → t:

• Sampling step $x_t^{\prime(i)} \sim q_t(x_t' \mid x_{0:t-1}^{(i)})$

• Updating weights

$$w_t^{(i)} \propto w_{t-1}^{(i)} \times \frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{\prime(i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{\prime(i)} \mid x_{0:t-1}^{(i)})}$$

→ parallel computing

• ⇒ Resampling step: sample N paths from $(x_{0:t-1}^{(i)}, x_t^{\prime(i)})_{1 \le i \le N}$

→ particles interacting: computation at least O(N)
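The resampling step can be sketched with plain multinomial resampling, where each path is drawn with probability equal to its weight and the weights are then reset to 1/N. This is the simplest variant and an assumption of this sketch; stratified or deterministic schemes are alternatives. The function name and the toy paths are hypothetical.

```python
import random

def multinomial_resample(paths, weights, rng):
    """Draw N paths with replacement, path i being picked with probability w_i."""
    n = len(paths)
    picked = rng.choices(range(n), weights=weights, k=n)
    new_paths = [list(paths[i]) for i in picked]   # copies, so duplicates can evolve separately
    new_weights = [1.0 / n] * n                    # weights are reset to uniform
    return new_paths, new_weights

# Degenerate case: all the weight sits on the second path
new_paths, new_weights = multinomial_resample(
    [[0.0], [1.0], [2.0]], [0.0, 1.0, 0.0], random.Random(7)
)
```

Resampling is the only step where particles interact, since the draw depends on all the weights at once, whereas the sampling and weighting steps can be run in parallel across particles.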

65

Page 66: talk MCMC & SMC 2004

FV: Sequential simulation: SISR

Recursive estimation of state space models.

Approximation with particles, importance sampling.

[Figure: particle approximation of p_t(x), propagated from time t to time t+1]

Bootstrap, particle filtering

Gordon et al. 1993, Kitagawa 1996, Doucet et al. 2001

→ time series, tracking.

66

Page 67: talk MCMC & SMC 2004

FV: Sequential Importance Sampling/Resampling

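Samples $x_{0:t}^{(i)} \sim \pi_t(x_{0:t})$ approximated by weighted particles $(x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$

Simulation scheme t − 1 → t:

• Sampling step $x_t^{\prime(i)} \sim q_t(x_t' \mid x_{0:t-1}^{(i)})$

• Updating weights

$$w_t^{(i)} \propto w_{t-1}^{(i)} \times \underbrace{\frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{\prime(i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{\prime(i)} \mid x_{0:t-1}^{(i)})}}_{\text{incremental weight (iw)}}$$

• Resampling step: sample N paths from $(x_{0:t-1}^{(i)}, x_t^{\prime(i)})_{1 \le i \le N}$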

67

Page 68: talk MCMC & SMC 2004

SISR for recursive estimation of state space models

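$$x_t = f_t(x_{t-1}, u_t) \;\to\; p(x_t \mid x_{t-1})$$

$$y_t = g_t(x_t, v_t) \;\to\; p(y_t \mid x_t)$$

Usual SISR: Bootstrap filter (Gordon et al. 93, Kitagawa 96):

• Sampling step $x_t^{(i)} \sim p(x_t \mid x_{t-1}^{(i)})$

• Updating weights: $w_t^{(i)} \propto w_{t-1}^{(i)} \times \mathrm{iw}$, with incremental weight $\mathrm{iw} \propto p(y_t \mid x_t^{(i)})$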

• Stratified/Deterministic resampling

efficient, easy, fast for a wide class of models

tracking, time series → nonlinear non-Gaussian state spaces

68

Page 69: talk MCMC & SMC 2004

Improving simulation

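Optimal proposal distribution $q_t(x_t \mid x_{0:t-1}^{(i)})$

→ minimizing variance of incremental weight ($w_t^{(i)} \propto w_{t-1}^{(i)} \times \mathrm{iw}$)

$$\mathrm{iw} = \frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{(i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{(i)} \mid x_{0:t-1}^{(i)})}$$

⇒ 1-step ahead predictive:

$$\pi_t(x_t \mid x_{0:t-1}) = p(x_t \mid x_{t-1}, y_t)$$

⇒ incremental weight:

$$\mathrm{iw} \to \frac{\pi_t(x_{0:t-1})}{\pi_{t-1}(x_{0:t-1})} = \frac{p(x_{0:t-1} \mid y_{1:t})}{p(x_{0:t-1} \mid y_{1:t-1})} \propto p(y_t \mid x_{t-1}) = \int p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, dx_t$$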
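As a worked illustration, both quantities above are available in closed form when the state noise is additive Gaussian and the observation is linear Gaussian. The model below, with a generic transition mean $m(\cdot)$, is an assumption made for this example; it is a standard Gaussian conjugacy computation rather than a result stated on the slide.

```latex
% Assumed model: x_t = m(x_{t-1}) + u_t,  u_t ~ N(0, sigma_u^2)
%                y_t = x_t + v_t,         v_t ~ N(0, sigma_v^2)
\begin{align*}
p(x_t \mid x_{t-1}, y_t)
  &\propto \mathcal{N}\big(y_t;\, x_t,\, \sigma_v^2\big)\,
           \mathcal{N}\big(x_t;\, m(x_{t-1}),\, \sigma_u^2\big)
   = \mathcal{N}\big(x_t;\, \mu_t,\, \sigma^2\big), \\
\sigma^2 &= \frac{\sigma_u^2\, \sigma_v^2}{\sigma_u^2 + \sigma_v^2}, \qquad
\mu_t = \sigma^2 \left( \frac{m(x_{t-1})}{\sigma_u^2} + \frac{y_t}{\sigma_v^2} \right), \\
p(y_t \mid x_{t-1})
  &= \int p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, dx_t
   = \mathcal{N}\big(y_t;\, m(x_{t-1}),\, \sigma_u^2 + \sigma_v^2\big).
\end{align*}
```

In this case the incremental weight depends only on $x_{t-1}$ and $y_t$, so it can be computed before the new state is drawn.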

69

Page 70: talk MCMC & SMC 2004

Improving simulation

sampling/approximating predictive πt(xt|x0:t−1) may not be efficient

for diffusing particles: e.g. discrepancy (πt)t>0 high:

⇒ consider a block of variables xt−L:t for a fixed lag L

70

Page 71: talk MCMC & SMC 2004

Approaches using a block of variables

• discrete distributions, Meirovitch 1985

• auxiliary variables, Pitt and Shephard 1999

• reweighting before resampling, Wang et al. 2002

⇒ discrete distribution → analytical form for

$$x_t \sim \pi_{t+L}(x_t \mid x_{0:t-1}) = \int \pi_{t+L}(x_{t:t+L} \mid x_{0:t-1})\, dx_{t+1:t+L}$$

Meirovitch 1985: random walk in discrete space (growing a polymer)

→ complexity $\sharp X^L$ for lag L

71

Page 72: talk MCMC & SMC 2004

Reweighting + resampling

[Figure: reweighting and resampling of particles, with per-particle counts]

72

Page 73: talk MCMC & SMC 2004

Reweighting

→ need to sample xt by block

⇒ design a proposal/candidate distribution

73

Page 74: talk MCMC & SMC 2004

Sampling recursively a block of variables

[Figure: time axis with indices t−L, t−L+1, …, t−1, t]

$x_{t-L:t-1} \to x_{t-L+1:t}$: imputing $x_t$ and re-imputing $x_{t-L+1:t-1}$

74

Page 75: talk MCMC & SMC 2004

Sampling a block of variables

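[Figure: time axis t−L, t−L+1, …, t−1, t showing the paths x(0 : t−L) and x(0 : t−1), the previous block x(t−L+1 : t−1) and the new block x'(t−L+1 : t)]

Proposal/candidate distribution for the “natural” block:

$$(x_{0:t-L}, x'_{t-L+1:t}) \sim \int \pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})\, dx_{t-L+1:t-1}$$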

75

Page 76: talk MCMC & SMC 2004

Sampling a block of variables

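[Figure: time axis t−L, t−L+1, …, t−1, t showing the paths x(0 : t−L) and x(0 : t−1), the previous block x(t−L+1 : t−1) and the new block x'(t−L+1 : t)]

Candidate distribution for the extended block:

$(x_{0:t-L}, x'_{t-L+1:t}) \to (x_{0:t-L}, x_{t-L+1:t-1}, x'_{t-L+1:t})$:

$$(x_{0:t-1}, x'_{t-L+1:t}) \sim \pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})$$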

76

Page 77: talk MCMC & SMC 2004

Sampling a block of variables

Target distribution for the “natural” block $(x_{0:t-L}, x'_{t-L+1:t})$:

$$\pi_t(x_{0:t-L}, x'_{t-L+1:t})$$

⇒ auxiliary target distribution for the extended block $(x_{0:t-1}, x'_{t-L+1:t}) = (x_{0:t-L}, x_{t-L+1:t-1}, x'_{t-L+1:t})$:

$$\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})$$

with $r_t$ = any conditional distribution

⇒ proposal + target distributions → importance sampling

77

Page 78: talk MCMC & SMC 2004

Fixed-Lag Sequential Monte Carlo

A. Doucet and S. Senecal, 2004

Simulation scheme t− 1 → t (index (i) dropped):

• Sampling step

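$$x'_{t-L+1:t} \sim q_t(x'_{t-L+1:t} \mid x_{0:t-1})$$

• Updating weights

$$w_t \propto w_{t-1} \times \frac{\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})}{\pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})}$$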

• Resampling step

78

Page 79: talk MCMC & SMC 2004

Improving simulation

Optimal proposal distribution $q_t(x'_{t-L+1:t} \mid x_{0:t-1})$:

→ minimizing variance of incremental weight:

$$\mathrm{iw} = \frac{\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})}{\pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})}$$

⇒ $q_t$ = L-step ahead predictive

$$\pi_t(x'_{t-L+1:t} \mid x_{0:t-L}) = p(x'_{t-L+1:t} \mid x_{t-L}, y_{t-L+1:t})$$

For one variable: optimal $q_t$ = 1-step ahead predictive

$$\pi_t(x_t \mid x_{0:t-1}) = p(x_t \mid x_{t-1}, y_t)$$

79

Page 80: talk MCMC & SMC 2004

Improving simulation

Minimizing variance of incremental weight

⇒ optimal target distribution

$$\mathrm{iw} = \frac{\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})}{\pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})}$$

→ optimal conditional distribution $r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})$

⇒ $r_t$ = (L − 1)-step ahead predictive

$$\pi_{t-1}(x_{t-L+1:t-1} \mid x_{0:t-L}) = p(x_{t-L+1:t-1} \mid x_{t-L}, y_{t-L+1:t-1})$$

80

Page 81: talk MCMC & SMC 2004

Improving simulation

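For optimal $q_t$ and $r_t$, incremental weight:

$$\mathrm{iw} \to \frac{\pi_t(x_{0:t-L})}{\pi_{t-1}(x_{0:t-L})} = \frac{p(x_{0:t-L} \mid y_{1:t})}{p(x_{0:t-L} \mid y_{1:t-1})} \propto p(y_t \mid x_{t-L}, y_{t-L+1:t-1})$$

$$\propto \int p(y_t, x_{t-L+1:t} \mid x_{t-L}, y_{t-L+1:t-1})\, dx_{t-L+1:t}$$

SISR for one variable with optimal proposal $q_t$:

$$\mathrm{iw} \to \frac{\pi_t(x_{0:t-1})}{\pi_{t-1}(x_{0:t-1})} \propto p(y_t \mid x_{t-1}) = \int p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, dx_t$$

Bootstrap filter: $\mathrm{iw} \propto p(y_t \mid x_t)$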

81

Page 82: talk MCMC & SMC 2004

Example

Nonlinear state space model:

$$x_t = \alpha\left(x_{t-1} + \beta x_{t-1}^3\right) + u_t, \qquad x_0, u_t \sim \mathcal{N}(0, \sigma_u^2)$$

$$y_t = x_t + v_t, \qquad v_t \sim \mathcal{N}(0, \sigma_v^2)$$

Sequential Monte Carlo methods:

• Bootstrap filter, proposal p(xt|xt−1)

• SISR with optimal proposal p(xt|xt−1, yt)

• SISR for blocks with optimal proposal p(xt−L+1:t|xt−L, yt−L+1:t)

approximated by forward-backward recursions with KF/EKF

Parameters values α=0.9, β=0.4, σu=0.1 and σv=0.05

⇒ approximation of target distribution p(xt|y1:t)
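A minimal sketch of the first method in the list, the bootstrap filter, applied to this model with the parameter values given above. It resamples at every step with multinomial resampling, which is a simplification assumed here and not the ESS-triggered scheme used in the experiments; the function names, the short observation sequence and the seed are hypothetical.

```python
import math
import random

ALPHA, BETA, SIGMA_U, SIGMA_V = 0.9, 0.4, 0.1, 0.05   # values from the slide

def transition_mean(x):
    """Deterministic part of the state equation: alpha * (x + beta * x^3)."""
    return ALPHA * (x + BETA * x ** 3)

def bootstrap_filter(ys, n_particles, rng):
    """Return particle estimates of E{x_t | y_1:t}, resampling at every step."""
    particles = [rng.gauss(0.0, SIGMA_U) for _ in range(n_particles)]   # x_0 ~ N(0, sigma_u^2)
    estimates = []
    for y in ys:
        # Sampling step: propagate through the prior dynamics p(x_t | x_{t-1})
        particles = [transition_mean(x) + rng.gauss(0.0, SIGMA_U) for x in particles]
        # Weighting step: incremental weight ∝ p(y_t | x_t), on the log scale
        log_w = [-0.5 * ((y - x) / SIGMA_V) ** 2 for x in particles]
        shift = max(log_w)
        w = [math.exp(lw - shift) for lw in log_w]
        total = sum(w)
        estimates.append(sum(wi * x for wi, x in zip(w, particles)) / total)
        # Resampling step: multinomial draw proportional to the weights
        particles = rng.choices(particles, weights=w, k=n_particles)
    return estimates

# Hand-written observations, used only to exercise the filter
estimates = bootstrap_filter([0.0, 0.05, 0.1, 0.05, 0.0], 200, random.Random(5))
```

Because the observation noise (σv = 0.05) is smaller than the state noise (σu = 0.1), many propagated particles land far from the observation and receive negligible weight, which is the situation the optimal and block proposals aim to improve.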

82

Page 83: talk MCMC & SMC 2004

Approximation of the target distribution

⇒ Effective Sample Size:

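$$\mathrm{ESS} = \frac{1}{\sum_{i=1}^{N} \left[w_t^{(i)}\right]^2}$$

• $w^{(i)} = \frac{1}{N}$: ESS = N

• $w^{(i)} \approx 0$ ∀i except one: ESS = 1

[Figures: particle weights under π(x_t) in each of the two cases]

⇒ Resampling performed for $\mathrm{ESS} \le \frac{N}{2}, \frac{N}{10}$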
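The ESS criterion and the resampling rule can be sketched in a few lines. The function names and the default threshold fraction of 1/2 are illustrative choices; the formula is the one given above, applied after normalizing the weights.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum_i w_i^2, computed after normalizing the weights to sum to one."""
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)

def needs_resampling(weights, fraction=0.5):
    """True when ESS <= fraction * N (e.g. fraction = 1/2 or 1/10)."""
    return effective_sample_size(weights) <= fraction * len(weights)

uniform = [0.25, 0.25, 0.25, 0.25]      # ESS equals N = 4
degenerate = [1.0, 0.0, 0.0, 0.0]       # ESS equals 1
```

Resampling only when the ESS drops below the threshold avoids adding Monte Carlo noise at time steps where the weights are still well balanced.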

83

Page 84: talk MCMC & SMC 2004

Simulation results

algorithm    MSE      ESS    RS      CPU
Bootstrap    0.0021   36.8   70.3%   0.68
SISR         0.0019   65.8   19.2%   0.48
BSISR-KF     0.0018   72.3   0.9%    0.21
BSISR-EKF    0.0018   73.5   0.8%    0.24

N = 100 particles, 100 runs of particle filters for a single variable and for a block of L = 2 variables.

84

Page 85: talk MCMC & SMC 2004

Approximation of the target distribution

Resampling for ESS ≤ N/2, N = 100

[Figure: effective sample size (0 to 100) vs. time index (0 to 200)]

Approximated ESS vs. time index for the bootstrap filter (dotted), the SISR with optimal proposal for a single variable (dash-dotted) and approximated for a block of L = 2 variables (solid).

85

Page 86: talk MCMC & SMC 2004

Simulation results

block size L   N=100   N=500   N=1000   RS
2              74      370     715      0.9%
3              96      493     985      0.9%
4              99      496     989      1%
5              98      494     988      1%
10             97      486     972      2.5%

Approximated ESS averaged over 100 runs of particle filters for blocks of L variables, considering N particles.

86

Page 87: talk MCMC & SMC 2004

CPU time / number of particles N

Resampling for ESS ≤ N/2, 1,000 time steps

[Figure: CPU time (0 to 2.5) vs. number of particles N (100 to 1000)]

CPU time vs. N for bootstrap filter (black), SISR with optimal proposal for a single variable (blue) and approximated for a block of L = 2 variables (red), 100 realizations.

87

Page 88: talk MCMC & SMC 2004

Conclusions - Perspectives

⇒ Importance of proposal/candidate distribution for sequential

Monte Carlo simulation methods

Design of proposal:

→ information in observation, dynamic of the state variable:

$p(x_t \mid x_{t-1}) \longleftrightarrow p(x_t \mid y_t, x_{t-1}) \longleftrightarrow p(x'_{t-L+1:t} \mid x_{t-L}, y_{t-L+1:t})$

→ sampling a block/fixed lag of variables can be useful:

• for intermittent/informative observation, correlated variables

• applications ⇒ radar, navigation/positioning, tracking

88

Page 89: talk MCMC & SMC 2004

References - SISR, Sequential Monte Carlo

• N. Gordon, D. Salmond, and A. F. M. Smith, “Novel approach to

nonlinear and non-Gaussian Bayesian state estimation,”

Proceedings IEE-F, vol. 140, pp. 107–113, 1993.

• G. Kitagawa, “Monte Carlo filter and smoother for non-Gaussian

nonlinear state space models,” J. Comput. Graph. Statist., vol.

5, pp. 1–25, 1996.

• A. Doucet, N. de Freitas, and N. Gordon, Eds., Sequential Monte

Carlo methods in practice, Statistics for engineering and

information science. Springer, 2001.

89

Page 90: talk MCMC & SMC 2004

References - fixed-lag approaches

• H. Meirovitch, “Scanning method as an unbiased simulation

technique and its application to the study of self-avoiding

random walks,” Phys. Rev. A, vol. 32, pp. 3699–3708, 1985.

• M. K. Pitt and N. Shephard, “Filtering via simulation: auxiliary

particle filter,” J. Am. Stat. Assoc., vol. 94, pp. 590–599, 1999.

• X. Wang, R. Chen, and D. Guo, “Delayed-pilot sampling for

mixture Kalman filter with application in fading channels,” IEEE

Trans. Sig. Proc., vol. 50, pp. 241–253, 2002.

• A. Doucet and S. Senecal, “Fixed-Lag Sequential Monte Carlo”,

Proceedings of EUSIPCO2004.

90