Some recent advances in Markov chain and sequential Monte Carlo methods
Stephane Senecal
The Institute of Statistical Mathematics,
Research Organization of Information and Systems
15/12/2004
thanks to the Japan Society for the Promotion of Science
Estimation
[Diagram: input x → system S = F(., θ), with noise b → output y]
Information on (x, θ): distribution of probability
p(x, θ|y, F, prior ) ∝ p(y|x, θ, F, prior )× p(x, θ|prior )
⇒ Estimates (x, θ)
Estimates
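• Maximum a posteriori (MAP)
$(\hat{x}, \hat{\theta}) = \arg\max_{x,\theta} p(x, \theta \mid y, \text{prior})$
• Expectation: posterior mean $E\{x, \theta \mid y, \text{prior}\}$
$E_{p(\cdot \mid y, \text{prior})}\{f(x, \theta)\} = \int f(x, \theta)\, p(x, \theta \mid y, \text{prior})\, d(x, \theta)$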
Computation : asymptotic, numerical, stochastic methods
⇒ Monte Carlo simulation methods
Monte Carlo Estimates
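$x_1, \ldots, x_N \sim \pi$
⇒ $\pi_N = \frac{1}{N} \sum_{n=1}^{N} \delta_{x_n}$
$S_N(f) = \frac{1}{N} \sum_{n=1}^{N} f(x_n) \longrightarrow \int f(x)\, \pi(x)\, dx = E_\pi\{f\}$
$\hat{x}_{\max} = \arg\max_{x_n} \pi_N$ approximates $x_{\max} = \arg\max_x \pi(x)$
⇒ how to generate samples $x_\ell \sim \pi$?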
→ Markov chain and sequential Monte Carlo
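As a minimal sketch of the estimator S_N(f), assume purely for illustration that π is a standard normal and f(x) = x² (neither choice comes from the slides; the function name `monte_carlo_estimate` is hypothetical as well). The exact value E_π{f} = 1 gives something to compare against.

```python
import numpy as np

def monte_carlo_estimate(f, sampler, n_samples, rng):
    """Return S_N(f) = (1/N) * sum_n f(x_n) for x_n drawn by `sampler`."""
    samples = sampler(rng, n_samples)
    return float(np.mean(f(samples)))

rng = np.random.default_rng(0)
# Illustrative target (an assumption): pi = N(0, 1), so E_pi{x^2} = 1.
estimate = monte_carlo_estimate(
    f=lambda x: x ** 2,
    sampler=lambda rng, n: rng.standard_normal(n),
    n_samples=100_000,
    rng=rng,
)
print(estimate)
```

The sketch only works because direct sampling from π is available here; the rest of the talk addresses the case where it is not.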
Overview
• Introduction to Markov chain Monte Carlo (MCMC)
Space alternating techniques
Estimation of Gaussian mixture models
• Introduction to Sequential Monte Carlo (SMC)
Fixed-lag sampling techniques
Recursive estimation of time series models
Simulation Techniques
• Classical distributions: cumulative distribution function
→ transformation of a uniform random variable
• Non-standard distributions on $\mathbb{R}^n$, known up to a normalizing
constant → usage of instrumental distribution:
Accept-reject, importance sampling → sequential/recursive
⇒ SMC aka particle filtering, condensation algorithm
⇒ MCMC : distribution = fixed point of an operator
π = Kπ
→ simulation schemes with Markov chain: Hastings-Metropolis,
Gibbs sampling
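The first bullet above (a classical distribution obtained by transforming a uniform variable through its cumulative distribution function) can be sketched with the exponential distribution, which is chosen here only as an example: its CDF F(x) = 1 − exp(−rate·x) inverts in closed form. The function name and the rate value are illustrative assumptions.

```python
import numpy as np

def sample_exponential(rate, n_samples, rng):
    """Inverse-CDF sampling: x = F^{-1}(u) = -log(1 - u) / rate, u ~ U[0, 1)."""
    u = rng.uniform(0.0, 1.0, size=n_samples)
    return -np.log1p(-u) / rate

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, n_samples=100_000, rng=rng)
sample_mean = float(samples.mean())  # should be close to 1 / rate = 0.5
print(sample_mean)
```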
Markov Chain
Definition:
$X_n \mid X_{n-1}, X_{n-2}, \ldots, X_0 \overset{d}{=} X_n \mid X_{n-1}$
homogeneity: the distribution of $X_n \mid X_{n-1}$ does not depend on n
Realization:
X0 ∼ π0(x0)
p.d.f. of Xn|Xn−1 = transition kernel K(xn|xn−1)
Simulation of Markov chain
Convergence: Xn ∼ π asymptotically ?
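π-invariance: $\pi(\cdot) = K\pi(\cdot)$
$\int_A \pi(x)\, dx = \int_{y \in A} \int K(y \mid x)\, \pi(x)\, dx\, dy$
⇐ π-reversibility: $\Pr(A \to B) = \Pr(B \to A)$
$\int_{y \in B} \int_{x \in A} K(y \mid x)\, \pi(x)\, dx\, dy = \int_{y \in A} \int_{x \in B} K(y \mid x)\, \pi(x)\, dx\, dy$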
Construct kernels K(.|.) such that the chain is π-invariant
• Hastings-Metropolis algorithm
• Gibbs sampling
Hastings-Metropolis
Draw x from π(.)
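1. initialize $x_0 \sim \pi_0(x)$
2. iteration $\ell$:
• propose a candidate $x^\star$ for $x_{\ell+1}$: $x^\star \sim q(x \mid x_\ell)$
• accept it with probability $\alpha = \min\{1, r\}$, otherwise keep $x_{\ell+1} = x_\ell$, where
$r = \frac{\pi(x^\star)\, q(x_\ell \mid x^\star)}{q(x^\star \mid x_\ell)\, \pi(x_\ell)}$
3. $\ell \leftarrow \ell + 1$ and go to (2)
→ $\pi(x) K(y \mid x) = \pi(y) K(x \mid y)$, since
$\pi(x)\, q(y \mid x) \min\left\{1, \frac{\pi(y)\, q(x \mid y)}{q(y \mid x)\, \pi(x)}\right\} = \min\{\pi(x)\, q(y \mid x), \pi(y)\, q(x \mid y)\}$
Proposal choices: independent $q(x^\star \mid x_\ell) = q(x^\star)$, or random walk $q(x^\star \mid x_\ell) = q(|x^\star - x_\ell|)$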
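The iteration above can be sketched for the random-walk case, where the proposal is symmetric and q cancels in r. The target p(x) ∝ 1/(1 + x²) and the step size 0.1 are taken from the example that follows; the function names, the starting point x0 = 0 and the seed are assumptions made for this sketch.

```python
import numpy as np

def log_target(x):
    """Unnormalized log-density of p(x) proportional to 1 / (1 + x^2)."""
    return -np.log1p(x ** 2)

def random_walk_metropolis(log_pi, x0, step, n_iter, rng):
    """Hastings-Metropolis with a Gaussian random-walk proposal."""
    chain = np.empty(n_iter)
    x = x0
    n_accept = 0
    for ell in range(n_iter):
        candidate = x + step * rng.standard_normal()
        # Symmetric proposal: r reduces to pi(candidate) / pi(x).
        log_r = log_pi(candidate) - log_pi(x)
        if np.log(rng.uniform()) < log_r:
            x = candidate
            n_accept += 1
        # On rejection the chain repeats the current state.
        chain[ell] = x
    return chain, n_accept / n_iter

rng = np.random.default_rng(0)
chain, acceptance_rate = random_walk_metropolis(
    log_target, x0=0.0, step=0.1, n_iter=20_000, rng=rng
)
print(acceptance_rate)
```

Working with log-densities avoids underflow; a small step gives a high acceptance rate but slow exploration of the heavy tails, which is the trade-off the example slide illustrates.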
Example
sample $x \sim p(x) \propto \frac{1}{1+x^2}$, 20,000 iterations
Proposal $x^\star \sim \mathcal{N}(x_\ell, 0.1^2)$: acceptance rate = 97%
[Figure: trace of the chain over the 2 × 10^4 iterations and histogram of the samples]
Proposal $x^\star \sim \mathcal{U}_{[a,b]}$: acceptance rate = 26%
[Figure: trace of the chain over the 2 × 10^4 iterations and histogram of the samples]
Gibbs sampling algorithm
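Sample $x = (x_1, \ldots, x_p) \sim \pi(x_1, \ldots, x_p)$
1. initialize $x^{(0)} \sim \pi_0(x)$, $\ell = 0$
2. iteration $\ell$: sample
$x_1^{(\ell+1)} \sim \pi_1(x_1 \mid x_2^{(\ell)}, \ldots, x_p^{(\ell)})$
$x_2^{(\ell+1)} \sim \pi_2(x_2 \mid x_1^{(\ell+1)}, x_3^{(\ell)}, \ldots, x_p^{(\ell)})$
...
$x_p^{(\ell+1)} \sim \pi_p(x_p \mid x_1^{(\ell+1)}, \ldots, x_{p-1}^{(\ell+1)})$
3. $\ell \leftarrow \ell + 1$ and go to (2)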
→ no rejection; each conditional update is a π-reversible kernel
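Example: $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)$
$x_1^{(\ell+1)} \mid x_2^{(\ell)} \sim \mathcal{N}(\rho x_2^{(\ell)}, 1 - \rho^2)$
$x_2^{(\ell+1)} \mid x_1^{(\ell+1)} \sim \mathcal{N}(\rho x_1^{(\ell+1)}, 1 - \rho^2)$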
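The two conditional updates for the bivariate Gaussian can be sketched as follows. The values ρ = 0.5 and 5,000 samples match the figure caption; the function name, the starting point (0, 0) and the seed are assumptions for illustration.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, rng, x_init=(0.0, 0.0)):
    """Gibbs sampler for a zero-mean bivariate normal with correlation rho."""
    samples = np.empty((n_samples, 2))
    x1, x2 = x_init
    cond_std = np.sqrt(1.0 - rho ** 2)
    for ell in range(n_samples):
        # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x1 = rho * x2 + cond_std * rng.standard_normal()
        # x2 | x1 ~ N(rho * x1, 1 - rho^2), using the freshly updated x1
        x2 = rho * x1 + cond_std * rng.standard_normal()
        samples[ell] = (x1, x2)
    return samples

rng = np.random.default_rng(0)
samples = gibbs_bivariate_normal(rho=0.5, n_samples=5_000, rng=rng)
empirical_rho = float(np.corrcoef(samples.T)[0, 1])
print(empirical_rho)
```

Every draw is accepted, but successive samples are correlated; the closer |ρ| is to 1, the slower the chain moves.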
[Figure: scatter plot of 5,000 samples (x1, x2) for ρ = 0.5, and histograms of the samples x1^(ℓ) and x2^(ℓ)]
How to obtain a fast-converging simulation scheme?
→ Missing Data, Data Augmentation, Latent Variables
Idea: extend the sampling space x → (x, z) and the distribution
π(x) → π(x, z) with the constraint $\int \pi(x, z)\, dz = \pi(x)$
such that the Markov chain $(x^{(i)}, z^{(i)})$ converges to π(x, z) faster
• Optimization : Expectation-Maximization (EM) algorithm
• Simulation : Data Augmentation, Gibbs sampling
Efficient Data Augmentation Schemes
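Idea: construct a missing data space that is as uninformative as possible
[Figure: sampling x ∼ π(x) under the curve of π(x), versus sampling (x, z) ∼ π(x, z), with π̃(x, z) = constant on the extended space (x, z)]
The less information introduced in the missing data, the faster the convergence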
Efficient Data Augmentation Schemes
EM algorithm → Space Alternating Generalized EM
SAGE algorithm, Hero and Fessler 1994:
• update parameter components by subblocks
• specific missing data space associated with each subblock
• less informative complete data spaces → faster convergence
Efficient Data Augmentation Sampling Schemes
SAGE Idea → MCMC algorithm:
• sample parameter components by subblocks
• each subblock of parameters is sampled conditionally on a specific
missing data set
⇒ Space Alternating Data Augmentation (SADA)
A. Doucet, T. Matsui, S. Senecal 2004
• Optimization : EM algorithm → SAGE algorithm
• Simulation : DA, Gibbs sampling → SADA
Overview - Space alternating techniques
• → Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• Application to Finite Mixture of Gaussians
EM and SAGE Algorithms
Bayesian framework: obtain the MAP estimate of a random variable X
given a realization of Y = y
$x_{MAP} = \arg\max_x p(x \mid y)$
where
$p(x \mid y) \propto p(y \mid x)\, p(x)$
X is a random vector whose components are partitioned into n subsets
$X = X_{1:n} = (X_1, \ldots, X_n)$
Notation: $X_{-k} = X_{1:n} \setminus \{X_k\} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_n)$ and
$Z_{k:j} = (Z_k, Z_{k+1}, \ldots, Z_j)$
Expectation-Maximization (EM) algorithm
→ Maximize p (x|y)
⇒ introduce missing data Z with conditional distribution p (z|y, x)
EM, iteration i:
E-step: compute $Q(x, x^{(i-1)}) = \int \log(p(x, z \mid y))\, p(z \mid y, x^{(i-1)})\, dz$
M-step: set $x^{(i)} = \arg\max_x Q(x, x^{(i-1)})$
Space Alternating EM (SAGE) algorithm
→ Maximize p (x|y)
⇒ introduce n missing data sets $Z_{1:n}$, where each random variable/vector $Z_k$ is given a conditional distribution $p(z_k \mid y, x_{1:n})$
satisfying
$p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$
→ $z_k$ independent of $x_k$ conditionally on $x_{-k}$ and y
→ non-informative missing data space
Space Alternating EM (SAGE) algorithm
SAGE, iteration i:
• select index k ∈ {1, . . . , n}
e.g. components updated cyclically: k = (i mod n) + 1
• EM step for computing $x_k^{(i)}$:
set $x_k^{(i)} = \arg\max_{x_k} \int \log(p(x_{-k}^{(i-1)}, x_k, z_k \mid y))\, p(z_k \mid y, x^{(i-1)})\, dz_k$
and set $x_{-k}^{(i)} = x_{-k}^{(i-1)}$
DA and SADA Algorithms
Bayesian framework: the objective is not only to maximize p(x|y) but to
obtain random samples $\{X^{(i)}\}$ distributed according to p(x|y)
Based on the samples $\{X^{(i)}\}$, approximation of the MMSE estimate:
$\hat{x}_{MMSE} = \frac{1}{N} \sum_{i=1}^{N} X^{(i)} \to x_{MMSE} = \int x\, p(x \mid y)\, dx$
Also possible to compute posterior variances, confidence intervals or
predictive distributions.
Construction of efficient MCMC algorithms typically difficult
→ introduction of missing data
Data Augmentation, Gibbs sampling
→ Sample p (x|y)
⇒ introduce missing data Z with joint posterior distribution
p (x, z|y) = p (x|y) p (z|y, x)
Data Augmentation algorithm, iteration i given $X^{(i-1)}$:
• Sample $Z^{(i)} \sim p(\cdot \mid y, X^{(i-1)})$
• Sample $X^{(i)} \sim p(\cdot \mid y, Z^{(i)})$
Convergence of DA/Gibbs sampling algorithm
• The transition kernel associated with $\{X^{(i)}, Z^{(i)}\}$ admits p(x, z|y) as
invariant distribution
• Under weak additional assumptions
(irreducibility and aperiodicity),
the instantaneous distribution of $(X^{(i)}, Z^{(i)})$ converges towards
p(x, z|y) as i → +∞
Space Alternating Data Augmentation
→ Sample p (x|y)
⇒ introduce n missing data sets $Z_{1:n}$, where each random variable $Z_k$
is given a conditional distribution $p(z_k \mid y, x_{1:n})$ such that
$p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$
→ $z_k$ independent of $x_k$ conditionally on $x_{-k}$ and y
→ non-informative missing data space
Sampling of the joint posterior distribution:
$p(x_{1:n}, z_{1:n} \mid y) = p(x_{1:n} \mid y) \prod_{k=1}^{n} p(z_k \mid y, x_{1:n})$
Space Alternating Data Augmentation
SADA algorithm, iteration i
given $X_{1:n}^{(i-1)}$ and component index k:
• Sample $Z_k^{(i)} \sim p(\cdot \mid y, X^{(i-1)})$
• Sample $X_k^{(i)} \sim p(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)})$
• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$
Components updated cyclically k = (i mod n) + 1
Validity of SADA sampling algorithm
Generation of a Markov chain $\{X_{1:n}^{(i)}, Z_{1:n}^{(i)}\}$ with invariant distribution
$p(x_{1:n}, z_{1:n} \mid y)$
Idea: SADA equivalent to
• Sample $Z_k^{(i)}, Z_{-k} \sim p(\cdot \mid y, X_{1:n}^{(i-1)})$
• Sample $X_k^{(i)}, Z_{-k} \sim p(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)})$
• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$
Validity of SADA sampling algorithm
SADA → sample Zk and Xk but also Z−k at each iteration
sampling according to full conditional distributions p (z1:n|y, x1:n)
and p (x1:n|y, z1:n)
⇒ ad hoc invariant distribution p (x1:n, z1:n|y)
sampling of Z−k not necessary → discarded
Overview - Space alternating techniques
• Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• ⇒ Application to Finite Mixture of Gaussians
Finite Mixture of Gaussians
EM/DA algorithms routinely used to perform ML/MAP parameter
estimation/to sample the posterior distribution
Straightforward extensions to hidden Markov chains with Gaussian
observations
T i.i.d. observations $Y_{1:T}$ in $\mathbb{R}^d$, distributed according to a finite
mixture of s Gaussians
$Y_t \sim \sum_{j=1}^{s} \pi_j\, \mathcal{N}(\mu_j; \Sigma_j)$
Bayesian Estimation
Parameters
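$X = \{(\mu_j, \Sigma_j, \pi_j);\ j = 1, \ldots, s\}$
unknown, random, distributed according to conjugate prior distributions
$\mu_j \mid \Sigma_j \sim \mathcal{N}(\alpha_j, \Sigma_j / \lambda_j)$
$\Sigma_j^{-1} \sim \mathcal{W}(r_j, C_j)$
$(\pi_1, \ldots, \pi_s) \sim \mathcal{D}(\zeta_1, \ldots, \zeta_s)$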
Bayesian Estimation
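$\Sigma^{-1} \sim \mathcal{W}(r, C)$: Wishart distribution, p.d.f. proportional to
$|\Sigma^{-1}|^{\frac{1}{2}(r-d-1)} \exp\left(-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} C^{-1})\right)$
$(\pi_1, \ldots, \pi_s) \sim \mathcal{D}(\zeta_1, \ldots, \zeta_s)$: Dirichlet distribution restricted to the
simplex, p.d.f. proportional to $\prod_{k=1}^{s} \pi_k^{\zeta_k - 1}$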
Hyperparameters {(αj , λj , rj , Cj , ζj) ; j = 1, . . . , s} assumed fixed but
could be estimated from data in a hierarchical Bayes model
Missing Data for Finite Mixture of Gaussians
EM/DA introduce the i.i.d. missing data Zt ∈ {1, . . . , s} such that
Yt|Zt = j ∼ N (µj ; Σj)
Pr (Zt = j) = πj
Gibbs sampling algorithm, iteration i:
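• sample discrete latent variables $Z_t^{(i)} \sim p(\cdot \mid y_t, X^{(i-1)})$
• compute sufficient statistics $n_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}$, $n_j^{(i)} \bar{y}_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t$ and $\bar{S}_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t y_t^T$
• sample parameters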
Gibbs sampling for Finite Mixture of Gaussians
sampling parameters, iteration i:
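$\Sigma_j^{-1(i)} \sim \mathcal{W}\left(r_j + n_j^{(i)}, \bar{\Sigma}_j^{-1(i)}\right)$
$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\left(m_j^{(i)}, \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\right)$
$(\pi_1^{(i)}, \ldots, \pi_s^{(i)}) \sim \mathcal{D}\left(n_1^{(i)} + \zeta_1, \ldots, n_s^{(i)} + \zeta_s\right)$
$m_j^{(i)} = \frac{\lambda_j \alpha_j + n_j^{(i)} \bar{y}_j^{(i)}}{\lambda_j + n_j^{(i)}}$
$\bar{\Sigma}_j^{(i)} = C_j^{-1} + \lambda_j \alpha_j \alpha_j^T + \bar{S}_j^{(i)} - (\lambda_j + n_j^{(i)})\, m_j^{(i)} m_j^{(i)T}$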
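The Gibbs iteration for the mixture can be sketched in the univariate case d = 1, which is a simplification made here for brevity and not the setting of the experiments. In one dimension the Wishart W(r, C) on the precision reduces to a Gamma distribution with shape r/2 and scale 2C. The function name, the synthetic two-component data, the initialization and the hyperparameter values (in particular C = 1) are all assumptions for illustration.

```python
import numpy as np

def gibbs_mixture_1d(y, s, n_iter, rng, alpha=0.0, lam=0.01, r=2.0, c=1.0, zeta=1.0):
    """Gibbs sampler for a univariate mixture of s Gaussians; returns the mu trace."""
    T = y.size
    # Illustrative initialization: spread the means over the data range.
    mu = np.linspace(y.min(), y.max(), s)
    var = np.full(s, y.var())
    weights = np.full(s, 1.0 / s)
    mu_trace = np.empty((n_iter, s))
    for i in range(n_iter):
        # Sample latent labels Z_t ~ p(. | y_t, X^(i-1)).
        log_p = (np.log(weights) - 0.5 * np.log(var)
                 - 0.5 * (y[:, None] - mu) ** 2 / var)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        u = rng.uniform(size=(T, 1))
        z = np.minimum((p.cumsum(axis=1) < u).sum(axis=1), s - 1)
        # Sufficient statistics: n_j, n_j * ybar_j and S_j.
        one_hot = np.eye(s)[z]
        n = one_hot.sum(axis=0)
        sum_y = one_hot.T @ y
        sum_yy = one_hot.T @ (y ** 2)
        # Conjugate updates (scalar versions of the formulas above).
        m = (lam * alpha + sum_y) / (lam + n)
        scale_inv = 1.0 / c + lam * alpha ** 2 + sum_yy - (lam + n) * m ** 2
        precision = rng.gamma(shape=0.5 * (r + n), scale=2.0 / scale_inv)
        var = 1.0 / precision
        mu = m + np.sqrt(var / (lam + n)) * rng.standard_normal(s)
        weights = rng.dirichlet(n + zeta)
        mu_trace[i] = mu
    return mu_trace

rng = np.random.default_rng(0)
# Synthetic data (an assumption): two well-separated components at -3 and +3.
y = np.concatenate([rng.normal(-3.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
mu_trace = gibbs_mixture_1d(y, s=2, n_iter=500, rng=rng)
posterior_means = np.sort(mu_trace[250:].mean(axis=0))
print(posterior_means)
```

Sorting the averaged means sidesteps label switching, which is unlikely with components this far apart but would need proper handling in general.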
Less Informative Missing Data
update only $(\mu_j, \tau_j^2)$, with $(\mu_{-j}, \tau_{-j}^2)$ fixed
→ binary missing data $Z_{t,j} \in \{0, j\}$ such that $\Pr(Z_{t,j} = j) = \pi_j$
variable $Z_{t,j}$ = “observation coming from component j or not”, less
informative than knowing “from which particular component
observation is derived”
constraint $\sum_{j=1}^{s} \pi_j = 1$ ⇒ cannot update $\pi_j$, use of standard EM
approach for sampling the weights
Less Informative Missing Data
→ updating jointly the parameters of two components j and k
(A. Doucet, T. Matsui and S. Senecal, 2004)
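→ missing data $Z_{t,j,k} \in \{0, j, k\}$ such that
$\Pr(Z_{t,j,k} = j) = \pi_j$, $\Pr(Z_{t,j,k} = k) = \pi_k$
and
$Y_t \mid Z_{t,j,k} = j \sim \mathcal{N}(\mu_j; \Sigma_j)$
$Y_t \mid Z_{t,j,k} = k \sim \mathcal{N}(\mu_k; \Sigma_k)$
$Y_t \mid Z_{t,j,k} = 0 \sim \frac{\sum_{l \neq j, l \neq k} \pi_l\, \mathcal{N}(\mu_l; \Sigma_l)}{\sum_{l \neq j, l \neq k} \pi_l}$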
SAGE algorithm for Finite Mixture of Gaussians
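update for $(\mu_j, \tau_j^2)$, iteration i:
$\mu_j^{(i)} = \frac{\lambda_j \alpha_j + \sum_{t=1}^{T} y_t\, p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}{\lambda_j + \sum_{t=1}^{T} p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}$
$\Sigma_j^{(i)} = \frac{C_j^{-1} + \lambda_j (\mu_j^{(i)} - \alpha_j)(\mu_j^{(i)} - \alpha_j)^T + \sum_{t=1}^{T} (y_t - \mu_j^{(i)})(y_t - \mu_j^{(i)})^T\, p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}{r_j - d - 1 + \lambda_j + \sum_{t=1}^{T} p(Z_{t,j,k} = j \mid y_t, X^{(i-1)})}$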
SAGE algorithm for Finite Mixture of Gaussians
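update for $\pi_j$, iteration i:
$\pi_j^{(i)} = \frac{1 - \sum_{l \neq j, l \neq k} \pi_l^{(i-1)}}{1 + \frac{\sum_{t=1}^{T} p(Z_{t,j,k} = k \mid y_t, X^{(i-1)}) + (\zeta_k - 1)}{\sum_{t=1}^{T} p(Z_{t,j,k} = j \mid y_t, X^{(i-1)}) + (\zeta_j - 1)}}$
$\pi_k^{(i)} = 1 - \pi_j^{(i)} - \sum_{l \neq j, l \neq k} \pi_l^{(i-1)}$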
SADA algorithm for Finite Mixture of Gaussians
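SADA algorithm, iteration i, sample $(\mu_j, \Sigma_j, \pi_j)$
• sample discrete latent variables
$Z_{t,j,k}^{(i)} \sim p(\cdot \mid y_t, X^{(i-1)})$
• compute sufficient statistics $n_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}$ and
$n_j^{(i)} \bar{y}_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t$, $\bar{S}_j^{(i)} \triangleq \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t y_t^T$
• sample parameters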
SADA algorithm for Finite Mixture of Gaussians
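sampling parameters, iteration i:
$\Sigma_j^{-1(i)} \sim \mathcal{W}\left(r_j + n_j^{(i)}, \bar{\Sigma}_j^{-1(i)}\right)$
$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\left(m_j^{(i)}, \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\right)$
$(\pi_j^{(i)}, \pi_k^{(i)}) \sim \left(1 - \sum_{l \neq j, l \neq k} \pi_l^{(i-1)}\right) \mathcal{D}\left(n_j^{(i)} + \zeta_j, n_k^{(i)} + \zeta_k\right)$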
Numerical experiments
Mixture of s = 8 Gaussians of dimension d = 10
T = 100 samples
Parameters of components sampled from prior with parameters
ζj = 1, αj = 0, λj = 0.01, rj = d+ 1 and Cj = 0.01I
100 iterations of EM and SAGE algorithms
Numerical experiments - s = 8 d = 10
[Figure: log of posterior p.d.f. values vs. iterations (solid: EM, dotted: SAGE)]
Numerical experiments - s = 5 d = 25
[Figure: log of posterior p.d.f. values vs. iterations (solid: EM, dotted: SAGE)]
Simulations
Mixture of s = 5 Gaussians of dimension d = 10, T = 100 samples; parameters
of components sampled from the prior with parameters ζj = 1, αj = 0,
λj = 0.01, rj = d + 1 and Cj = 0.01 I
200 iterations of EM and SAGE, repeated 50 times
5000 iterations of DA and SADA, repeated 10 times
Results:
• EM/SAGE: mean of log-posterior values at final iteration
• DA/SADA: mean of average log-posterior values over the last 1000
iterations
Simulations Results
s     EM       SAGE     DA       SADA
5     -915.8   -671.5   -873.7   -886.0
6     -929.6   -603.2   -877.3   -886.7
7     -941.4   -576.5   -893.9   -906.9
8     -965.7   -559.2   -904.9   -875.0
9     -968.9   -503.0   -898.8   -882.5
10    -983.2   -478.1   -924.0   -906.6
Log-posterior values for final iteration EM/SAGE
and average log-posterior values for DA/SADA
Conclusion - Perspectives
• Sampling complex distributions: MCMC → Hastings-Metropolis,
Gibbs sampler
• Speed-up convergence of optimisation/simulation algorithms:
missing data, data augmentation, latent/extended variable
→ space alternating techniques, non-informative data spaces
• Applications in modeling/estimation: speech processing,
tomography, digital communication, . . .
References - EM/SAGE/MCMC
• G. J. McLachlan and T. Krishnan, The EM Algorithm and
Extensions, Wiley Series in Probability and Statistics, 1997
• J. A. Fessler and A. O. Hero, Space-alternating generalized
expectation-maximization algorithm, IEEE Trans. Sig. Proc.,
42:2664–2677, 1994
• C. P. Robert and G. Casella, Monte Carlo Statistical Methods,
Springer-Verlag, 1999
• A. Doucet, T. Matsui and S. Senecal, Space Alternating Data
Augmentation, ICASSP’05, 2005
Overview - MCMC and SMC methods
• Introduction to Markov chain Monte Carlo (MCMC)
Space alternating techniques
Estimation of Gaussian mixture models
• Introduction to Sequential Monte Carlo (SMC)
Fixed-lag sampling techniques
Recursive estimation of time series models
Estimation of state space models
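$x_t = f_t(x_{t-1}, u_t)$, $y_t = g_t(x_t, v_t)$
$p(x_{0:t} \mid y_{1:t}) \to p(x_t \mid y_{1:t}) = \int p(x_{0:t} \mid y_{1:t})\, dx_{0:t-1}$
distribution of $x_{0:t}$ ⇒ computation of the estimate $\hat{x}_{0:t}$:
$\hat{x}_{0:t} = \int x_{0:t}\, p(x_{0:t} \mid y_{1:t})\, dx_{0:t} \to E_{p(\cdot \mid y_{1:t})}\{f(x_{0:t})\}$
$\hat{x}_{0:t} = \arg\max_{x_{0:t}} p(x_{0:t} \mid y_{1:t})$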
Computation of the estimates
p(x0:t|y1:t) ⇒ multidimensional, non-standard distributions:
→ analytical, numerical approximations
→ integration, optimisation methods
⇒ Monte Carlo techniques
Monte Carlo approach
compute estimates for distribution π(.) → samples x1, . . . , xN ∼ π
[Figure: density π(x) with samples x_1, …, x_N on the x axis]
⇒ distribution $\pi_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}$ approximates π(.)
Monte Carlo estimates
$S_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \longrightarrow \int f(x)\, \pi(x)\, dx = E_\pi\{f(x)\}$
$\arg\max_{(x_i)_{1 \le i \le N}} \pi_N(x_i)$ approximates $\arg\max_x \pi(x)$
⇒ sampling xi ∼ π difficult
→ importance sampling techniques
Simulation Techniques
• Classical distributions: cumulative distribution function
→ transformation of a uniform random variable
• Non-standard distributions on $\mathbb{R}^n$, known up to a normalizing
constant → usage of instrumental distribution:
Accept-reject, importance sampling → sequential/recursive
⇒ SMC aka particle filtering, condensation algorithm
⇒ MCMC : distribution = fixed point of an operator, Markov
chain → simulation schemes: Hastings-Metropolis, Gibbs
sampling
Importance Sampling
xi ∼ π → candidate/proposal distribution xi ∼ g
[Figure: target density π(x) and proposal density g(x), with samples x_1, …, x_N on the x axis]
Importance Sampling
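$x_i \sim g \neq \pi$ → $(x_i, w_i)$ weighted sample
⇒ weight $w_i = \frac{\pi(x_i)}{g(x_i)}$
[Figure: target density π(x) and proposal density g(x), with weighted samples x_1, …, x_N on the x axis]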
Estimation
importance sampling → computation of Monte Carlo estimates
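e.g. expectations $E_\pi\{f(x)\}$:
$\int f(x)\, \frac{\pi(x)}{g(x)}\, g(x)\, dx = \int f(x)\, \pi(x)\, dx$
$\sum_{i=1}^{N} w_i f(x_i) \to \int f(x)\, \pi(x)\, dx = E_\pi\{f(x)\}$ (weights normalized so that $\sum_{i=1}^{N} w_i = 1$)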
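A minimal sketch of this weighted estimate, in its self-normalized form so that π and g need only be known up to constants. The target N(0, 1), the proposal N(0, 2²) and f(x) = x² are assumptions chosen for illustration, as is the function name.

```python
import numpy as np

def importance_sampling_estimate(f, log_target, log_proposal, sampler, n_samples, rng):
    """Self-normalized importance sampling estimate of E_pi{f(x)}."""
    x = sampler(rng, n_samples)
    # Unnormalized log-weights log(pi(x_i) / g(x_i)); constants cancel below.
    log_w = log_target(x) - log_proposal(x)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return float(np.sum(w * f(x))), w

rng = np.random.default_rng(0)
estimate, weights = importance_sampling_estimate(
    f=lambda x: x ** 2,
    log_target=lambda x: -0.5 * x ** 2,              # N(0, 1) up to a constant
    log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,    # N(0, 4) up to a constant
    sampler=lambda rng, n: 2.0 * rng.standard_normal(n),
    n_samples=100_000,
    rng=rng,
)
print(estimate)  # E_pi{x^2} = 1 for this target
```

A proposal with heavier tails than the target, as here, keeps the weights bounded; the reverse choice can give estimates with very large variance.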
dynamic model $(x_t, y_t)$ ⇒ recursive estimation $x_{0:t-1} \to x_{0:t}$
Monte Carlo techniques ⇒ sampling sequences $x_{0:t-1}^{(i)} \to x_{0:t}^{(i)}$
Sequential simulation
sampling sequences $x_{0:t}^{(i)} \sim \pi_t(x_{0:t})$ recursively:
[Figure: target distribution p(x, t) over state variable x and time t, with densities p(x, t1) and p(x, t2) and samples x_t1 ∼ p(x_t1), x_t2 ∼ p(x_t2)]
Sequential simulation: importance sampling
samples $x_{0:t}^{(i)} \sim \pi_t(x_{0:t})$ approximated by weighted particles
$(x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$
[Figure: target distribution p(x, t) at times t1 and t2, approximated by weighted particles]
Sequential importance sampling
diffusing particles $x_{0:t_1}^{(i)} \to x_{0:t_2}^{(i)}$
[Figure: particles diffusing from the target distribution p(x, t1) to p(x, t2)]
⇒ sampling scheme $x_{0:t-1}^{(i)} \to x_{0:t}^{(i)}$
Sequential importance sampling
updating weights $w_{t_1}^{(i)} \to w_{t_2}^{(i)}$
[Figure: particle weights updated from the target distribution p(x, t1) to p(x, t2)]
⇒ updating rule $w_{t-1}^{(i)} \to w_t^{(i)}$
Sequential Importance Sampling
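$x_{0:t} \sim \pi_t(x_{0:t})$ ⇒ $(x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$
Simulation scheme t − 1 → t:
• Sampling step $x_t^{(i)} \sim q_t(x_t \mid x_{0:t-1}^{(i)})$
• Updating weights
$w_t^{(i)} \propto w_{t-1}^{(i)} \times \underbrace{\frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{(i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{(i)} \mid x_{0:t-1}^{(i)})}}_{\text{incremental weight (iw)}}$
normalizing $\sum_{i=1}^{N} w_t^{(i)} = 1$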
Sequential Importance Sampling
$x_{0:t} \sim \pi_t(x_{0:t})$ ⇒ $(x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$
proposal + reweighting →
[Figure: weighted particles approximating π(x_t)]
Sequential Importance Sampling
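proposal + reweighting → $\operatorname{var}\{(w_t^{(i)})_{1 \le i \le N}\}$ ↗ with t
[Figure: weighted particles under π(x_t), with the weight concentrated on a single particle]
→ $w_t^{(i)} \approx 0$ for all i except one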
⇒ Resampling
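[Figure: particles x_t^(1), ..., x_t^(N) under π(x_t), each annotated with its number of offspring after resampling (0, 1, 2, 3, ...)]
→ draw N particle paths from the set $(x_{0:t}^{(i)})_{1 \le i \le N}$
with probabilities $(w_t^{(i)})_{1 \le i \le N}$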
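The resampling step can be sketched with plain multinomial resampling, which is one of several possible schemes (the slides later mention stratified and deterministic variants). The function name and the toy particle values are assumptions for illustration.

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Draw N particles with replacement, with probabilities given by the weights."""
    n = len(weights)
    indices = rng.choice(n, size=n, replace=True, p=weights)
    # After resampling, all particles carry the same weight 1 / N.
    return particles[indices], np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
particles = np.array([-1.0, 0.0, 1.0, 2.0])
weights = np.array([0.0, 0.1, 0.2, 0.7])   # normalized, one particle has zero weight
new_particles, new_weights = multinomial_resample(particles, weights, rng)
print(new_particles, new_weights)
```

Particles with negligible weight tend to disappear and heavy particles are duplicated, which is what counters the weight degeneracy described on the previous slide.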
Sequential Importance Sampling/Resampling
Simulation scheme t− 1 → t:
• Sampling step $x_t^{\prime (i)} \sim q_t(x_t^{\prime} \mid x_{0:t-1}^{(i)})$
• Updating weights $w_t^{(i)} \propto w_{t-1}^{(i)} \times \frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{\prime (i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{\prime (i)} \mid x_{0:t-1}^{(i)})}$
→ parallel computing
• ⇒ Resampling step: sample N paths from $(x_{0:t-1}^{(i)}, x_t^{\prime (i)})_{1 \le i \le N}$
→ particles interacting: computation at least O(N)
FV: Sequential simulation: SISR
Recursive estimation of state space models.
Approximation with particles, importance sampling.
[Figure: distributions p_t(x) over state x at times t and t+1, approximated with particles]
Bootstrap, particle filtering
Gordon et al. 1993, Kitagawa 1996, Doucet et al. 2001
→ time series, tracking.
FV: Sequential Importance Sampling/Resampling
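Samples $x_{0:t}^{(i)} \sim \pi_t(x_{0:t})$ approximated by
weighted particles $(x_{0:t}^{(i)}, w_t^{(i)})_{1 \le i \le N}$
Simulation scheme t − 1 → t:
• Sampling step $x_t^{\prime (i)} \sim q_t(x_t^{\prime} \mid x_{0:t-1}^{(i)})$
• Updating weights $w_t^{(i)} \propto w_{t-1}^{(i)} \times \underbrace{\frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{\prime (i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{\prime (i)} \mid x_{0:t-1}^{(i)})}}_{\text{incremental weight (iw)}}$
• Resampling step: sample N paths from $(x_{0:t-1}^{(i)}, x_t^{\prime (i)})_{1 \le i \le N}$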
SISR for recursive estimation of state space models
xt = ft(xt−1, ut) → p(xt|xt−1)
yt = gt(xt, vt) → p(yt|xt)
Usual SISR: Bootstrap filter (Gordon et al. 93, Kitagawa 96):
• Sampling step $x_t^{(i)} \sim p(x_t \mid x_{t-1}^{(i)})$
• Updating weights: $w_t^{(i)} \propto w_{t-1}^{(i)} \times iw$, with incremental weight
$iw \propto p(y_t \mid x_t^{(i)})$
• Stratified/Deterministic resampling
efficient, easy, fast for a wide class of models
tracking, time series → nonlinear non-Gaussian state spaces
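The bootstrap filter can be sketched as below, with stratified resampling applied at every time step so that the previous weights are uniform and drop out of the update. The linear-Gaussian test model x_t = 0.9 x_{t−1} + u_t, y_t = x_t + v_t, the noise levels, the particle count and the function names are assumptions made for this sketch and are not the model studied later in the talk.

```python
import numpy as np

def bootstrap_filter(y, n_particles, transition, log_likelihood, init, rng):
    """Bootstrap particle filter returning estimates of E{x_t | y_1:t}."""
    T = len(y)
    x = init(rng, n_particles)
    estimates = np.empty(T)
    for t in range(T):
        x = transition(rng, x)               # sample from p(x_t | x_{t-1})
        log_w = log_likelihood(y[t], x)      # incremental weight p(y_t | x_t)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates[t] = np.sum(w * x)
        # Stratified resampling: one uniform draw per stratum [k/N, (k+1)/N).
        u = (np.arange(n_particles) + rng.uniform(size=n_particles)) / n_particles
        idx = np.minimum(np.searchsorted(np.cumsum(w), u), n_particles - 1)
        x = x[idx]
    return estimates

# Assumed test model: x_t = a x_{t-1} + u_t, y_t = x_t + v_t.
a, sigma_u, sigma_v = 0.9, 0.1, 0.05
rng = np.random.default_rng(0)
T = 100
x_true = np.empty(T)
y = np.empty(T)
x_prev = sigma_u * rng.standard_normal()
for t in range(T):
    x_prev = a * x_prev + sigma_u * rng.standard_normal()
    x_true[t] = x_prev
    y[t] = x_prev + sigma_v * rng.standard_normal()

estimates = bootstrap_filter(
    y, n_particles=500,
    transition=lambda rng, x: a * x + sigma_u * rng.standard_normal(x.size),
    log_likelihood=lambda y_t, x: -0.5 * ((y_t - x) / sigma_v) ** 2,
    init=lambda rng, n: sigma_u * rng.standard_normal(n),
    rng=rng,
)
rmse = float(np.sqrt(np.mean((estimates - x_true) ** 2)))
print(rmse)
```

Only the ability to simulate the transition and evaluate the likelihood is required, which is why this filter applies to a wide class of models.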
Improving simulation
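Optimal proposal distribution $q_t(x_t \mid x_{0:t-1}^{(i)})$
→ minimizing the variance of the incremental weight ($w_t^{(i)} \propto w_{t-1}^{(i)} \times iw$)
$iw = \frac{\pi_t(x_{0:t-1}^{(i)}, x_t^{(i)})}{\pi_{t-1}(x_{0:t-1}^{(i)})\, q_t(x_t^{(i)} \mid x_{0:t-1}^{(i)})}$
⇒ 1-step ahead predictive:
$\pi_t(x_t \mid x_{0:t-1}) = p(x_t \mid x_{t-1}, y_t)$
⇒ incremental weight:
$iw \to \frac{\pi_t(x_{0:t-1})}{\pi_{t-1}(x_{0:t-1})} = \frac{p(x_{0:t-1} \mid y_{1:t})}{p(x_{0:t-1} \mid y_{1:t-1})} \propto p(y_t \mid x_{t-1}) = \int p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, dx_t$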
Improving simulation
sampling from or approximating the predictive $\pi_t(x_t \mid x_{0:t-1})$ may not be efficient
for diffusing particles, e.g. when the discrepancy between successive distributions $(\pi_t)_{t>0}$ is high:
⇒ consider a block of variables $x_{t-L:t}$ for a fixed lag L
Approaches using a block of variables
• discrete distributions, Meirovitch 1985
• auxiliary variables, Pitt and Shephard 1999
• reweighting before resampling, Wang et al. 2002
⇒ discrete distribution → analytical form for
$x_t \sim \pi_{t+L}(x_t \mid x_{0:t-1}) = \int \pi_{t+L}(x_{t:t+L} \mid x_{0:t-1})\, dx_{t+1:t+L}$
Meirovitch 1985: random walk in discrete space (growing a polymer)
→ complexity $\sharp X^L$ for lag L
Reweighting + resampling
[Figure: reweighting followed by resampling, with particles annotated by their number of offspring]
Reweighting
→ need to sample xt by block
⇒ design a proposal/candidate distribution
Sampling recursively a block of variables
[Diagram: time axis marked t−L, t−L+1, ..., t−1, t]
$x_{t-L:t-1} \to x_{t-L+1:t}$: imputing $x_t$ and re-imputing $x_{t-L+1:t-1}$
Sampling a block of variables
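[Diagram: time axis marked t−L, t−L+1, ..., t−1, t, showing the paths x_{0:t−L}, x_{0:t−1}, x_{t−L+1:t−1} and the new block x′_{t−L+1:t}]
Proposal/candidate distribution for the “natural” block:
$(x_{0:t-L}, x'_{t-L+1:t}) \sim \int \pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})\, dx_{t-L+1:t-1}$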
Sampling a block of variables
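[Diagram: time axis marked t−L, t−L+1, ..., t−1, t, showing the paths x_{0:t−L}, x_{0:t−1}, x_{t−L+1:t−1} and the new block x′_{t−L+1:t}]
Candidate distribution for the extended block:
$(x_{0:t-L}, x'_{t-L+1:t}) \to (x_{0:t-L}, x_{t-L+1:t-1}, x'_{t-L+1:t})$:
$(x_{0:t-1}, x'_{t-L+1:t}) \sim \pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})$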
Sampling a block of variables
Target distribution for the “natural” block $(x_{0:t-L}, x'_{t-L+1:t})$:
$\pi_t(x_{0:t-L}, x'_{t-L+1:t})$
⇒ auxiliary target distribution for the extended block
$(x_{0:t-1}, x'_{t-L+1:t}) = (x_{0:t-L}, x_{t-L+1:t-1}, x'_{t-L+1:t})$:
$\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})$
with $r_t$ = any conditional distribution
⇒ proposal + target distributions → importance sampling
Fixed-Lag Sequential Monte Carlo
A. Doucet and S. Senecal, 2004
Simulation scheme t− 1 → t (index (i) dropped):
• Sampling step
$x'_{t-L+1:t} \sim q_t(x'_{t-L+1:t} \mid x_{0:t-1})$
• Updating weights
$w_t \propto w_{t-1} \times \frac{\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})}{\pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})}$
• Resampling step
Improving simulation
Optimal proposal distribution $q_t(x'_{t-L+1:t} \mid x_{0:t-1})$:
→ minimizing the variance of the incremental weight:
$iw = \frac{\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})}{\pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})}$
⇒ $q_t$ = L-step ahead predictive
$\pi_t(x'_{t-L+1:t} \mid x_{0:t-L}) = p(x'_{t-L+1:t} \mid x_{t-L}, y_{t-L+1:t})$
For one variable: optimal $q_t$ = 1-step ahead predictive
$\pi_t(x_t \mid x_{0:t-1}) = p(x_t \mid x_{t-1}, y_t)$
Improving simulation
Minimizing the variance of the incremental weight
⇒ optimal target distribution
$iw = \frac{\pi_t(x_{0:t-L}, x'_{t-L+1:t})\, r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})}{\pi_{t-1}(x_{0:t-1})\, q_t(x'_{t-L+1:t} \mid x_{0:t-1})}$
→ optimal conditional distribution $r_t(x_{t-L+1:t-1} \mid x_{0:t-L}, x'_{t-L+1:t})$
⇒ $r_t$ = (L − 1)-step ahead predictive
$\pi_{t-1}(x_{t-L+1:t-1} \mid x_{0:t-L}) = p(x_{t-L+1:t-1} \mid x_{t-L}, y_{t-L+1:t-1})$
Improving simulation
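For optimal $q_t$ and $r_t$, incremental weight:
$iw \to \frac{\pi_t(x_{0:t-L})}{\pi_{t-1}(x_{0:t-L})} = \frac{p(x_{0:t-L} \mid y_{1:t})}{p(x_{0:t-L} \mid y_{1:t-1})} \propto p(y_t \mid x_{t-L}, y_{t-L+1:t-1}) \propto \int p(y_t, x_{t-L+1:t} \mid x_{t-L}, y_{t-L+1:t-1})\, dx_{t-L+1:t}$
SISR for one variable with optimal proposal $q_t$:
$iw \to \frac{\pi_t(x_{0:t-1})}{\pi_{t-1}(x_{0:t-1})} \propto p(y_t \mid x_{t-1}) = \int p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, dx_t$
Bootstrap filter: $iw = p(y_t \mid x_t)$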
Example
Nonlinear state space model:
$x_t = \alpha(x_{t-1} + \beta x_{t-1}^3) + u_t$, with $x_0, u_t \sim \mathcal{N}(0, \sigma_u^2)$
$y_t = x_t + v_t$, with $v_t \sim \mathcal{N}(0, \sigma_v^2)$
Sequential Monte Carlo methods:
• Bootstrap filter, proposal p(xt|xt−1)
• SISR with optimal proposal p(xt|xt−1, yt)
• SISR for blocks with optimal proposal p(xt−L+1:t|xt−L, yt−L+1:t)
approximated by forward-backward recursions with KF/EKF
Parameter values α=0.9, β=0.4, σu=0.1 and σv=0.05
⇒ approximation of target distribution p(xt|y1:t)
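For this model the single-variable optimal proposal is available in closed form, because the observation is linear and Gaussian in x_t. Writing f(x) = α(x + βx³) as shorthand (the symbols f, m_t and s² are introduced here for illustration and do not appear on the slides), completing the square in x_t gives:

```latex
\begin{align*}
p(x_t \mid x_{t-1}, y_t)
  &\propto \mathcal{N}(x_t; f(x_{t-1}), \sigma_u^2)\, \mathcal{N}(y_t; x_t, \sigma_v^2)
   \propto \mathcal{N}(x_t; m_t, s^2), \\
s^2 &= \frac{\sigma_u^2 \sigma_v^2}{\sigma_u^2 + \sigma_v^2}, \qquad
m_t = s^2 \left( \frac{f(x_{t-1})}{\sigma_u^2} + \frac{y_t}{\sigma_v^2} \right), \\
p(y_t \mid x_{t-1}) &= \mathcal{N}(y_t; f(x_{t-1}), \sigma_u^2 + \sigma_v^2).
\end{align*}
```

With σu = 0.1 and σv = 0.05 this gives s² = 0.002 and m_t = 0.2 f(x_{t−1}) + 0.8 y_t, so the proposal leans mostly on the observation, and the incremental weight p(y_t | x_{t−1}) no longer depends on the sampled x_t.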
Approximation of the target distribution
⇒ Effective Sample Size:
$ESS = \frac{1}{\sum_{i=1}^{N} [w_t^{(i)}]^2}$
$w^{(i)} = \frac{1}{N}$ for all i: ESS = N
$w^{(i)} \approx 0$ for all i except one: ESS = 1
⇒ Resampling performed for $ESS \le \frac{N}{2}$, $\frac{N}{10}$
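The ESS criterion and its two limiting cases can be checked with a few lines. The function name and the N/2 threshold used in the demonstration are illustrative choices; the formula is the one given above, applied to normalized weights.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum_i w_i^2, computed on normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(1.0 / np.sum(w ** 2))

n = 100
ess_uniform = effective_sample_size(np.full(n, 1.0 / n))   # equal weights

degenerate = np.zeros(n)
degenerate[0] = 1.0
ess_degenerate = effective_sample_size(degenerate)          # one dominant particle

# Adaptive resampling rule: resample only when the ESS drops below a threshold.
needs_resampling = ess_degenerate <= n / 2
print(ess_uniform, ess_degenerate, needs_resampling)
```

Resampling only when the ESS falls below the threshold avoids adding resampling noise at steps where the weights are still well balanced.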
Simulation results
algorithm    MSE      ESS    RS      CPU
Bootstrap    0.0021   36.8   70.3%   0.68
SISR         0.0019   65.8   19.2%   0.48
BSISR-KF     0.0018   72.3   0.9%    0.21
BSISR-EKF    0.0018   73.5   0.8%    0.24
N = 100 particles, 100 runs of particle filters for a single and for a
block of L = 2 variables.
Approximation of the target distribution
Resampling for $ESS \le \frac{N}{2}$, N = 100
[Figure: Effective Sample Size (0 to 100) vs. time index (0 to 200)]
Approximated ESS vs. time index for the bootstrap filter (dotted), the
SISR with optimal proposal for a single variable (dash-dotted) and
approximated for a block of L=2 variables (solid).
Simulation results
block size L   N=100   N=500   N=1000   RS
2              74      370     715      0.9%
3              96      493     985      0.9%
4              99      496     989      1%
5              98      494     988      1%
10             97      486     972      2.5%
Approximated ESS averaged over 100 runs of particle filters for
blocks of L variables, considering N particles.
CPU time / number of particles N
Resampling for $ESS \le \frac{N}{2}$, 1,000 time steps
[Figure: CPU time (0 to 2.5) vs. number of particles N (100 to 1000)]
CPU time vs. N for the bootstrap filter (black), SISR with optimal
proposal for a single variable (blue) and approximated for a block of
L=2 variables (red), 100 realizations.
Conclusions - Perspectives
⇒ Importance of proposal/candidate distribution for sequential
Monte Carlo simulation methods
Design of proposal:
→ information in observation, dynamic of the state variable:
p(xt|xt−1)←→ p(xt|yt, xt−1)←→ p(x′t−L+1:t|xt−L, yt−L+1:t)
→ sampling a block/fixed lag of variables can be useful:
• for intermittent/informative observation, correlated variables
• applications ⇒ radar, navigation/positioning, tracking
References - SISR, Sequential Monte Carlo
• N. Gordon, D. Salmond, and A. F. M. Smith, “Novel approach to
nonlinear and non-Gaussian Bayesian state estimation,”
Proceedings IEE-F, vol. 140, pp. 107–113, 1993.
• G. Kitagawa, “Monte Carlo filter and smoother for non-Gaussian
nonlinear state space models,” J. Comput. Graph. Statist., vol.
5, pp. 1–25, 1996.
• A. Doucet, N. de Freitas, and N. Gordon, Eds., Sequential Monte
Carlo methods in practice, Statistics for engineering and
information science. Springer, 2001.
References - fixed-lag approaches
• H. Meirovitch, “Scanning method as an unbiased simulation
technique and its application to the study of self-avoiding
random walks,” Phys. Rev. A, vol. 32, pp. 3699–3708, 1985.
• M. K. Pitt and N. Shephard, “Filtering via simulation: auxiliary
particle filter,” J. Am. Stat. Assoc., vol. 94, pp. 590–599, 1999.
• X. Wang, R. Chen, and D. Guo, “Delayed-pilot sampling for
mixture Kalman filter with application in fading channels,” IEEE
Trans. Sig. Proc., vol. 50, pp. 241–253, 2002.
• A. Doucet and S. Senecal, “Fixed-Lag Sequential Monte Carlo”,
Proceedings of EUSIPCO2004.