Asymptotics for discrete random measures



Pierpaolo De Blasi

Statalk on Bayesian Nonparametrics, Collegio Carlo Alberto, February 19, 2016

1. Introduction
• Dirichlet process PD(0, θ)
• Stick-breaking representation
• Two-parameter Poisson-Dirichlet process PD(α, θ)
• Truncation error Rn
• Almost sure approximations

2. Applications in PD(α, θ) mixtures
• Blocked Gibbs sampler
• Slice sampler
• Posterior asymptotics

3. Asymptotics
• Limiting distribution
• Large deviation

Dirichlet process

The Dirichlet process (DP) defines a prior distribution on the space of probability measures on R (or Rd or any Polish space).

Let θ > 0 and H(·) a probability measure on R,

P ∼ DP(θH)

when for any partition B1, . . . ,Bd of R,

(P(B1), . . . ,P(Bd )) ∼ Dirichlet (θH(B1), . . . , θH(Bd )) .

P is with probability one a discrete probability measure which admits an infinite sum representation:

$$P(\cdot) = \sum_{j\ge 1} p_j\,\delta_{z_j}(\cdot)$$

where
• $(p_j) = (p_1, p_2, \ldots)$ is a random vector taking values in the infinite probability simplex $\Delta_\infty = \{p_j \ge 0,\ \sum_{j\ge 1} p_j = 1\}$
• $(z_j) = (z_1, z_2, \ldots)$ is an iid sequence from $H(\cdot)$, independent of $(p_j)$.

Ferguson and Klass sum representation

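$$P(\cdot) = \sum_{j\ge 1} p_j\,\delta_{z_j}(\cdot) = \sum_{j\ge 1} \frac{J_j}{\sum_{\ell\ge 1} J_\ell}\,\delta_{z_j}(\cdot)$$

with $(J_j)$ the jumps of a gamma process $\{\xi_t,\ t\in[0,1]\}$ ($\theta = 1$ here),

$$E(e^{-s\xi_t}) = (1+s)^{-t} = \exp\Big\{-t\int_0^\infty (1-e^{-su})\,\nu(du)\Big\},\qquad \nu(du) = u^{-1}e^{-u}\,du.$$

• $\Gamma_j = E_1 + \cdots + E_j$, for $E_1, E_2, \ldots \sim_{iid} \mathrm{Exp}(1)$, and $N(x) = \int_x^\infty u^{-1}e^{-u}\,du$ the right tail of the Lévy measure $\nu(du)$.

• Then $J_j = N^{-1}(\Gamma_j)$, $J_1 > J_2 > \ldots$, so that the normalized weights

$$p_j = N^{-1}(\Gamma_j)\Big/\sum_{\ell\ge 1} N^{-1}(\Gamma_\ell)$$

are a.s. decreasing, $p_1 > p_2 > \ldots$, and define a distribution on $\Delta_\infty$ known as the Poisson-Dirichlet distribution.

• No closed form solution for $N^{-1}(u)$. Each $p_j$ requires the computation of an infinite sum.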

Stick-breaking sum representation

Also known as residual allocation model,

$$P(\cdot) = \sum_{j\ge 1} \tilde p_j\,\delta_{z_j}(\cdot) = \sum_{j\ge 1} v_j \prod_{\ell=1}^{j-1}(1-v_\ell)\,\delta_{z_j}(\cdot)$$

where $v_j \sim_{iid} \mathrm{beta}(1, \theta)$.

• The stick-breaking weights

$$\tilde p_j = v_j \prod_{\ell=1}^{j-1}(1-v_\ell)$$

have decreasing expected value, $E(\tilde p_1) > E(\tilde p_2) > \ldots$, and define a distribution on $\Delta_\infty$ known as the GEM distribution.

• Correspondence with the Ferguson and Klass weights $(p_j)$: the $(\tilde p_j)$ are the $(p_j)$ in size-biased random order∗, that is in the order they appear in a multinomial sampling.

• Convenient for simulation purposes: no need to compute an infinite sum.

size-biased random order∗:

$$P(\tilde p_1 = p_j \mid p_1, p_2, \ldots) = p_j$$

$$P(\tilde p_{j+1} = p_i \mid \tilde p_1, \ldots, \tilde p_j;\ p_1, p_2, \ldots) = \frac{p_i\,\mathbf{1}(p_i \ne \tilde p_\ell \text{ for all } 1 \le \ell \le j)}{1 - \tilde p_1 - \cdots - \tilde p_j}$$
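The simulation convenience of the stick-breaking construction can be illustrated with a minimal sketch using only Python's standard library. The function names `stick_breaking_weights` and `mean_first_weight` are hypothetical helpers introduced here, not part of the talk.

```python
import random


def stick_breaking_weights(theta, n_sticks, seed=None):
    """First n_sticks weights p~_j for v_j ~ beta(1, theta), plus the leftover mass."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # prod_{l<j} (1 - v_l): mass not yet allocated
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, theta)
        weights.append(v * remaining)  # p~_j = v_j * prod_{l<j} (1 - v_l)
        remaining *= 1.0 - v
    return weights, remaining


def mean_first_weight(theta, n_rep, seed=None):
    """Monte Carlo estimate of E(p~_1) = E(v_1), whose exact value is 1 / (1 + theta)."""
    rng = random.Random(seed)
    return sum(rng.betavariate(1.0, theta) for _ in range(n_rep)) / n_rep


if __name__ == "__main__":
    w, leftover = stick_breaking_weights(theta=2.0, n_sticks=10, seed=0)
    print(w, leftover)
```

Each weight costs one beta draw and one multiplication, in contrast with the Ferguson and Klass weights, which each involve the inverse tail N^{-1} and an infinite normalizing sum.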

Two-parameter Poisson-Dirichlet process

Also known as Pitman-Yor process after Pitman and Yor (1997),

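$$P(\cdot) = \sum_{j\ge 1} \tilde p_j\,\delta_{z_j}(\cdot) = \sum_{j\ge 1} v_j \prod_{\ell=1}^{j-1}(1-v_\ell)\,\delta_{z_j}(\cdot)$$

where $v_j \sim_{ind} \mathrm{beta}(1-\alpha, \theta + j\alpha)$ for $\alpha \in (0,1)$ and $\theta > -\alpha$.

• The distribution of the ordered sequence $p_1 > p_2 > \ldots$, where $p_j = \tilde p_{(j)}$, is known as the two-parameter Poisson-Dirichlet distribution.

• It admits a representation as normalized jumps of a process with independent increments only for $\theta = 0$ (stable process):

$$p_j = \frac{\Gamma_j^{-1/\alpha}}{\sum_{\ell\ge 1} \Gamma_\ell^{-1/\alpha}}$$

• Since the $v_j$ are stochastically decreasing, $E(\tilde p_j)$ has a slower decay compared to the DP case.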

We use
• PD(0, θ) for the Dirichlet process;
• PD(α, θ) for the two-parameter Poisson-Dirichlet process,

whenever we refer to the corresponding distribution of (pj) or (p̃j) on ∆∞.

In both cases, and unless otherwise specified, the locations (zj) are taken as iid draws from H(·).
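The slower decay of E(p̃j) under PD(α, θ) can be checked numerically. The sketch below assumes only the independence of the sticks and the beta means E(vj) = (1 − α)/(1 + θ + (j − 1)α); the helper name `expected_weights` is hypothetical. Setting α = 0 recovers the DP case.

```python
def expected_weights(alpha, theta, n):
    """E(p~_j), j = 1..n, for independent v_j ~ beta(1 - alpha, theta + j * alpha)."""
    out = []
    prod = 1.0  # E prod_{l<j} (1 - v_l), a product of means by independence
    for j in range(1, n + 1):
        denom = 1.0 + theta + (j - 1) * alpha
        out.append((1.0 - alpha) / denom * prod)  # E(v_j) * E prod_{l<j} (1 - v_l)
        prod *= (theta + j * alpha) / denom       # multiply by E(1 - v_j)
    return out


if __name__ == "__main__":
    dp = expected_weights(0.0, 1.0, 30)   # PD(0, 1): geometric decay
    pd = expected_weights(0.5, 1.0, 30)   # PD(0.5, 1): polynomial decay
    for j in (1, 5, 10, 30):
        print(j, dp[j - 1], pd[j - 1])
```

For α = 0 the expected weights decay geometrically in j, while for α > 0 the ratio E(p̃j+1)/E(p̃j) = (θ + jα)/(1 + θ + jα) tends to one, so the decay is only polynomial.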

Whenever there is “. . .”, an open problem is pointed out.

Truncation error

Define the residual probability or truncation error as

$$R_n = \sum_{j>n} \tilde p_j = \prod_{j=1}^{n} (1 - v_j)$$

For ε > 0, define the counting process

$$N_\varepsilon = \inf\{n \in \mathbb{N} : R_n < \varepsilon\}.$$

When the vj are identically distributed (i.e. the DP case), Nε − 1 takes on the interpretation of the renewal process Xt evaluated at t = − log ε with

• independent interarrival times distributed as − log(1 − vj)

• n-th arrival time Tn distributed as − log Rn.

PD(0, θ)

• Since (1− vj ) ∼iid beta(θ, 1), − log(1− vj ) ∼iid Exp(θ) and

− log Rn ∼ Gamma(n, θ)

• Renewal process interpretation:

P(Tn ≤ t) = 1− P(Xt ≤ n − 1)

Tn ∼ Gamma(n, θ), Xt ∼ Pois(θ t)

where Tn = − log Rn, Xt = Nε − 1 for t = log(1/ε).

• Hence Nε − 1 ∼ Pois(θ log(1/ε)): the smaller ε, the larger the number Nε of weights p̃j needed to account for 1 − ε probability.
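The Poisson law of Nε − 1 can be checked by simulation. This is a minimal sketch, assuming the renewal representation above with Exp(θ) interarrival times (rate θ); `n_eps_dp` and `mean_n_eps` are hypothetical helper names.

```python
import math
import random


def n_eps_dp(theta, eps, rng):
    """N_eps = inf{n : R_n < eps} when -log(1 - v_j) ~ Exp(rate theta)."""
    t = math.log(1.0 / eps)
    n, arrival = 0, 0.0                    # arrival = -log R_n
    while arrival <= t:                    # R_n >= eps: keep breaking the stick
        arrival += rng.expovariate(theta)  # expovariate takes the rate
        n += 1
    return n


def mean_n_eps(theta, eps, n_rep, seed=None):
    """Monte Carlo mean of N_eps, to compare with 1 + theta * log(1 / eps)."""
    rng = random.Random(seed)
    return sum(n_eps_dp(theta, eps, rng) for _ in range(n_rep)) / n_rep


if __name__ == "__main__":
    theta, eps = 2.0, 1e-3
    print(mean_n_eps(theta, eps, 20000, seed=0), 1.0 + theta * math.log(1.0 / eps))
```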

PD(α, θ)

• (1 − vj) ∼ beta(θ + jα, 1 − α), so − log(1 − vj) is a positive r.v. whose distribution is not closed under convolution. It is infinitely divisible, see Lemma 1 in Ferguson (1974).

• − log Rn does not correspond to the n-th arrival time of a Poisson process; rather, Nε is a counting process with independent, but not identically distributed, interarrival times.

• Still, P(− log Rn ≤ log(1/ε)) = 1 − P(Nε ≤ n).

However, no closed form distribution is available for − log Rn and Nε.

We will be focusing on the distribution of Rn, seeking asymptotic approximations as n → ∞:

1. limiting distribution

2. large deviation principle

In this, we keep the PD(0, θ) case as a running example for comparison and illustration.

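Truncation error and a.s. approximation

The stick-breaking representation suggests the almost sure truncation

$$P_N(\cdot) = \sum_{j=1}^{N} \tilde p_j\,\delta_{z_j}(\cdot) + R_N\,\delta_{z_0}(\cdot) = \sum_{j=1}^{N} v_j \prod_{\ell=1}^{j-1}(1-v_\ell)\,\delta_{z_j}(\cdot) + \prod_{j=1}^{N}(1-v_j)\,\delta_{z_0}(\cdot)$$

obtained by setting $v_{N+1} = 1$ so that $\tilde p_{N+1} = 1 - \sum_{j=1}^{N} \tilde p_j = R_N$.

• Muliere and Tardella (1998): sampling functionals of P like

$$P(g) = \int g(x)\,P(dx) = \sum_{j\ge 1} \tilde p_j\, g(z_j)$$

• For each bounded and continuous real valued function g,

$$P_N(g) \to_{a.s.} P(g)$$

• Let $P_\varepsilon = P_{N_\varepsilon}$ for $N_\varepsilon = \inf\{n \in \mathbb{N} : R_n < \varepsilon\}$. Then

$$d_{TV}(P_\varepsilon, P) \le \varepsilon, \quad a.s.$$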
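The ε-truncation with random stopping level Nε can be sketched as follows for the DP case. The function `truncated_functional` and its `draw_location` argument are hypothetical, and drawing the extra atom z0 from H is an assumption made here, since the slides do not specify it.

```python
import random


def truncated_functional(g, theta, eps, draw_location, seed=None):
    """Return (P_eps(g), N_eps) for sticks v_j ~ beta(1, theta).

    draw_location(rng) must return one draw from H (user supplied).
    """
    rng = random.Random(seed)
    total, remaining, n = 0.0, 1.0, 0
    while remaining >= eps:                        # stop at the first n with R_n < eps
        v = rng.betavariate(1.0, theta)
        total += v * remaining * g(draw_location(rng))
        remaining *= 1.0 - v
        n += 1
    total += remaining * g(draw_location(rng))     # leftover mass R_N on z_0 (assumed ~ H)
    return total, n


if __name__ == "__main__":
    # Example: H uniform on [0, 1], g the indicator of [0, 1/2).
    value, n_eps = truncated_functional(
        lambda x: float(x < 0.5), 1.0, 1e-4, lambda rng: rng.random(), seed=0
    )
    print(value, n_eps)
```

The number of sticks is not fixed in advance: the loop stops as soon as the residual mass drops below ε, which is what makes the total variation bound hold almost surely.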

PD(α, θ) mixtures for density estimation

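$$x_i \mid y_i \sim f(x_i \mid y_i), \qquad y_i \mid P \sim P, \qquad i = 1, 2, \ldots, n$$

$$P \sim \mathcal{P} \qquad (1)$$

where the $y_i$ are latent (unobservable) variables.

• By exploiting the stick-breaking sum representation,

$$f(x_i) = \int f(x_i \mid y_i)\,dP(y_i), \qquad P(\cdot) = \sum_{j\ge 1} \tilde p_j\,\delta_{z_j}(\cdot)$$

or

$$f(x_i) = \sum_{j\ge 1} \tilde p_j\, f(x_i \mid z_j), \qquad (\tilde p_j) \sim PD(\alpha, \theta), \quad (z_j) \sim_{iid} H(\cdot)$$

• Rewrite (1) as

$$x_i \mid (\tilde p_j), (z_j) \sim \sum_{j\ge 1} \tilde p_j\, f(x \mid z_j), \qquad i = 1, 2, \ldots, n$$

$$(\tilde p_j) \sim PD(\alpha, \theta), \qquad (z_j) \sim_{iid} H(\cdot) \qquad (2)$$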

Truncation applied to the posterior

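Ishwaran and James (2001), Gelfand and Kottas (2002).

• Marginal sampler: avoid dealing with an infinite number of parameters by integrating out the unknown mixing P, i.e. the sequences $(\tilde p_j)$ and $(z_j)$:

$$x_i \mid y_i \sim f(x_i \mid y_i), \qquad i = 1, 2, \ldots, n$$

$$y_1, \ldots, y_n \sim PU(\alpha, \theta, H)$$

where $PU(\alpha, \theta, H)$ is the prediction rule of the Pólya urn.

• Let $\pi(dy \mid x)$ be the posterior distribution of $y = (y_1, \ldots, y_n)$ given $x = (x_1, \ldots, x_n)$.

• Inference for the posterior of P based only on the posterior $y_i$ values:

• PD(0, θ)

$$P(\cdot \mid y) \sim DP(\theta H + n P_n)$$

where $P_n$ is the empirical measure of $y = (y_1, \ldots, y_n)$.

• PD(α, θ)

$$P(\cdot \mid y) = \sum_{j=1}^{k} \tilde p_j^{*}\,\delta_{\zeta_j}(\cdot) + \tilde p_{k+1}^{*}\,P^{*}(\cdot)$$

where $\zeta_1, \ldots, \zeta_k$ are the unique values among the $y_i$, with frequencies $n_1^{*}, \ldots, n_k^{*}$, and

$$(\tilde p_1^{*}, \ldots, \tilde p_k^{*}, \tilde p_{k+1}^{*}) \sim \mathrm{Dirichlet}(n_1^{*} - \alpha, \ldots, n_k^{*} - \alpha, \theta + \alpha k)$$

is independent of $P^{*}$, which is PD(α, θ + kα).

• Thus, to approximate functionals of the posterior $P(\cdot \mid y)$, use

$$P_N(\cdot \mid y) = \sum_{j=1}^{k} \tilde p_j^{*}\,\delta_{\zeta_j}(\cdot) + \tilde p_{k+1}^{*}\,P_N^{*}(\cdot)$$

$$P_N^{*}(\cdot) = \sum_{j=1}^{N} \tilde p_j\,\delta_{z_j}(\cdot) + R_N\,\delta_{z_0}(\cdot) = \sum_{j=1}^{N} v_j \prod_{\ell=1}^{j-1}(1-v_\ell)\,\delta_{z_j}(\cdot) + \prod_{j=1}^{N}(1-v_j)\,\delta_{z_0}(\cdot)$$

for $v_j \sim \mathrm{beta}(1-\alpha, \theta + k\alpha + j\alpha)$.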
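A draw of the weights of the truncated posterior PN(·|y) can be sketched as below, given the cluster frequencies. The helper name `posterior_truncated_weights` is hypothetical, and the Dirichlet vector is generated through normalized gamma variables, a standard device rather than anything prescribed in the talk. Locations are omitted: only the weights are drawn.

```python
import random


def posterior_truncated_weights(counts, alpha, theta, n_sticks, seed=None):
    """Weights of P_N(.|y): (old-atom weights, new-atom weights, leftover mass).

    counts are the frequencies n*_1, ..., n*_k of the unique y values.
    """
    rng = random.Random(seed)
    k = len(counts)
    # Dirichlet(n*_1 - alpha, ..., n*_k - alpha, theta + alpha k) via normalized gammas
    gammas = [rng.gammavariate(c - alpha, 1.0) for c in counts]
    gammas.append(rng.gammavariate(theta + alpha * k, 1.0))
    total = sum(gammas)
    p_star = [x / total for x in gammas]
    # Sticks of P*_N: v_j ~ beta(1 - alpha, theta + k alpha + j alpha)
    sticks, remaining = [], 1.0
    for j in range(1, n_sticks + 1):
        v = rng.betavariate(1.0 - alpha, theta + k * alpha + j * alpha)
        sticks.append(v * remaining)
        remaining *= 1.0 - v
    new_atoms = [p_star[k] * w for w in sticks]
    return p_star[:k], new_atoms, p_star[k] * remaining


if __name__ == "__main__":
    old, new, rest = posterior_truncated_weights([3, 1, 2], 0.5, 1.0, 50, seed=0)
    print(old, sum(new), rest)
```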

Truncation applied to the prior

Blocked Gibbs sampler by Ishwaran and James (2001).

• Conditional sampler: P is not marginalized out, rather replaced with its almost sure truncation in the prior

$$x_i \mid y_i \sim f(x_i \mid y_i), \qquad y_i \mid P \sim P, \qquad i = 1, 2, \ldots, n$$

$$P \sim \mathcal{P}_N \qquad (3)$$

• N can be chosen sufficiently large so that the L1 distance $\|\cdot\|_1$ between the marginal densities

$$m_N(x) = \int \Big\{\prod_{i=1}^{n} \int f(x_i \mid y_i)\,dP(y_i)\Big\}\,d\mathcal{P}_N(P), \qquad m_\infty(x) = \int \Big\{\prod_{i=1}^{n} \int f(x_i \mid y_i)\,dP(y_i)\Big\}\,d\mathcal{P}(P)$$

is small.

• It can be shown that

$$\|m_N(x) - m_\infty(x)\|_1 \le 4\big(1 - E[(1 - R_N)^n]\big)$$

• PD(0, θ)

$$1 - E[(1 - R_N)^n] \approx n\,e^{-N/\theta}, \quad \text{as } N \to \infty$$

based on

$$1 - R_N =_d 1 - \exp(-E_1/\theta) \cdots \exp(-E_N/\theta) \approx 1 - \exp(-N/\theta)$$

by the law of large numbers, since $E_1, \ldots, E_N$ are iid Exp(1).

PD(α, θ)

• Asymptotic evaluation of $E[(1 - R_N)^n]$ not available.

• By direct calculation,

$$E[(R_N)^r] = \prod_{j=1}^{N} \frac{(\theta + j\alpha)_{(r)}}{(\theta + (j-1)\alpha + 1)_{(r)}}$$

where

$$x_{(r)} = \frac{\Gamma(x + r)}{\Gamma(x)} = x(x+1)\cdots(x+r-1)$$

• . . .
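For a fixed sample size n, the moment formula gives the bound 4(1 − E[(1 − RN)^n]) exactly through the binomial expansion of (1 − RN)^n. The sketch below is one way to evaluate it; the helper names `rising`, `moment_RN` and `truncation_bound` are hypothetical.

```python
import math


def rising(x, r):
    """Ascending factorial x_(r) = x (x + 1) ... (x + r - 1), with x_(0) = 1."""
    out = 1.0
    for i in range(r):
        out *= x + i
    return out


def moment_RN(alpha, theta, N, r):
    """E[(R_N)^r] from the product formula over j = 1..N."""
    out = 1.0
    for j in range(1, N + 1):
        out *= rising(theta + j * alpha, r) / rising(theta + (j - 1) * alpha + 1.0, r)
    return out


def truncation_bound(alpha, theta, N, n):
    """4 (1 - E[(1 - R_N)^n]) via E[(1 - R_N)^n] = sum_r (-1)^r C(n, r) E[R_N^r]."""
    expectation = sum(
        (-1) ** r * math.comb(n, r) * moment_RN(alpha, theta, N, r) for r in range(n + 1)
    )
    return 4.0 * (1.0 - expectation)


if __name__ == "__main__":
    for N in (20, 40, 80):
        print(N, truncation_bound(0.5, 1.0, N, 3))
```

The alternating sum loses floating-point precision when n is large, so this is only meant for small n; it does not address the asymptotic evaluation in N, which the slide leaves open.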

Slice sampler

Walker (2007), Kalli, Griffin and Walker (2011).

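• Exact conditional sampler: the truncation level is adaptive, included in the model via auxiliary parameters.

• Start with the hierarchical representation

$$x_i \mid (\tilde p_j), (z_j) \sim \sum_{j\ge 1} \tilde p_j\, f(x \mid z_j), \qquad i = 1, 2, \ldots, n$$

$$(\tilde p_j) \sim PD(\alpha, \theta), \qquad (z_j) \sim_{iid} H(\cdot)$$

• Data augmentation 1: introduce uniform r.v. $u_1, \ldots, u_n \in [0, 1]$,

$$x_i, u_i \mid (\tilde p_j), (z_j) \sim \sum_{j\ge 1} \mathbf{1}(u_i \le \tilde p_j)\, f(x \mid z_j), \qquad i = 1, 2, \ldots, n$$

• Data augmentation 2: introduce allocation variables $d_1, \ldots, d_n \in \mathbb{N}$ such that $P(d_i = j) = \tilde p_j$,

$$x_i, u_i, d_i \mid (\tilde p_j), (z_j) \sim \mathbf{1}(u_i \le \tilde p_{d_i})\, f(x \mid z_{d_i}), \qquad i = 1, 2, \ldots, n$$

• Gibbs sampler on the augmented parameter space $\{\tilde p_j, z_j, u_i, d_i\}$

• Convenient to work with $\{v_j, z_j, u_i, d_i\}$ according to $\tilde p_j = v_j \prod_{\ell=1}^{j-1}(1 - v_\ell)$.

• Full conditionals: let $n_j = \sum_{i=1}^{n} \mathbf{1}(d_i = j)$,

$$\pi(v_j \mid \text{rest}) \sim \mathrm{beta}\Big(1 - \alpha + n_j,\ \theta + j\alpha + \sum_{\ell > j} n_\ell\Big)$$

$$\pi(u_i \mid \text{rest}) \sim \mathrm{Unif}(0, \tilde p_{d_i}),$$

$$\pi(d_i = j \mid \text{rest}) = \frac{\mathbf{1}(\tilde p_j > u_i)\, f(x_i \mid z_j)}{\sum_{\ell \in A_{u_i}} f(x_i \mid z_\ell)}, \qquad A_u = \{j : \tilde p_j > u\}$$

• We need to account only for a finite number N of components in sampling $\pi(d_i = j \mid \text{rest})$, that is $\bigcup_{i=1}^{n} A_{u_i} \subset \{1, \ldots, N\}$ with N such that

$$R_N < u_{(1)} = \min\{u_1, \ldots, u_n\}$$

• Note that $N = N_{u_{(1)}} = \inf\{n \in \mathbb{N} : R_n < u_{(1)}\}$

• However $R_N$ is not as in the prior but with respect to $v_j \mid \text{rest}$. So the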

• . . .

. . .
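Why N components suffice can be seen from p̃j ≤ R(j−1) ≤ RN < u(1) for every j > N, so no index beyond N can belong to any A(ui). A minimal sketch of this bookkeeping, with the sticks passed in as a plain list and the hypothetical helper name `slice_truncation`:

```python
def slice_truncation(sticks, u):
    """Return (N, weights, active) for given sticks v_1, v_2, ... and slice variables u.

    N = inf{n : R_n < min(u)}; active[i] is the set A_{u_i} = {j : p~_j > u_i},
    with components indexed from 1.
    """
    u_min = min(u)
    weights, remaining, N = [], 1.0, 0
    while remaining >= u_min:
        if N == len(sticks):
            raise ValueError("not enough sticks supplied to reach R_N < min(u)")
        v = sticks[N]
        weights.append(v * remaining)
        remaining *= 1.0 - v
        N += 1
    active = [{j + 1 for j, p in enumerate(weights) if p > ui} for ui in u]
    return N, weights, active


if __name__ == "__main__":
    # Deterministic illustration: all sticks equal to 1/2, so p~_j = R_j = 2^(-j).
    print(slice_truncation([0.5] * 60, [0.3, 0.05, 0.01]))
```

In an actual sampler the sticks would be drawn from their full conditionals and extended lazily; the list argument here is only a simplification for illustration.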

Efficient slice sampler

• Let $(\xi_j) = (\xi_1, \xi_2, \ldots)$ be a decreasing positive sequence and write

$$x_i, u_i, d_i \mid (\tilde p_j), (z_j) \sim \mathbf{1}(u_i \le \xi_{d_i})\, \frac{\tilde p_{d_i}}{\xi_{d_i}}\, f(x \mid z_{d_i}), \qquad i = 1, 2, \ldots, n$$

i.e. recover the previous version for the (random) sequence $(\xi_j) = (\tilde p_j)$.

• The full conditionals now are

$$\pi(u_i \mid \text{rest}) \sim \mathrm{Unif}(0, \xi_{d_i})$$

$$\pi(d_i = j \mid \text{rest}) = \frac{\mathbf{1}(\xi_j > u_i)\, \frac{\tilde p_j}{\xi_j}\, f(x_i \mid z_j)}{\sum_{\ell=1}^{N_i} \frac{\tilde p_\ell}{\xi_\ell}\, f(x_i \mid z_\ell)}, \qquad N_i = \max\{j \in \mathbb{N} : \xi_j > u_i\}$$

• $N_i$ is deterministic given $u_i$. For $\xi_j = e^{-\xi j}$,

$$N_i = \lfloor -(1/\xi) \log u_i \rfloor, \quad \text{and} \quad N = \max_i\{N_i\} = \lfloor -(1/\xi) \log u_{(1)} \rfloor$$

compared to $N = \inf\{n \in \mathbb{N} : R_n < u_{(1)}\}$.

• The choice of the sequence $(\xi_j)$, or of the constant ξ in

$$\xi_j = e^{-\xi j},$$

is however a delicate issue.

• Balance between mixing and computational time of the sampler.

• The mixing of the sampler depends on how the ratio

$$E(\tilde p_j)/\xi_j = E(\tilde p_j)/e^{-\xi j}$$

increases with j: faster rates (larger ξ) are associated with better mixing but longer running time.

• The study of the asymptotic behavior of $R_N$ should give some guidelines in the choice of the sequence $(\xi_j)$.

• . . .
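With the geometric sequence ξj = exp(−ξ j), the truncation level is a closed-form function of the slice variables. A minimal sketch follows; `n_slice` is a hypothetical helper name, and ties ξj = ui are ignored since they have probability zero for continuous ui.

```python
import math


def n_slice(u, xi):
    """floor(-(1 / xi) * log u): the largest j with exp(-xi * j) > u (0 if there is none)."""
    return max(0, math.floor(-math.log(u) / xi))


def n_slice_all(us, xi):
    """N = max_i N_i, which equals the value at the smallest slice variable."""
    return max(n_slice(u, xi) for u in us)


if __name__ == "__main__":
    us = [0.3, 0.05, 0.01]
    print([n_slice(u, 0.5) for u in us], n_slice_all(us, 0.5))
```

No stick needs to be generated to find N here, in contrast with N = inf{n : Rn < u(1)}, which depends on the random weights.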

Posterior asymptotics

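• PD(α, θ) mixture of normal densities

$$f(x) = f_{P,\sigma}(x) = \int \frac{1}{\sigma}\,\phi\Big(\frac{x - y}{\sigma}\Big)\,dP(y), \qquad P(\cdot) = \sum_{j\ge 1} \tilde p_j\,\delta_{z_j}(\cdot), \quad \sigma \sim \pi$$

• Assume $x^n = (x_1, \ldots, x_n)$ are iid from some $f_0(x)$.

• Let Π be the distribution on the space of densities $\mathcal{F}$ induced by P and π.

• Interest is in establishing that the posterior $\Pi(\cdot \mid x^n)$ accumulates in L1 neighborhoods of $f_0$:

$$\Pi\big(f : \|f - f_0\|_1 \le \varepsilon_n \mid x^n\big) \to 1,$$

where $\varepsilon_n \to 0$, $n\varepsilon_n^2 \to \infty$, is the posterior convergence rate.

• Standard sufficient conditions for posterior convergence involve the prior Π and the regularity of $f_0$; see Ghosal, Ghosh and van der Vaart (2000).

• Prior contraction rates

$$\Pi\Big(f : \int f_0 \log(f_0/f) \le \tilde\varepsilon_n^2,\ \int f_0 \big(\log(f_0/f)\big)^2 \le \tilde\varepsilon_n^2\Big) \ge e^{-n\tilde\varepsilon_n^2}$$

where $\tilde\varepsilon_n \le \varepsilon_n$.

• PD(0, θ): when $f_0$ is β-smooth, $\tilde\varepsilon_n = n^{-\beta/(1+2\beta)}(\log n)^t$.

• Low entropy - high mass sieve: let $(\mathcal{F}_n)$ be a sequence of sets $\mathcal{F}_n \subset \mathcal{F}$ with $\mathcal{F}_n \uparrow \mathcal{F}$ such that

$$\Pi(\mathcal{F}_n^c) \le e^{-4n\tilde\varepsilon_n^2}, \qquad \log N(\varepsilon_n, \mathcal{F}_n, \|\cdot\|_1) \le n\varepsilon_n^2$$

where $N(\varepsilon_n, \mathcal{F}_n, \|\cdot\|_1)$ is the covering number of $\mathcal{F}_n$, the smallest number of $\varepsilon_n$-balls needed to cover $\mathcal{F}_n$ in the L1-metric; its logarithm is the entropy.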

Low entropy - high mass sieve

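Shen, Tokdar and Ghosal (2013).

• For $P = \sum_{j\ge 1} \tilde p_j\,\delta_{z_j}$, $(z_j) \sim_{iid} H$,

$$\mathcal{F}_n = \Big\{ f_{P,\sigma}(x) = \int \frac{1}{\sigma}\,\phi\Big(\frac{x-y}{\sigma}\Big)\,dP(y) :\ R_{m_n} = \sum_{j > m_n} \tilde p_j \le \varepsilon_n,\ \sigma_n \le \sigma \le \sigma_n(1 + \varepsilon_n)^{M_n} \text{ and } z_j \in [-a_n, a_n],\ j \le m_n \Big\}$$

where $M_n = \sigma_n^{-1} = a_n = n$.

• PD(0, θ): for $\varepsilon_n = n^{-\gamma}(\log n)^t$, $\gamma \in (0, 1/2)$, set $m_n = \lfloor n\varepsilon_n^2/\log n \rfloor$. Then

$$\log N(\varepsilon_n, \mathcal{F}_n, \|\cdot\|_1) = m_n \log(1/\varepsilon_n) + \text{other things} \lesssim n\varepsilon_n^2$$

$$\Pi(\mathcal{F}_n^c) = P(R_{m_n} > \varepsilon_n) + \text{other things} \lesssim e^{-(1-2\gamma)n\varepsilon_n^2}$$

as it can be shown that $P(R_{m_n} > \varepsilon_n) \lesssim e^{-(1-2\gamma)n\varepsilon_n^2}$.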

Limiting distribution PD(0, θ)

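$$\frac{-\log R_n - n/\theta}{\sqrt{n}/\theta} \to_d Z, \qquad Z \sim N(0, 1)$$

• Follows from the CLT since $-\log R_n \sim \mathrm{Gamma}(n, \theta)$.

• For a > 0 and $\Phi(z) = \int_{-\infty}^{z} (2\pi)^{-1/2} e^{-x^2/2}\,dx$,

$$P\big(R_n > \underbrace{e^{-n/\theta}\,e^{a\sqrt{n}}}_{\varepsilon_n}\big) \to \Phi(-\theta a)$$

that is $\varepsilon_n = e^{-n/\theta} s_n \to 0$ for $s_n = e^{a\sqrt{n}} \to \infty$.

• As for $N_\varepsilon$, as ε → 0,

$$\frac{N_\varepsilon - 1 - \theta\log(1/\varepsilon)}{\sqrt{\theta\log(1/\varepsilon)}} \to_d Z, \qquad Z \sim N(0, 1)$$

Is it possible to obtain this CLT from the CLT of Rn and the relation P(− log Rn ≤ log(1/ε)) = 1 − P(Nε ≤ n)?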
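The normal approximation for the tail of Rn can be checked by simulating − log Rn directly from its gamma law. The sketch below uses only the standard library; `std_normal_cdf` and `tail_prob_mc` are hypothetical helper names, and the values of n, θ and a in the demo are arbitrary.

```python
import math
import random


def std_normal_cdf(z):
    """Phi(z) expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tail_prob_mc(theta, n, a, n_rep, seed=None):
    """Monte Carlo estimate of P(R_n > exp(-n / theta) * exp(a * sqrt(n)))."""
    rng = random.Random(seed)
    threshold = n / theta - a * math.sqrt(n)   # R_n > eps_n  <=>  -log R_n < threshold
    hits = 0
    for _ in range(n_rep):
        # gammavariate takes (shape, scale): rate theta means scale 1 / theta
        neg_log_r = rng.gammavariate(n, 1.0 / theta)
        hits += neg_log_r < threshold
    return hits / n_rep


if __name__ == "__main__":
    theta, n, a = 1.0, 400, 0.5
    print(tail_prob_mc(theta, n, a, 20000, seed=0), std_normal_cdf(-theta * a))
```

At finite n the two numbers differ by a skewness correction of order 1/√n, so only rough agreement should be expected.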

Limiting distribution PD(α, θ)

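THEOREM 1. Let $T_\alpha$ be a stable random variable with exponent α, $E e^{-uT_\alpha} = e^{-u^\alpha}$. Also, let $T_{\alpha,\theta}$ be a polynomially tilted version of $T_\alpha$, i.e., for $f_\alpha$ the density of $T_\alpha$, the density of $T_{\alpha,\theta}$ is proportional to $t^{-\theta} f_\alpha(t)$. Then

$$n^{\frac{1-\alpha}{\alpha}}\, R_n \to_{a.s.} \alpha\,(T_{\alpha,\theta})^{-1}$$

• See Lemma 3.11, Pitman (2006); proof based on Kingman’s paintbox representation (Kingman, 1978) and results in Gnedin, Hansen and Pitman (2007) for deterministic $(p_j) \in \Delta_\infty$.

• Direct proof via the moment generating function of $-\log R_n$.

• For c > 1,

$$P\big(R_n > \underbrace{n^{-\frac{1-\alpha}{\alpha}}\, c}_{\varepsilon_n}\big) \to P(T_{\alpha,\theta} < \alpha/c)$$

compared to $P(R_n > \varepsilon_n) \to \Phi(-\theta a)$ for $\varepsilon_n = e^{-n/\theta} e^{a\sqrt{n}}$ in the PD(0, θ) case.

• For the case PD(α, 0), the weights $(\tilde p_j)$ can be represented as

$$\tilde p_j = \frac{\tilde J_j}{T_\alpha}, \qquad (\tilde J_j) =_d (\Gamma_j^{-1/\alpha}) \text{ in size-biased random order}$$

hence

$$T_\alpha R_n = \sum_{j > n} \tilde J_j \quad \Longrightarrow \quad n^{\frac{1-\alpha}{\alpha}} \sum_{j > n} \tilde J_j \to_{a.s.} \alpha:$$

the small jumps of the stable process in a size-biased random order, once properly rescaled by $n^{\frac{1-\alpha}{\alpha}}$, converge to a proportion equal to α.

• Interestingly, α is also the proportion of singletons, or dust, with respect to the number of unique values, say $K_n$, in a multinomial sample of size n from the PD(α, θ) distribution:

$$\begin{cases} n^{-\alpha} M_{n,1} \to_{a.s.} \alpha\,(T_{\alpha,\theta})^{-\alpha} \\ n^{-\alpha} K_n \to_{a.s.} (T_{\alpha,\theta})^{-\alpha} \end{cases} \quad \Longrightarrow \quad \frac{M_{n,1}}{K_n} \to_{a.s.} \alpha$$

where $M_{n,k} = \#\{j : n_j = k,\ j = 1, \ldots, K_n\}$ such that $\sum_k k\,M_{n,k} = n$.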
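The proportion of singletons can be illustrated by simulating the two-parameter Pólya urn. The sketch below assumes the standard prediction rule, in which draw i + 1 is a new value with probability (θ + αk)/(θ + i) and equals the j-th existing value with probability (nj − α)/(θ + i); `crp_counts` and `dust_ratio` are hypothetical helper names.

```python
import random


def crp_counts(alpha, theta, n, seed=None):
    """Cluster sizes n_1, ..., n_K after n draws from the two-parameter Polya urn."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n):                         # i = number of previous draws
        k = len(sizes)
        x = rng.random() * (theta + i)
        if k == 0 or x < theta + alpha * k:    # new value (always for the first draw)
            sizes.append(1)
            continue
        x -= theta + alpha * k
        for j in range(k):                     # existing value j has mass n_j - alpha
            x -= sizes[j] - alpha
            if x < 0.0:
                sizes[j] += 1
                break
        else:
            sizes[-1] += 1                     # guard against rounding at the boundary
    return sizes


def dust_ratio(sizes):
    """M_{n,1} / K_n: the fraction of unique values observed exactly once."""
    return sum(1 for s in sizes if s == 1) / len(sizes)


if __name__ == "__main__":
    sizes = crp_counts(alpha=0.5, theta=1.0, n=20000, seed=0)
    print(len(sizes), dust_ratio(sizes))
```

Convergence is slow, since Kn grows only like n^α, so the ratio is a noisy estimate of α at moderate n.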

Large deviations PD(0, θ)

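• For a < 1/θ,

$$\frac{1}{n} \log P\Big(\frac{-\log R_n}{n} \le a\Big) \to -[\theta a - 1 - \log(\theta a)]$$

i.e. $-\log R_n$ satisfies a Large Deviation Principle (LDP) with speed n and rate function $I(x) = \theta x - 1 - \log(\theta x) > 0$ for θx < 1.

• Hence

$$P\big(R_n > \underbrace{e^{-n/\theta}\,e^{(1/\theta - a)n}}_{\varepsilon_n}\big) \asymp e^{-I(a)n}$$

that is $\varepsilon_n = e^{-n/\theta} s_n \to 0$ for $s_n = e^{cn} \to \infty$ for c = 1/θ − a > 0.

• Beyond large deviation, $\varepsilon_n = n^{-\gamma}$,

$$P(R_n > \varepsilon_n) = P(\mathrm{Gamma}(n, \theta) < \gamma \log n) \le \frac{(\theta\gamma\log n)^n}{\Gamma(n+1)} \asymp \frac{1}{\sqrt{n}} \exp\{-n \log n\}$$

by Stirling’s formula.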
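The LDP limit can be checked numerically without simulation, using the renewal identity P(Tn ≤ t) = P(Xt ≥ n) with Xt ∼ Pois(θt) and summing the Poisson tail in log space. The helper names `log_prob_gamma_below` and `rate_dp` are hypothetical, and cutting the tail after `n_terms` terms is an approximation that is harmless when θt < n, because the terms then decay geometrically.

```python
import math


def log_prob_gamma_below(n, theta, t, n_terms=2000):
    """log P(Gamma(n, rate theta) <= t) = log P(Poisson(theta * t) >= n), by log-sum-exp."""
    lam = theta * t
    logs = [-lam + k * math.log(lam) - math.lgamma(k + 1) for k in range(n, n + n_terms)]
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))


def rate_dp(theta, a):
    """Rate function I(a) = theta a - 1 - log(theta a)."""
    return theta * a - 1.0 - math.log(theta * a)


if __name__ == "__main__":
    theta, a = 2.0, 0.25
    for n in (100, 1000, 2000):
        print(n, log_prob_gamma_below(n, theta, n * a) / n, -rate_dp(theta, a))
```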

Large deviations PD(α, θ)

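THEOREM 2. Let $I(x) = \big(x - \frac{1-\alpha}{\alpha}\big)(\theta + \alpha)$ and $J(x) = 1 - x\,\frac{\alpha}{1-\alpha}$.

(i)
$$\frac{1}{\log n} \log P\Big(\frac{-\log R_n}{\log n} \ge a\Big) \to -I(a), \qquad a > \frac{1-\alpha}{\alpha}$$

(ii)
$$\frac{1}{\log n} \log\Big[-\frac{1}{1-\alpha} \log P\Big(\frac{-\log R_n}{\log n} \le a\Big)\Big] \to J(a), \qquad a < \frac{1-\alpha}{\alpha}$$

• Part (i) is a LDP for $-\log R_n$ with speed log n and rate function I(x). See Dembo and Zeitouni (2010, Chp. 2).

• It corresponds to

$$P\big(R_n \le n^{-\frac{1-\alpha}{\alpha}}\, n^{-c}\big) \asymp n^{-(\theta+\alpha)c}, \qquad c > 0$$

that is, the probability that $R_n$ is smaller than a negative power of n smaller than its long run behavior $n^{-\frac{1-\alpha}{\alpha}}$ vanishes polynomially fast.

• Part (ii) is a non standard LDP which corresponds to

$$P\big(R_n > \underbrace{n^{-\frac{1-\alpha}{\alpha}}\, n^{c}}_{\varepsilon_n}\big) \asymp e^{-(1-\alpha)\, n^{\frac{\alpha}{1-\alpha} c}}, \qquad 0 < c < \frac{1-\alpha}{\alpha}$$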

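• It is of direct use for posterior asymptotics: for $\varepsilon_n = n^{-\gamma}(\log n)^t$, $\gamma \in (0, 1/2)$, the sequence $m_n$ which satisfies

$$P(R_{m_n} > \varepsilon_n) \lesssim e^{-cn\varepsilon_n^2}$$

is given by $m_n = \lfloor n\varepsilon_n^{\tau} (\log n)^{ct} \rfloor$, for $\tau = 2 - \frac{\alpha}{1-\alpha}$ and c = α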


1−2γ .

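• Compared with the PD(0, θ) case, where $m_n = \lfloor n\varepsilon_n^2/\log n \rfloor$, $m_n = \lfloor n\varepsilon_n^{\tau} (\log n)^{ct} \rfloor$ grows faster since τ < 2, and, in particular,

$$\log N(\varepsilon_n, \mathcal{F}_n, \|\cdot\|_1) \lesssim m_n \log(1/\varepsilon_n) = n\varepsilon_n^2\; n^{\frac{\alpha}{1-\alpha}\gamma}\, (\log n)^{(c - \frac{\alpha}{1-\alpha})t + 1}$$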

• . . .
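The exponent τ = 2 − α/(1 − α) can be traced back to Part (ii) of Theorem 2 by a heuristic calculation. The sketch below treats the asymptotic equivalence in Part (ii) as if it held uniformly in the exponent, which is an assumption and not something the theorem states, and it ignores logarithmic factors. The symbol b is introduced here only to avoid a clash with the two uses of c above.

```latex
\begin{align*}
P\big(R_m > m^{-\frac{1-\alpha}{\alpha}}\, m^{b}\big)
  &\asymp \exp\big\{-(1-\alpha)\, m^{\frac{\alpha}{1-\alpha} b}\big\},
  \qquad 0 < b < \tfrac{1-\alpha}{\alpha}, \\
\varepsilon = m^{-\frac{1-\alpha}{\alpha}}\, m^{b}
  &\;\Longrightarrow\;
  m^{\frac{\alpha}{1-\alpha} b}
  = \big(\varepsilon\, m^{\frac{1-\alpha}{\alpha}}\big)^{\frac{\alpha}{1-\alpha}}
  = m\, \varepsilon^{\frac{\alpha}{1-\alpha}}, \\
m\, \varepsilon_n^{\frac{\alpha}{1-\alpha}} \asymp n\varepsilon_n^{2}
  &\;\Longrightarrow\;
  m \asymp n\, \varepsilon_n^{\,2-\frac{\alpha}{1-\alpha}} = n\,\varepsilon_n^{\tau}.
\end{align*}
```

So, under this heuristic, requiring the tail probability of the residual mass to be of order exp{−nεn²} forces the number of retained weights to grow like nεn^τ with τ < 2, that is faster than in the PD(0, θ) case.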

