Habilitation à diriger des recherches

Pierre Pudlo

Université Montpellier 2
Institut de Mathématiques et de Modélisation de Montpellier (I3M)
Institut de Biologie Computationnelle
Labex NUMEV

12/12/2014
Pierre Pudlo (UM2) HDR 12/12/2014 1 / 22
Contents
1. Graph-based clustering
2. Approximate Bayesian computation
3. Bayesian computation with empirical likelihood
Graph based clustering
Idea. Define a (weighted) graph which links similar observations X_i, i = 1, ..., n.

Example: link X_i and X_j iff d(X_i, X_j) ≤ h, where h is a tuning parameter (bandwidth).

To obtain the clusters, cut the graph at a minimal number of edges while partitioning the data into large groups.

The Ncut (Cheeger) criterion for k = 2 groups. Optimize

    Ncut(A, Ā) = cut(A, Ā) / min(vol(A), vol(Ā)),

where Ā is the complement of A in the set of observations.
Take vol(A) and vol(Ā) into account to avoid A = {a single outlier}, Ā = {the rest of the data set}.
A first asymptotic result
On spectral clustering. The relaxed optimization problem is based on the spectral decomposition of the matrix

    Q_h(i, j) = k_h(X_j − X_i) / Σ_ℓ k_h(X_ℓ − X_i)

• von Luxburg et al. (2008) obtained a deterministic limit when

    (∗) n → ∞ and h is fixed.

This looks like the limit of the kernel density estimator f_{n,h}(x), whose limit when h is fixed is ∫ f(y) k_h(x − y) dy.

• With Bruno Pelletier, we have obtained asymptotic results under (∗). We add a pre-processing step: remove data points in areas of low density, then cluster the kept data points. The limit then does not depend on h and can be characterized geometrically in terms of level sets of the true density f(x).

• But the asymptotic behavior of spectral clustering when

    (∗∗) n → ∞ and h → 0

is much more difficult to understand.
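The spectral relaxation above can be sketched on a toy data set. This is a minimal illustration, not the talk's implementation: the Gaussian kernel, the data set and the function name `spectral_clusters` are all assumptions. It works with the symmetrized version of Q_h (same spectrum) so that the stable `eigh` routine applies, and splits on the second eigenvector.

```python
# Minimal sketch of spectral clustering with a fixed bandwidth h,
# assuming a Gaussian kernel k_h; names and data are illustrative.
import numpy as np

def spectral_clusters(X, h):
    """Two-way spectral split based on the row-normalized affinity Q_h."""
    # Pairwise Gaussian kernel weights k_h(X_j - X_i).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))
    d = K.sum(axis=1)
    # Symmetrized version of Q_h(i, j) = k_h(X_j - X_i) / sum_l k_h(X_l - X_i):
    # S has the same eigenvalues as Q_h, and eigh is numerically stable.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    v2 = vecs[:, -2] / np.sqrt(d)           # eigenvector of Q_h, 2nd largest eigenvalue
    # Relaxed Ncut: split on the sign pattern of the second eigenvector.
    return (v2 > np.median(v2)).astype(int)

rng = np.random.default_rng(0)
# Two well-separated blobs in the plane.
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(4.0, 0.2, (20, 2))])
labels = spectral_clusters(X, h=1.5)
```

On such well-separated blobs the second eigenvector is nearly piecewise constant, so the median split recovers the two groups.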
A second asymptotic result
On the NP-hard optimization problem which minimizes

    Ncut(A, Ā) = cut(A, Ā) / min(vol(A), vol(Ā))

on the graph with bandwidth h.

Assume M is a compact subset of R^d and X_1, ..., X_n is a uniform random sample of M. (Known results exist when the points form a regular grid of M.)

• With Ery Arias-Castro and Bruno Pelletier, we have proven the asymptotic consistency of the Ncut problem when

    (∗∗∗) n → ∞ and h → 0, with h^(2d+1) ≫ 1/√n.

The limit is the solution of the Cheeger problem (cut M with a surface of minimal perimeter).
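Evaluating the Ncut objective for a given split is easy; only its minimization over all partitions is NP-hard. The sketch below evaluates Ncut on the h-neighborhood graph for two candidate splits; the function name and toy data are illustrative assumptions.

```python
# Sketch: evaluating Ncut(A, complement of A) on the h-neighborhood graph
# for a given split, following the definitions above; names are illustrative.
import numpy as np

def ncut(X, h, A):
    """Ncut of the partition (A, complement) on the graph linking points within distance h."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    W = ((d <= h) & ~np.eye(n, dtype=bool)).astype(float)   # unweighted h-graph
    inA = np.zeros(n, dtype=bool)
    inA[list(A)] = True
    cut = W[inA][:, ~inA].sum()                 # edges between A and its complement
    volA, volAc = W[inA].sum(), W[~inA].sum()   # sums of degrees on each side
    return cut / min(volA, volAc)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (15, 2)), rng.normal(4.0, 0.3, (15, 2))])
good = ncut(X, h=1.0, A=range(15))      # split along the true groups
bad = ncut(X, h=1.0, A=range(1, 16))    # split that cuts through both blobs
```

The natural split has no cross edges (Ncut = 0 here), while a misplaced split pays for the edges it cuts.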
Intractable likelihoods

Problem. How to perform a Bayesian analysis when the likelihood f(y|φ) is intractable?

Example 1. Gibbs random fields:

    f(y|φ) ∝ exp(−H(y, φ))

is known only up to the normalizing constant

    Z(φ) = Σ_y exp(−H(y, φ)).

Example 2. Neutral population genetics.

Aim. Infer demographic parameters on the past of some populations based on the trace left in the genomes of individuals sampled from current populations.

The latent process (the past history of the sample) lives in a space of high dimension. If y is the genetic data of the sample, the likelihood is

    f(y|φ) = ∫_Z f(y, z|φ) dz.

Typically, dim(Z) ≫ dim(Y). No hope of computing the likelihood, even with clever Monte Carlo algorithms?
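A brute-force computation of Z(φ) makes the intractability of Example 1 concrete: the sum runs over every configuration. The sketch below assumes a simple Ising-type energy (H counts agreeing neighbor pairs); this choice and the helper names are illustrative, not from the talk.

```python
# Sketch: brute-force normalizing constant Z(phi) for a tiny Ising-type
# Gibbs field, assuming H(y, phi) = -phi * (number of agreeing neighbor
# pairs). The 2**n sum is only feasible for toy sizes, which is the point.
import itertools
import math

def log_Z(n_sites, edges, phi):
    """log of sum over all 2**n_sites binary configurations of exp(-H(y, phi))."""
    total = 0.0
    for y in itertools.product([0, 1], repeat=n_sites):
        H = -phi * sum(y[i] == y[j] for i, j in edges)
        total += math.exp(-H)
    return math.log(total)

# 3x3 grid: 9 sites, 12 nearest-neighbor edges -> 512 terms in the sum.
# A 30x30 grid would need 2**900 terms, hence the intractable likelihood.
edges = [(r * 3 + c, r * 3 + c + 1) for r in range(3) for c in range(2)] + \
        [(r * 3 + c, (r + 1) * 3 + c) for r in range(2) for c in range(3)]
lz = log_Z(9, edges, phi=0.5)
```

At φ = 0 every configuration has weight 1, so Z reduces to the count 2^9 = 512, a handy sanity check.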
Approximate Bayesian computation
Idea. Infer the conditional distribution of φ given y_obs from simulations from the joint π(φ) f(y|φ).

ABC algorithm
A) Generate a large set of (φ, y) from the Bayesian model π(φ) f(y|φ)
B) Keep the particles (φ, y) such that d(η(y_obs), η(y)) ≤ ε
C) Return the φ's of the kept particles

Curse of dimensionality: y is replaced by some numerical summaries η(y).

Stage A) is computationally heavy! We end up rejecting almost all simulations, except those which fall in the neighborhood of η(y_obs). Sequential ABC algorithms try to avoid drawing φ in areas of low π(φ|y): an auto-calibrated ABC-SMC sampler with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert.
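Steps A)–C) can be sketched on a toy Gaussian model where everything is explicit. The model, prior and summary below are assumptions chosen so the result can be checked against the exact posterior; as a shortcut, η(y) = mean(y) is simulated directly from its known N(φ, 1/n) distribution instead of simulating full data sets.

```python
# Sketch of the ABC rejection algorithm (steps A-C above) on a toy model:
# y_i ~ N(phi, 1), prior phi ~ N(0, 10), summary eta(y) = mean(y).
import numpy as np

rng = np.random.default_rng(2)
n, N, eps = 50, 100_000, 0.05
y_obs = rng.normal(1.0, 1.0, n)          # pseudo-observed data, true phi = 1
eta_obs = y_obs.mean()

# A) simulate (phi, eta(y)) from the joint pi(phi) f(y | phi); given phi,
#    mean(y) ~ N(phi, 1/n), so we can draw the summary directly.
phi = rng.normal(0.0, np.sqrt(10.0), N)
eta = rng.normal(phi, 1.0 / np.sqrt(n))

# B) keep the particles with |eta(y) - eta(y_obs)| <= eps
kept = phi[np.abs(eta - eta_obs) <= eps]

# C) the kept phi's approximate pi(phi | eta(y_obs))
abc_mean = kept.mean()
```

With this diffuse prior the exact posterior mean is ≈ 0.998 × mean(y_obs), so the ABC estimate should land very close to the observed mean.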
ABC sequential sampler
How to calibrate ε_1 ≥ ε_2 ≥ ... ≥ ε_T and T to be efficient? The auto-calibrated ABC-SMC sampler developed with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert.
ABC target
Three levels of approximation of the posterior π(φ | y_obs):

1. the ABC posterior distribution π(φ | η(y_obs)),
2. approximated with a kernel of bandwidth ε (or with k-nearest neighbours): π(φ | d(η(y), η(y_obs)) ≤ ε),
3. a Monte Carlo error: sample size N < ∞.

See, e.g., our review with J.-M. Marin, C. Robert and R. Ryder.

If the η(y) are not sufficient statistics,

    π(φ | y_obs) ≠ π(φ | η(y_obs)):

information regarding y_obs might be lost!

Curse of dimensionality: we cannot have both ε small and N large when η(y) is of large dimension. Hence the post-processing of Beaumont et al. (2002) with local linear regression. But the lack of sufficiency might still be problematic; see Robert et al. (2011) for model choice.
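The local linear post-processing can be sketched as follows: regress the kept φ's on the kept summaries and shift every particle to the value the fit predicts at η(y_obs). This is a toy scalar version in the spirit of Beaumont et al. (2002), with an assumed linear-Gaussian setup, not their full weighted implementation.

```python
# Sketch of the Beaumont et al. (2002) local-linear post-processing on a
# toy scalar problem: phi ~ N(0, 4), eta = phi + noise; names illustrative.
import numpy as np

rng = np.random.default_rng(3)
N, eps = 200_000, 0.5
eta_obs = 1.0
phi = rng.normal(0.0, 2.0, N)           # prior draws
eta = phi + rng.normal(0.0, 0.3, N)     # summary simulated given phi
keep = np.abs(eta - eta_obs) <= eps
phi_k, eta_k = phi[keep], eta[keep]

# Fit phi ~ a + b * eta on the kept particles by least squares, then
# correct each particle back to the observed summary:
# phi_adj = phi - b * (eta - eta_obs).
A = np.column_stack([np.ones(phi_k.size), eta_k])
(a, b), *_ = np.linalg.lstsq(A, phi_k, rcond=None)
phi_adj = phi_k - b * (eta_k - eta_obs)
```

The adjustment removes the trend of φ in η across the acceptance window, so the adjusted sample is both better centered and less dispersed than the raw kept particles.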
ABC model choice
ABC model choice algorithm
A) Generate a large set of (m, φ, y) from the Bayesian model π(m) π_m(φ) f_m(y|φ)
B) Keep the particles (m, φ, y) such that d(η(y), η(y_obs)) ≤ ε
C) For each m, return p_m(y_obs) = proportion of m among the kept particles

Likewise, if η(y) is not sufficient for the model choice issue,

    π(m | y) ≠ π(m | η(y)).

It might be difficult to design an informative η(y).

Toy example.
Model 1: y_i iid ~ N(φ, 1)
Model 2: y_i iid ~ N(φ, 2)
Same prior on φ (whatever the model) and a uniform prior on the model index.

η(y) = y_1 + ... + y_n is sufficient to estimate φ in both models, but η(y) carries no information regarding the variance (hence none on the model choice issue). Other examples are given in Robert et al. (2011).

In population genetics, it might be difficult to find summary statistics that help discriminate between models (= possible historical scenarios on the sampled populations).
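The toy example can be run numerically: with η(y) = Σ y_i, ABC model choice returns probabilities near the 1/2 prior even when the data clearly come from Model 2. The prior on φ and the sample sizes below are assumptions; as a shortcut, η is drawn from its exact N(nφ, n·sd²) distribution rather than from full data sets.

```python
# Sketch of the toy example: ABC model choice between N(phi, 1) and
# N(phi, 2) with eta(y) = sum(y). Since eta carries no information on the
# variance, the ABC posterior probability stays near the 1/2 prior.
import numpy as np

rng = np.random.default_rng(4)
n, N, eps = 20, 200_000, 0.5
y_obs = rng.normal(0.0, np.sqrt(2.0), n)    # data actually from Model 2
eta_obs = y_obs.sum()

m = rng.integers(1, 3, N)                   # uniform prior on {1, 2}
phi = rng.normal(0.0, 2.0, N)               # same prior on phi in both models
sd = np.where(m == 1, 1.0, np.sqrt(2.0))
# Given (m, phi), eta(y) = sum(y) ~ N(n * phi, n * sd**2).
eta = rng.normal(n * phi, np.sqrt(n) * sd)
kept_m = m[np.abs(eta - eta_obs) <= eps]
p2 = (kept_m == 2).mean()                   # ABC estimate of pi(m = 2 | eta_obs)
```

The estimate p2 hovers around 0.5 whatever the data: exactly the failure mode of an insufficient summary.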
ABC model choice (continued)

If ε is tuned so that the number of kept particles is k, then p_m is a k-nearest-neighbor estimate of

    E( 1{M = m} | η(y_obs) ).

Approximating the posterior probabilities of model m is thus a regression problem where
- the response is 1{M = m},
- the covariables are the summary statistics η(y),
- the loss is L2 (conditional expectation).

The preferred method to approximate the posterior probabilities in DIYABC is a local multinomial regression. This becomes delicate if dim(η(y)) is large, or when the summary statistics are highly correlated.
Choosing between hidden random fields
Choosing between dependency graphs: 4 or 8 neighbours?

Models:
α, β ~ prior
z | β ~ Potts on G4 or G8 with interaction β
y | z, α ~ ∏_i P(y_i | z_i, α)

How to sum up the noisy y? Without noise (directly observed field), sufficient statistics exist for the model choice issue.

With Julien Stoehr and Lionel Cucala: a method to design new summary statistics, based on a clustering of the observed data on the possible dependency graphs:
- number of connected components,
- size of the largest connected component,
- ...
Machine learning to analyse machine-simulated data

ABC model choice
A) Generate a large set of (m, φ, y) from π(m) π_m(φ) f_m(y|φ)
B) Infer (anything?) about m | η(y) with machine learning methods

In this machine learning perspective:
- the (iid) simulations of A) form the training set,
- y_obs becomes a new data point.

With J.-M. Marin, J.-M. Cornuet, A. Estoup, M. Gautier and C. P. Robert:
- predicting m is a classification problem,
- computing π(m|η(y)) is a regression problem.

It is well known that classification is much simpler than regression (because of the dimension of the object we infer). Why compute π(m|η(y)) if we know that π(m|y) ≠ π(m|η(y))?
An example with random forests on human SNP data

Out of Africa: 6 scenarios, 6 models.

Observed data: 4 populations, 30 individuals per population; 10,000 genotyped SNPs from the 1000 Genomes Project.

A random forest trained on 40,000 simulations (112 summary statistics) predicts the model which supports
- a single out-of-Africa colonization event,
- a secondary split between European and Asian lineages, and
- a recent admixture for Americans with African origin.

Confidence in the selected model?
Example (continued)
Benefits of random forests:
1. they can find the relevant statistics among a large set (112) to discriminate between models;
2. they have a lower prior misclassification error (≈ 6%) than other methods (ABC, i.e. k-nn: ≈ 18%);
3. they supply a similarity measure to compare η(y) and η(y_obs).

Confidence in the selected model? Compute the average of the misclassification error over an ABC approximation of the predictive (∗). Here, ≤ 0.1%.

    (∗) π(m, φ, y | η_obs) = π(m | η_obs) π_m(φ | η_obs) f_m(y | φ)
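The random-forest view of ABC model choice (the reference table as training set, y_obs as a new point) can be sketched on the Gaussian toy example, with (mean, std) as summaries instead of the 112 population-genetics statistics. Everything here is an assumed toy setup, and scikit-learn's `RandomForestClassifier` stands in for the forest used in the real analysis.

```python
# Sketch of ABC model choice via random forests: the training set is the
# simulated (m, eta(y)) reference table and y_obs is a new point to
# classify. Toy Gaussian models; the real analysis used 112 summaries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n, N = 100, 5_000

def simulate(model, size):
    """Draw eta(y) = (mean, std) of a sample from the given model."""
    phi = rng.normal(0.0, 2.0, size)
    sd = 1.0 if model == 1 else np.sqrt(2.0)
    y = rng.normal(phi[:, None], sd, (size, n))
    return np.column_stack([y.mean(axis=1), y.std(axis=1)])

# A) reference table of simulations from both models (the training set)
X = np.vstack([simulate(1, N), simulate(2, N)])
m = np.repeat([1, 2], N)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, m)

# B) classify the observed summaries (a new data point)
y_obs = rng.normal(0.0, np.sqrt(2.0), n)    # data drawn from model 2
eta_obs = np.array([[y_obs.mean(), y_obs.std()]])
m_hat = int(rf.predict(eta_obs)[0])
```

The forest learns on its own that the std coordinate discriminates the models while the mean does not, which mirrors point 1 of the benefits above.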
Another approximation of the likelihood
What if both
- the likelihood is intractable, and
- we are unable to simulate a dataset in a reasonable amount of time to resort to ABC?

First answer: use pseudo-likelihoods such as the pairwise composite likelihood

    f_PCL(y | φ) = ∏_{i<j} f(y_i, y_j | φ).

Maximum composite likelihood estimators φ̂(y) are suitable estimators, but they cannot substitute for a true likelihood in a Bayesian framework: this leads to credible intervals which are too narrow, i.e. over-confidence in φ̂(y); see e.g. Ribatet et al. (2012).

Our proposal with Kerrie Mengersen and Christian P. Robert: use the empirical likelihood of Owen (2001, 2011). It relies on iid blocks in the dataset y to reconstruct a likelihood and permits likelihood ratio tests, so confidence intervals are correct. The original aim of Owen was to remove parametric assumptions.
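A pairwise composite likelihood is easy to write down whenever the bivariate margins are explicit. The sketch below assumes an exchangeable Gaussian model (common mean φ, unit variances, known pairwise correlation ρ), which is an illustrative choice, not an example from the talk; for this model the maximum composite likelihood estimator is exactly the sample mean.

```python
# Sketch: pairwise composite log-likelihood f_PCL(y | phi) for an
# exchangeable Gaussian sample with unknown mean phi, marginal variance 1
# and known pairwise correlation rho; the bivariate densities are explicit.
import numpy as np
from itertools import combinations

def pairwise_cl(y, phi, rho=0.3):
    """log f_PCL(y | phi) = sum over pairs i < j of log f(y_i, y_j | phi)."""
    det = 1.0 - rho ** 2
    ll = 0.0
    for i, j in combinations(range(len(y)), 2):
        a, b = y[i] - phi, y[j] - phi
        quad = (a * a - 2 * rho * a * b + b * b) / det
        ll += -np.log(2 * np.pi) - 0.5 * np.log(det) - 0.5 * quad
    return ll

rng = np.random.default_rng(6)
# Exchangeable data centered at phi = 2: shared factor + individual noise.
z0 = rng.normal()
y = 2.0 + np.sqrt(0.3) * z0 + np.sqrt(0.7) * rng.normal(0.0, 1.0, 30)
grid = np.linspace(0.0, 4.0, 401)
phi_hat = grid[np.argmax([pairwise_cl(y, p) for p in grid])]
```

Maximizing over a grid recovers the sample mean, as the score of this composite likelihood vanishes exactly at φ = ȳ.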
Bayesian computation via empirical likelihood
With empirical likelihood, the parameter φ is defined through

    (∗) E( h(y_b, φ) ) = 0,

where
- y_b is one block of y,
- E is the expected value under the true distribution of the block y_b,
- h is a known function.

E.g., if φ is the mean of an iid sample, h(y_b, φ) = y_b − φ.

In population genetics, what is (∗) for
- dates of population splits,
- population sizes, etc.?
A block = the genetic data at a given locus, and h(y_b, φ) is the pairwise composite score function, which we can compute explicitly in many situations:

    h(y_b, φ) = ∇_φ log f_PCL(y_b | φ).

Benefits:
- much faster than ABC (no need to simulate fake data);
- the same accuracy as ABC, or even much better: no loss of information through summary statistics.
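For the simplest estimating equation above, h(y_b, φ) = y_b − φ, the empirical likelihood is computable in a few lines: maximize ∏ n p_i subject to Σ p_i = 1 and Σ p_i (y_i − φ) = 0, which reduces to a one-dimensional root-finding problem in the Lagrange multiplier. This is a minimal sketch for a scalar mean, not the population-genetics implementation; the function name and tolerances are illustrative.

```python
# Sketch of Owen's empirical likelihood for a scalar mean: each block is a
# single observation y_b with h(y_b, phi) = y_b - phi. The log EL ratio is
# obtained by bisection on the Lagrange multiplier.
import numpy as np

def log_el_ratio(y, phi, tol=1e-10):
    """log of the empirical likelihood ratio R(phi) = prod_i n * p_i(phi)."""
    z = y - phi
    if z.max() <= 0 or z.min() >= 0:
        return -np.inf                      # phi outside the convex hull of y
    # The multiplier must keep all 1 + lam * z_i > 0.
    lo, hi = -1.0 / z.max() + tol, -1.0 / z.min() - tol
    # g(lam) = sum z_i / (1 + lam * z_i) is strictly decreasing: bisect g = 0.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (z / (1 + mid * z)).sum() > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    # p_i = 1 / (n * (1 + lam * z_i)), hence sum of log(n * p_i):
    return -np.log1p(lam * z).sum()

rng = np.random.default_rng(7)
y = rng.normal(1.0, 1.0, 50)
# The EL ratio is maximal (log ratio = 0) at phi = mean(y) and drops as
# phi moves away, which is what makes likelihood-ratio intervals work.
```

Outside the convex hull of the data the constraint set is empty and the ratio is zero, hence the −inf branch.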
An experiment
Evolutionary scenario: an MRCA giving rise to POP 0, POP 1 and POP 2, with split times τ1 and τ2.

Dataset:
- 50 genes per population,
- 100 microsatellite loci.

Assumptions:
- Ne identical over all populations,
- φ = log10(θ, τ1, τ2),
- non-informative prior.

Comparison of ABC and EL (histogram = EL, curve = ABC, vertical line = "true" parameter).