Habilitation à diriger des recherches

Pierre Pudlo

Université Montpellier 2
Institut de Mathématiques et de Modélisation de Montpellier (I3M)
Institut de Biologie Computationnelle
Labex NUMEV

12/12/2014
Pierre Pudlo (UM2) HDR 12/12/2014 1 / 22
Contents
1. Graph-based clustering
2. Approximate Bayesian computation
3. Bayesian computation with empirical likelihood
Graph based clustering
Idea. Define a (weighted) graph which links similar observations X_i, i = 1, ..., n.

Example: link X_i and X_j iff d(X_i, X_j) ≤ h, where h is a tuning parameter (bandwidth).

To obtain the clusters, cut the graph at a minimal number of edges while partitioning the data into large groups.

The Ncut (Cheeger) criterion for k = 2 groups. Optimize

    Ncut(A, Ā) = cut(A, Ā) / min(vol(A), vol(Ā)),

where Ā is the complement of A in the set of observations.
Take vol(A) and vol(Ā) into account to avoid A = {a single outlier}, Ā = {the rest of the data set}.
A first asymptotic result
On spectral clustering. The relaxed optimization problem is based on the spectral decomposition of the matrix

    Q_h(i, j) = k_h(X_j − X_i) / Σ_ℓ k_h(X_ℓ − X_i)

• von Luxburg et al. (2008) obtained a deterministic limit when

    (∗) n → ∞ and h is fixed.

This looks like the limit of the kernel density estimator f_{n,h}(x), whose limit when h is fixed is ∫ f(y) k_h(x − y) dy.

• With Bruno Pelletier, we have obtained asymptotic results under (∗). We add a pre-processing step: remove data points in areas of low density, then cluster the kept data points. The limit then does not depend on h and can be characterized geometrically in terms of level sets of the true density f(x).

• But the asymptotic behavior of spectral clustering when

    (∗∗) n → ∞ and h → 0

is much more difficult to understand.
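The spectral relaxation above can be sketched on a toy data set. This is a minimal illustration, not the talk's implementation: the Gaussian kernel, the data set and the function name `spectral_clusters` are all assumptions. It works with the symmetrized version of Q_h (same spectrum) so that the stable `eigh` routine applies, and splits on the second eigenvector.

```python
# Minimal sketch of spectral clustering with a fixed bandwidth h,
# assuming a Gaussian kernel k_h; names and data are illustrative.
import numpy as np

def spectral_clusters(X, h):
    """Two-way spectral split based on the row-normalized affinity Q_h."""
    # Pairwise Gaussian kernel weights k_h(X_j - X_i).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))
    d = K.sum(axis=1)
    # Symmetrized version of Q_h(i, j) = k_h(X_j - X_i) / sum_l k_h(X_l - X_i):
    # S has the same eigenvalues as Q_h, and eigh is numerically stable.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    v2 = vecs[:, -2] / np.sqrt(d)           # eigenvector of Q_h, 2nd largest eigenvalue
    # Relaxed Ncut: split on the sign pattern of the second eigenvector.
    return (v2 > np.median(v2)).astype(int)

rng = np.random.default_rng(0)
# Two well-separated blobs in the plane.
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(4.0, 0.2, (20, 2))])
labels = spectral_clusters(X, h=1.5)
```

On such well-separated blobs the second eigenvector is nearly piecewise constant, so the median split recovers the two groups.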
A second asymptotic result
On the NP-hard optimization problem which minimizes

    Ncut(A, Ā) = cut(A, Ā) / min(vol(A), vol(Ā))

on the graph with bandwidth h.

Assume M is a compact subset of R^d and X_1, ..., X_n is a uniform random sample of M. (Known results exist when the points form a regular grid of M.)

• With Ery Arias-Castro and Bruno Pelletier, we have proven the asymptotic consistency of the Ncut problem when

    (∗∗∗) n → ∞ and h → 0, with h^(2d+1) ≫ 1/√n.

The limit is the solution of the Cheeger problem (cut M with a surface of minimal perimeter).
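Evaluating the Ncut objective for a given split is easy; only its minimization over all partitions is NP-hard. The sketch below evaluates Ncut on the h-neighborhood graph for two candidate splits; the function name and toy data are illustrative assumptions.

```python
# Sketch: evaluating Ncut(A, complement of A) on the h-neighborhood graph
# for a given split, following the definitions above; names are illustrative.
import numpy as np

def ncut(X, h, A):
    """Ncut of the partition (A, complement) on the graph linking points within distance h."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    W = ((d <= h) & ~np.eye(n, dtype=bool)).astype(float)   # unweighted h-graph
    inA = np.zeros(n, dtype=bool)
    inA[list(A)] = True
    cut = W[inA][:, ~inA].sum()                 # edges between A and its complement
    volA, volAc = W[inA].sum(), W[~inA].sum()   # sums of degrees on each side
    return cut / min(volA, volAc)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (15, 2)), rng.normal(4.0, 0.3, (15, 2))])
good = ncut(X, h=1.0, A=range(15))      # split along the true groups
bad = ncut(X, h=1.0, A=range(1, 16))    # split that cuts through both blobs
```

The natural split has no cross edges (Ncut = 0 here), while a misplaced split pays for the edges it cuts.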
Intractable likelihoods

Problem. How to perform a Bayesian analysis when the likelihood f(y|φ) is intractable?

Example 1. Gibbs random fields:

    f(y|φ) ∝ exp(−H(y, φ))

is known only up to the normalizing constant

    Z(φ) = Σ_y exp(−H(y, φ)).

Example 2. Neutral population genetics.

Aim. Infer demographic parameters on the past of some populations based on the trace left in the genomes of individuals sampled from current populations.

The latent process (the past history of the sample) lives in a space of high dimension. If y is the genetic data of the sample, the likelihood is

    f(y|φ) = ∫_Z f(y, z|φ) dz.

Typically, dim(Z) ≫ dim(Y). No hope of computing the likelihood, even with clever Monte Carlo algorithms?
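A brute-force computation of Z(φ) makes the intractability of Example 1 concrete: the sum runs over every configuration. The sketch below assumes a simple Ising-type energy (H counts agreeing neighbor pairs); this choice and the helper names are illustrative, not from the talk.

```python
# Sketch: brute-force normalizing constant Z(phi) for a tiny Ising-type
# Gibbs field, assuming H(y, phi) = -phi * (number of agreeing neighbor
# pairs). The 2**n sum is only feasible for toy sizes, which is the point.
import itertools
import math

def log_Z(n_sites, edges, phi):
    """log of sum over all 2**n_sites binary configurations of exp(-H(y, phi))."""
    total = 0.0
    for y in itertools.product([0, 1], repeat=n_sites):
        H = -phi * sum(y[i] == y[j] for i, j in edges)
        total += math.exp(-H)
    return math.log(total)

# 3x3 grid: 9 sites, 12 nearest-neighbor edges -> 512 terms in the sum.
# A 30x30 grid would need 2**900 terms, hence the intractable likelihood.
edges = [(r * 3 + c, r * 3 + c + 1) for r in range(3) for c in range(2)] + \
        [(r * 3 + c, (r + 1) * 3 + c) for r in range(2) for c in range(3)]
lz = log_Z(9, edges, phi=0.5)
```

At φ = 0 every configuration has weight 1, so Z reduces to the count 2^9 = 512, a handy sanity check.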
Approximate Bayesian computation
Idea. Infer the conditional distribution of φ given y_obs from simulations from the joint π(φ) f(y|φ).

ABC algorithm
A) Generate a large set of (φ, y) from the Bayesian model π(φ) f(y|φ)
B) Keep the particles (φ, y) such that d(η(y_obs), η(y)) ≤ ε
C) Return the φ's of the kept particles

Curse of dimensionality: y is replaced by some numerical summaries η(y).

Stage A) is computationally heavy! We end up rejecting almost all simulations, except those which fall in the neighborhood of η(y_obs). Sequential ABC algorithms try to avoid drawing φ in areas of low π(φ|y): an auto-calibrated ABC-SMC sampler with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert.
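Steps A)–C) can be sketched on a toy Gaussian model where everything is explicit. The model, prior and summary below are assumptions chosen so the result can be checked against the exact posterior; as a shortcut, η(y) = mean(y) is simulated directly from its known N(φ, 1/n) distribution instead of simulating full data sets.

```python
# Sketch of the ABC rejection algorithm (steps A-C above) on a toy model:
# y_i ~ N(phi, 1), prior phi ~ N(0, 10), summary eta(y) = mean(y).
import numpy as np

rng = np.random.default_rng(2)
n, N, eps = 50, 100_000, 0.05
y_obs = rng.normal(1.0, 1.0, n)          # pseudo-observed data, true phi = 1
eta_obs = y_obs.mean()

# A) simulate (phi, eta(y)) from the joint pi(phi) f(y | phi); given phi,
#    mean(y) ~ N(phi, 1/n), so we can draw the summary directly.
phi = rng.normal(0.0, np.sqrt(10.0), N)
eta = rng.normal(phi, 1.0 / np.sqrt(n))

# B) keep the particles with |eta(y) - eta(y_obs)| <= eps
kept = phi[np.abs(eta - eta_obs) <= eps]

# C) the kept phi's approximate pi(phi | eta(y_obs))
abc_mean = kept.mean()
```

With this diffuse prior the exact posterior mean is ≈ 0.998 × mean(y_obs), so the ABC estimate should land very close to the observed mean.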
ABC sequential sampler
How to calibrate ε_1 ≥ ε_2 ≥ ... ≥ ε_T and T to be efficient? The auto-calibrated ABC-SMC sampler developed with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert.
ABC target
Three levels of approximation of the posterior π(φ | y_obs):

1. the ABC posterior distribution π(φ | η(y_obs)),
2. approximated with a kernel of bandwidth ε (or with k-nearest neighbours): π(φ | d(η(y), η(y_obs)) ≤ ε),
3. a Monte Carlo error: sample size N < ∞.

See, e.g., our review with J.-M. Marin, C. Robert and R. Ryder.

If the η(y) are not sufficient statistics,

    π(φ | y_obs) ≠ π(φ | η(y_obs)):

information regarding y_obs might be lost!

Curse of dimensionality: we cannot have both ε small and N large when η(y) is of large dimension. Hence the post-processing of Beaumont et al. (2002) with local linear regression. But the lack of sufficiency might still be problematic; see Robert et al. (2011) for model choice.
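The local linear post-processing can be sketched as follows: regress the kept φ's on the kept summaries and shift every particle to the value the fit predicts at η(y_obs). This is a toy scalar version in the spirit of Beaumont et al. (2002), with an assumed linear-Gaussian setup, not their full weighted implementation.

```python
# Sketch of the Beaumont et al. (2002) local-linear post-processing on a
# toy scalar problem: phi ~ N(0, 4), eta = phi + noise; names illustrative.
import numpy as np

rng = np.random.default_rng(3)
N, eps = 200_000, 0.5
eta_obs = 1.0
phi = rng.normal(0.0, 2.0, N)           # prior draws
eta = phi + rng.normal(0.0, 0.3, N)     # summary simulated given phi
keep = np.abs(eta - eta_obs) <= eps
phi_k, eta_k = phi[keep], eta[keep]

# Fit phi ~ a + b * eta on the kept particles by least squares, then
# correct each particle back to the observed summary:
# phi_adj = phi - b * (eta - eta_obs).
A = np.column_stack([np.ones(phi_k.size), eta_k])
(a, b), *_ = np.linalg.lstsq(A, phi_k, rcond=None)
phi_adj = phi_k - b * (eta_k - eta_obs)
```

The adjustment removes the trend of φ in η across the acceptance window, so the adjusted sample is both better centered and less dispersed than the raw kept particles.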
ABC model choice
ABC model choice algorithm
A) Generate a large set of (m, φ, y) from the Bayesian model π(m) π_m(φ) f_m(y|φ)
B) Keep the particles (m, φ, y) such that d(η(y), η(y_obs)) ≤ ε
C) For each m, return p_m(y_obs) = proportion of m among the kept particles

Likewise, if η(y) is not sufficient for the model choice issue,

    π(m | y) ≠ π(m | η(y)).

It might be difficult to design an informative η(y).

Toy example.
Model 1: y_i iid ~ N(φ, 1)
Model 2: y_i iid ~ N(φ, 2)
Same prior on φ (whatever the model) and a uniform prior on the model index.

η(y) = y_1 + ... + y_n is sufficient to estimate φ in both models, but η(y) carries no information regarding the variance (hence none on the model choice issue). Other examples are given in Robert et al. (2011).

In population genetics, it might be difficult to find summary statistics that help discriminate between models (= possible historical scenarios on the sampled populations).
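The toy example can be run numerically: with η(y) = Σ y_i, ABC model choice returns probabilities near the 1/2 prior even when the data clearly come from Model 2. The prior on φ and the sample sizes below are assumptions; as a shortcut, η is drawn from its exact N(nφ, n·sd²) distribution rather than from full data sets.

```python
# Sketch of the toy example: ABC model choice between N(phi, 1) and
# N(phi, 2) with eta(y) = sum(y). Since eta carries no information on the
# variance, the ABC posterior probability stays near the 1/2 prior.
import numpy as np

rng = np.random.default_rng(4)
n, N, eps = 20, 200_000, 0.5
y_obs = rng.normal(0.0, np.sqrt(2.0), n)    # data actually from Model 2
eta_obs = y_obs.sum()

m = rng.integers(1, 3, N)                   # uniform prior on {1, 2}
phi = rng.normal(0.0, 2.0, N)               # same prior on phi in both models
sd = np.where(m == 1, 1.0, np.sqrt(2.0))
# Given (m, phi), eta(y) = sum(y) ~ N(n * phi, n * sd**2).
eta = rng.normal(n * phi, np.sqrt(n) * sd)
kept_m = m[np.abs(eta - eta_obs) <= eps]
p2 = (kept_m == 2).mean()                   # ABC estimate of pi(m = 2 | eta_obs)
```

The estimate p2 hovers around 0.5 whatever the data: exactly the failure mode of an insufficient summary.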
ABC model choice (continued)

If ε is tuned so that the number of kept particles is k, then p_m is a k-nearest-neighbor estimate of

    E( 1{M = m} | η(y_obs) ).

Approximating the posterior probabilities of model m is thus a regression problem where
- the response is 1{M = m},
- the covariables are the summary statistics η(y),
- the loss is L2 (conditional expectation).

The preferred method to approximate the posterior probabilities in DIYABC is a local multinomial regression. This becomes delicate if dim(η(y)) is large, or when the summary statistics are highly correlated.
Choosing between hidden random fields
Choosing between dependency graphs: 4 or 8 neighbours?

Models:
α, β ~ prior
z | β ~ Potts on G4 or G8 with interaction β
y | z, α ~ ∏_i P(y_i | z_i, α)

How to sum up the noisy y? Without noise (directly observed field), sufficient statistics exist for the model choice issue.

With Julien Stoehr and Lionel Cucala: a method to design new summary statistics, based on a clustering of the observed data on the possible dependency graphs:
- number of connected components,
- size of the largest connected component,
- ...
Machine learning to analyse machine-simulated data

ABC model choice
A) Generate a large set of (m, φ, y) from π(m) π_m(φ) f_m(y|φ)
B) Infer (anything?) about m | η(y) with machine learning methods

In this machine learning perspective:
- the (iid) simulations of A) form the training set,
- y_obs becomes a new data point.

With J.-M. Marin, J.-M. Cornuet, A. Estoup, M. Gautier and C. P. Robert:
- predicting m is a classification problem,
- computing π(m|η(y)) is a regression problem.

It is well known that classification is much simpler than regression (because of the dimension of the object we infer). Why compute π(m|η(y)) if we know that π(m|y) ≠ π(m|η(y))?
An example with random forests on human SNP data

Out of Africa: 6 scenarios, 6 models.

Observed data: 4 populations, 30 individuals per population; 10,000 genotyped SNPs from the 1000 Genomes Project.

A random forest trained on 40,000 simulations (112 summary statistics) predicts the model which supports
- a single out-of-Africa colonization event,
- a secondary split between European and Asian lineages, and
- a recent admixture for Americans with African origin.

Confidence in the selected model?
Example (continued)
Benefits of random forests:
1. they can find the relevant statistics among a large set (112) to discriminate between models;
2. they have a lower prior misclassification error (≈ 6%) than other methods (ABC, i.e. k-nn: ≈ 18%);
3. they supply a similarity measure to compare η(y) and η(y_obs).

Confidence in the selected model? Compute the average of the misclassification error over an ABC approximation of the predictive (∗). Here, ≤ 0.1%.

    (∗) π(m, φ, y | η_obs) = π(m | η_obs) π_m(φ | η_obs) f_m(y | φ)
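The random-forest view of ABC model choice (the reference table as training set, y_obs as a new point) can be sketched on the Gaussian toy example, with (mean, std) as summaries instead of the 112 population-genetics statistics. Everything here is an assumed toy setup, and scikit-learn's `RandomForestClassifier` stands in for the forest used in the real analysis.

```python
# Sketch of ABC model choice via random forests: the training set is the
# simulated (m, eta(y)) reference table and y_obs is a new point to
# classify. Toy Gaussian models; the real analysis used 112 summaries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n, N = 100, 5_000

def simulate(model, size):
    """Draw eta(y) = (mean, std) of a sample from the given model."""
    phi = rng.normal(0.0, 2.0, size)
    sd = 1.0 if model == 1 else np.sqrt(2.0)
    y = rng.normal(phi[:, None], sd, (size, n))
    return np.column_stack([y.mean(axis=1), y.std(axis=1)])

# A) reference table of simulations from both models (the training set)
X = np.vstack([simulate(1, N), simulate(2, N)])
m = np.repeat([1, 2], N)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, m)

# B) classify the observed summaries (a new data point)
y_obs = rng.normal(0.0, np.sqrt(2.0), n)    # data drawn from model 2
eta_obs = np.array([[y_obs.mean(), y_obs.std()]])
m_hat = int(rf.predict(eta_obs)[0])
```

The forest learns on its own that the std coordinate discriminates the models while the mean does not, which mirrors point 1 of the benefits above.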
Another approximation of the likelihood
What if both
- the likelihood is intractable, and
- we are unable to simulate a dataset in a reasonable amount of time to resort to ABC?

First answer: use pseudo-likelihoods such as the pairwise composite likelihood

    f_PCL(y | φ) = ∏_{i<j} f(y_i, y_j | φ).

Maximum composite likelihood estimators φ̂(y) are suitable estimators, but they cannot substitute for a true likelihood in a Bayesian framework: this leads to credible intervals which are too narrow, i.e. over-confidence in φ̂(y); see e.g. Ribatet et al. (2012).

Our proposal with Kerrie Mengersen and Christian P. Robert: use the empirical likelihood of Owen (2001, 2011). It relies on iid blocks in the dataset y to reconstruct a likelihood and permits likelihood ratio tests, so confidence intervals are correct. The original aim of Owen was to remove parametric assumptions.
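A pairwise composite likelihood is easy to write down whenever the bivariate margins are explicit. The sketch below assumes an exchangeable Gaussian model (common mean φ, unit variances, known pairwise correlation ρ), which is an illustrative choice, not an example from the talk; for this model the maximum composite likelihood estimator is exactly the sample mean.

```python
# Sketch: pairwise composite log-likelihood f_PCL(y | phi) for an
# exchangeable Gaussian sample with unknown mean phi, marginal variance 1
# and known pairwise correlation rho; the bivariate densities are explicit.
import numpy as np
from itertools import combinations

def pairwise_cl(y, phi, rho=0.3):
    """log f_PCL(y | phi) = sum over pairs i < j of log f(y_i, y_j | phi)."""
    det = 1.0 - rho ** 2
    ll = 0.0
    for i, j in combinations(range(len(y)), 2):
        a, b = y[i] - phi, y[j] - phi
        quad = (a * a - 2 * rho * a * b + b * b) / det
        ll += -np.log(2 * np.pi) - 0.5 * np.log(det) - 0.5 * quad
    return ll

rng = np.random.default_rng(6)
# Exchangeable data centered at phi = 2: shared factor + individual noise.
z0 = rng.normal()
y = 2.0 + np.sqrt(0.3) * z0 + np.sqrt(0.7) * rng.normal(0.0, 1.0, 30)
grid = np.linspace(0.0, 4.0, 401)
phi_hat = grid[np.argmax([pairwise_cl(y, p) for p in grid])]
```

Maximizing over a grid recovers the sample mean, as the score of this composite likelihood vanishes exactly at φ = ȳ.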
Bayesian computation via empirical likelihood
With empirical likelihood, the parameter φ is defined through

    (∗) E( h(y_b, φ) ) = 0,

where
- y_b is one block of y,
- E is the expected value under the true distribution of the block y_b,
- h is a known function.

E.g., if φ is the mean of an iid sample, h(y_b, φ) = y_b − φ.

In population genetics, what is (∗) for
- dates of population splits,
- population sizes, etc.?
A block = the genetic data at a given locus, and h(y_b, φ) is the pairwise composite score function, which we can compute explicitly in many situations:

    h(y_b, φ) = ∇_φ log f_PCL(y_b | φ).

Benefits:
- much faster than ABC (no need to simulate fake data);
- the same accuracy as ABC, or even much better: no loss of information through summary statistics.
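For the simplest estimating equation above, h(y_b, φ) = y_b − φ, the empirical likelihood is computable in a few lines: maximize ∏ n p_i subject to Σ p_i = 1 and Σ p_i (y_i − φ) = 0, which reduces to a one-dimensional root-finding problem in the Lagrange multiplier. This is a minimal sketch for a scalar mean, not the population-genetics implementation; the function name and tolerances are illustrative.

```python
# Sketch of Owen's empirical likelihood for a scalar mean: each block is a
# single observation y_b with h(y_b, phi) = y_b - phi. The log EL ratio is
# obtained by bisection on the Lagrange multiplier.
import numpy as np

def log_el_ratio(y, phi, tol=1e-10):
    """log of the empirical likelihood ratio R(phi) = prod_i n * p_i(phi)."""
    z = y - phi
    if z.max() <= 0 or z.min() >= 0:
        return -np.inf                      # phi outside the convex hull of y
    # The multiplier must keep all 1 + lam * z_i > 0.
    lo, hi = -1.0 / z.max() + tol, -1.0 / z.min() - tol
    # g(lam) = sum z_i / (1 + lam * z_i) is strictly decreasing: bisect g = 0.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (z / (1 + mid * z)).sum() > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    # p_i = 1 / (n * (1 + lam * z_i)), hence sum of log(n * p_i):
    return -np.log1p(lam * z).sum()

rng = np.random.default_rng(7)
y = rng.normal(1.0, 1.0, 50)
# The EL ratio is maximal (log ratio = 0) at phi = mean(y) and drops as
# phi moves away, which is what makes likelihood-ratio intervals work.
```

Outside the convex hull of the data the constraint set is empty and the ratio is zero, hence the −inf branch.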
An experiment
Evolutionary scenario: an MRCA giving rise to POP 0, POP 1 and POP 2, with split times τ1 and τ2.

Dataset:
- 50 genes per population,
- 100 microsatellite loci.

Assumptions:
- Ne identical over all populations,
- φ = log10(θ, τ1, τ2),
- non-informative prior.

Comparison of ABC and EL (histogram = EL, curve = ABC, vertical line = "true" parameter).