
πVAE: Encoding stochastic process priors with variational autoencoders

Swapnil Mishra *1, Seth Flaxman *2, Samir Bhatt *1

Abstract
Stochastic processes provide a mathematically elegant way to model complex data. In theory, they provide flexible priors over function classes that can encode a wide range of interesting assumptions. In practice, however, efficient inference by optimisation or marginalisation is difficult, a problem further exacerbated with big data and high dimensional input spaces. We propose a novel variational autoencoder (VAE) called the prior encoding variational autoencoder (πVAE). The πVAE is finitely exchangeable and Kolmogorov consistent, and thus is a continuous stochastic process. We use πVAE to learn low dimensional embeddings of function classes. We show that our framework can accurately learn expressive function classes such as Gaussian processes, but also properties of functions to enable statistical inference (such as the integral of a log Gaussian process). For popular tasks, such as spatial interpolation, πVAE achieves state-of-the-art performance both in terms of accuracy and computational efficiency. Perhaps most usefully, we demonstrate that the low dimensional, independently distributed latent space representation learnt provides an elegant and scalable means of performing Bayesian inference for stochastic processes within probabilistic programming languages such as Stan.

1. Introduction
A central task in machine learning is to specify a function or set of functions that best generalises to new data. Stochastic processes (Ross, 1996) provide a mathematically elegant way to define a class of functions, where each element from a stochastic process is a (usually infinite) collection of random variables.

*Equal contribution. 1School of Public Health, Imperial College London. 2Department of Mathematics, Imperial College London. Correspondence to: Swapnil Mishra <[email protected]>.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Copyright 2020 by the author(s). arXiv:2002.06873v1 [cs.LG], 17 Feb 2020.

Popular examples of stochastic processes in machine learning are Gaussian processes (Rasmussen & Williams, 2006), Dirichlet processes (Antoniak, 1974), log Gaussian Cox processes (Møller et al., 1998), Hawkes processes (Hawkes, 1971), Mondrian processes (Roy & Teh, 2009) and Gauss-Markov processes (Lindgren et al., 2011). Many of these processes are intimately connected with popular techniques in deep learning; for example, the infinite width limit of a single layer neural network and the evolution of a deep neural network by gradient descent are Gaussian processes (Jacot et al., 2018; Neal, 1996). However, while stochastic processes have many favourable properties, they are often cumbersome to work with in practice. For example, inference and prediction using a Gaussian process requires matrix inversions that scale cubically with data size, log Gaussian Cox processes require the evaluation of an intractable integral, and Markov processes are often highly correlated. Bayesian inference can be even more challenging due to complex high dimensional posterior topologies. Gold standard evaluation of posterior expectations is done by Markov chain Monte Carlo (MCMC) sampling, but high autocorrelation, narrow typical sets (Betancourt et al., 2017) and poor scalability have prevented its use in big data/model settings. A plethora of approximation algorithms exist (Minka, 2001; Ritter et al., 2018; Lakshminarayanan et al., 2017; Welling & Teh, 2011; Blundell et al., 2015), but few have asymptotic guarantees and, more troubling, even fewer actually yield accurate posterior estimates (Yao et al., 2018; Huggins et al., 2019; Hoffman et al., 2013; Yao et al., 2019). In this paper, rather than relying on approximate Bayesian inference to solve complex models, we extend variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) to develop portable models that can work with state-of-the-art Bayesian MCMC software such as Stan (Carpenter et al., 2017). Inference on the resulting models is tractable and yields accurate posterior expectations and uncertainty.

An autoencoder (Hinton & Salakhutdinov, 2006) is a model comprised of two component networks η: an encoder η_e : X → Z and a decoder η_d : Z → X, where the latent space Z is typically of much lower dimension than X. The autoencoder is learnt through the minimisation of a loss function L. A VAE extends the autoencoder into a generative model (Kingma & Welling, 2014).



In a VAE, the latent variables Z are independently normally distributed and a variational approximation of the posterior is estimated. In a variety of applications, VAEs fit well to the trained data and enable the generation of new data by sampling from the latent space; a mechanism that makes them a popular tool for probabilistic modelling (Kingma & Welling, 2019). In this paper we propose a novel use of VAEs: we learn low dimensional representations of samples from a given function class (e.g. sample paths from a Gaussian process prior). We then use the resulting low dimensional representation and decoder to perform Bayesian inference.

One key benefit of this approach is that we decouple the prior from inference, encoding arbitrarily complex prior function classes without needing to calculate any data likelihoods. A second key benefit is that when inference is performed, our sampler operates in a low dimensional, uncorrelated latent space, which greatly aids efficiency and computation. One limitation of this approach is that we are restricted to encoding finite-dimensional priors, because VAEs are not stochastic processes. To overcome this limitation we introduce a new VAE called the prior encoding VAE (πVAE) that satisfies exchangeability and Kolmogorov consistency and is thus a valid stochastic process by the Kolmogorov extension theorem.

The most closely related approach to ours is the neural process (Garnelo et al., 2018) which, similar to Gaussian processes, tries to learn a distribution over functions. The neural process is a stochastic process that fits to data while simultaneously learning the function class. In functional form, it is similar to a VAE except that, in neural processes, the KL divergence penalty term is the divergence between training and testing latent variables (as opposed to a standard normal in a VAE), and an aggregation step is included to allow for exchangeability and consistency. In practice, neural processes often learn functions that substantially underfit and therefore tend to predict both point estimates and uncertainty inaccurately (Kim et al., 2019). In contrast to the neural process, we employ a two step approach that disassociates prior encoding from the inference step.

Once a πVAE is trained, the complexity of the decoder scales linearly in the size of the largest hidden layer. Additionally, because the latent variables are penalised via the KL term for deviating from standard Gaussians, the latent space is approximately uncorrelated, leading to high effective sample sizes in MCMC sampling. The main contributions of this paper are:

• We apply the current generative framework of VAEs to perform full Bayesian inference by first encoding priors in training and then, given new data, running inference on the latent representation while keeping the trained decoder fixed.

• We propose a new generative model, πVAE, that further generalizes VAEs to be able to learn both priors over functions and properties of functions. We show that πVAE satisfies the Kolmogorov extension theorem and is therefore a stochastic process.

• Finally, we show the performance of πVAE on a range of simulated and real data, and show that πVAE achieves state-of-the-art performance in a spatial interpolation task.

The rest of this paper is structured as follows. Section 2 details the proposed framework and the generative model along with toy fitting examples. The experiments on large real world datasets are outlined in Section 3. We discuss our findings and conclude in Section 4.

2. Methods
In this section we first introduce the basics of VAEs and propose a way to extend them for learning priors over random vectors in Section 2.1. Next, in Section 2.2, we introduce a novel VAE framework, πVAE, which is a stochastic process and encodes priors over a function class.

2.1. Variational Autoencoders (VAEs)

A standard VAE has three components: an encoder network e(·) with parameters η_e, a latent space Z, and a decoder network d(·) with parameters η_d. In many settings, given an input x (e.g. a flattened image or discrete time series), e(η_e, x) and d(η_d, Z) are simply multilayer perceptrons, but they can be more complicated and include convolutional or recurrent layers. The output of the encoder network Z is a latent representation of x in a lower-dimensional Gaussian probabilistic space. The decoder network takes this latent space and tries to reconstruct the input by producing x̂. Putting both the encoder and decoder together results in the following model:

[z_µ, z_sd]^T = e(η_e, x)
Z ∼ N(z_µ, z_sd² I)
x̂ = d(η_d, Z)    (1)

End to end, equation 1 can be compactly expressed as x̂ = d(η_d, e(η_e, x)). To train a VAE, a variational approximation is used to estimate the posterior distribution p(Z|x, η_d, η_e) ∝ p(x|Z, η_d, η_e)p(Z). The variational approximation greatly simplifies inference by turning a marginalisation problem into an optimisation problem. The optimal parameters for the encoder and decoder are found by maximising the evidence lower bound:

arg max_{η_e, η_d}  p(x | Z, η_d, η_e) − KL(Z ‖ N(0, I))    (2)

The first term in equation 2 is the likelihood, quantifying how well x̂ matches x. The second term is a Kullback-Leibler divergence penalty to ensure that Z is as similar as possible to the prior distribution, a standard normal.
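To make equations (1) and (2) concrete, the following minimal sketch implements a VAE of this form in PyTorch. The layer sizes, the Gaussian reconstruction term, and all names here are illustrative assumptions rather than the exact architecture used later in the paper.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE matching equations (1)-(2): e(.) -> (z_mu, z_sd), d(.) -> x_hat."""
    def __init__(self, x_dim=200, z_dim=10, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def forward(self, x):
        z_mu, z_log_sd = self.encoder(x).chunk(2, dim=-1)
        z_sd = torch.exp(z_log_sd)
        z = z_mu + z_sd * torch.randn_like(z_sd)   # reparameterised draw Z ~ N(z_mu, z_sd^2 I)
        return self.decoder(z), z_mu, z_sd         # x_hat and the variational parameters

def negative_elbo(x, x_hat, z_mu, z_sd):
    recon = ((x - x_hat) ** 2).sum(dim=-1)                                    # Gaussian likelihood term
    kl = 0.5 * (z_mu ** 2 + z_sd ** 2 - 2 * torch.log(z_sd) - 1).sum(dim=-1)  # KL(N(z_mu, z_sd^2) || N(0, I))
    return (recon + kl).mean()
```

Minimising negative_elbo with any stochastic optimiser corresponds to maximising equation (2) under a Gaussian likelihood.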



Figure 1: Learning functions with a VAE: (a) prior samples from a VAE trained on Gaussian process samples; (b) we fit our VAE model to data drawn from a GP (blue) plus noise (black points). The posterior mean of our model is in red, with the 95% epistemic credible intervals shown in purple.


Once fully trained on a set of samples, we fix η_d and use the decoder as a generative model. To simplify subsequent notation we refer to a fully trained decoder as d and, when evaluated, as d(·). Generating a new sample vector v is simple: first draw z ∼ N(z_µ, z_sd² I) and then perform a deterministic transformation via the decoder, v = d(z).

VAEs have typically been used in the literature to create or learn a generative model of observed data (Kingma & Ba, 2014). Here, however, we introduce a novel application of VAEs: using them for Bayesian inference on new data after learning a prior over random vectors. In this role we first train a VAE on random vectors from an accessible space such as a reproducing kernel Hilbert space (samples from a Gaussian process). Training a VAE embeds these vectors in a lower dimensional probabilistic space. Once fully trained, sampling from this lower dimensional space and transforming via the decoder generates new random vectors. The decoder end of the VAE can therefore be used to perform inference on a new “data” vector, y, where the parameters of the decoder are fixed and the latent space is trainable. In this inferential scheme the unnormalised posterior distribution of Z is:

p(Z|y, d) ∝ p(y|d,Z)p(Z) (3)

In the standard VAE approach introduced above, we have:

p(z_µ, z_sd | y, d) ∝ p(y | d, z_µ, z_sd) N(z_µ, z_sd² I)
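As a sketch of this inference scheme, the function below evaluates the unnormalised log posterior of equation (3) for a latent vector, with the trained decoder held fixed and a Gaussian observation model assumed; the names and the noise scale are hypothetical. This is the quantity an MCMC sampler explores, whether written in Python or transcribed into Stan.

```python
import torch

def log_unnormalised_posterior(z, y, decoder, sigma=0.2):
    """log p(y | d, Z) + log p(Z), with the decoder d held fixed (equation 3)."""
    y_hat = decoder(z)                                    # deterministic pass through the fixed decoder
    log_lik = -0.5 * (((y - y_hat) / sigma) ** 2).sum()   # Gaussian observation model (assumed)
    log_prior = -0.5 * (z ** 2).sum()                     # standard normal prior on the latent Z
    return log_lik + log_prior
```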

Note that the observation model p(y|d, Z) in equation (3) could be generalised to include an additional parameter to account for observation error. It is useful to contrast the inference task in equation (3) with a Bayesian neural network (BNN) (Neal, 1996) or a Gaussian process in primal form (Rahimi & Recht, 2008). In a BNN with weights and biases ω and prior hyperparameters λ, the unnormalised posterior would be

p(ω, λ|s, y) ∝ p(y|s, ω, λ)p(ω|λ)p(λ) (4)

The key difference between equations (4) and (3) is the term p(ω|λ). ω is typically huge, sometimes in the millions, and is conditional on λ, whereas in equation (3) the latent dimension of Z is typically small (< 50), uncorrelated and unconditioned. Given the size of ω, full batch MCMC training is difficult and approximation algorithms tend to poorly capture the complex posterior (Yao et al., 2019; 2018). Additionally, ω tends to be highly correlated, making efficient MCMC nearly impossible. Finally, as the dimension and depth increase, the posterior distribution suffers from complex multi-modality and concentration onto a narrow typical set (Betancourt et al., 2017). By contrast, off-the-shelf MCMC methods like Stan (Carpenter et al., 2017) are very effective for equation (3), since they simply need to sample from a multidimensional Gaussian distribution, while the complexity is accounted for by the deterministic decoder.

An example of using VAEs to perform inference is shown in Figure 1, where we train a VAE (Z = 10) on samples drawn from a zero mean Gaussian process with RBF kernel K(δ) = exp(−δ²/8²). We drew 10^5 samples of size n = 200 over an equally spaced grid and trained a VAE for 100 epochs with batch size 500. The trained weights from the resulting decoder were then extracted and embedded within the probabilistic programming language Stan (Carpenter et al., 2017). We then sampled a new function from our GP with noise, y = GP(0, K) + N(0, 0.2), and performed inference on only the low dimensional latent space Z while keeping η_d fixed. As Figure 1 shows, we closely recover the true function, but also correctly estimate the posterior data noise parameter.


Most crucially, our MCMC samples showed virtually no autocorrelation, and all diagnostic checks such as R̂ and simulation-based calibration (Gabry et al., 2019) were excellent; detailed results are in the Appendix (Anonymous, 2020). Solving the equivalent problem using a Gaussian process prior would not only be considerably more expensive (O(n^3)), but correlation in the parameter space would complicate MCMC sampling and necessitate long chains with large amounts of thinning.
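For reference, training draws of the kind used above can be generated as in the short NumPy sketch below; the grid range and lengthscale are illustrative assumptions rather than the exact settings of this experiment.

```python
import numpy as np

def gp_prior_draws(n_draws, grid, lengthscale, jitter=1e-6):
    """Sample paths from a zero-mean GP with RBF kernel K(d) = exp(-d^2 / lengthscale^2)."""
    delta = grid[:, None] - grid[None, :]
    K = np.exp(-(delta / lengthscale) ** 2) + jitter * np.eye(len(grid))
    L = np.linalg.cholesky(K)                              # K = L L^T
    return (L @ np.random.randn(len(grid), n_draws)).T     # shape (n_draws, len(grid))

# e.g. 200 equally spaced inputs; one row per training function for the VAE.
grid = np.linspace(0.0, 1.0, 200)
train_functions = gp_prior_draws(n_draws=1000, grid=grid, lengthscale=0.1)
```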

While this simple example might seem useful, inference and prediction using a VAE will not be possible if new input locations are required, or if the input locations are permuted. These limitations occur because a VAE is not a valid stochastic process in the sense of the Kolmogorov extension theorem. The Kolmogorov extension theorem guarantees that a suitably consistent and exchangeable collection of finite-dimensional distributions will define a stochastic process. Specifically, exchangeability means that, for a permutation function π(·), p(y_{1:n} | x_{1:n}) = p(y_{π(1:n)} | x_{π(1:n)}), and Kolmogorov consistency means that the probability defined on an observed set of points is the same as the probability defined on an extended sequence of points when marginalised, i.e. p(y_{1:n} | x_{1:n}) = ∫ p(y_{1:n+m} | x_{1:n+m}) dp(y_{n+1:n+m}). A VAE does not satisfy either of these conditions, as can be seen by the lack of conditioning on input locations in equation (3). In short, in a VAE the choice of input locations needs to be fixed ahead of time, so no extrapolation is possible.

To overcome these problems we introduce a new extensionto VAE called πVAE that is capable of learning a stochasticprocess class of functions.

2.2. The stochastic process prior encoding Variational Autoencoder - πVAE

To create a VAE with the ability to perform inference on a wide range of problems we have to ensure exchangeability and Kolmogorov consistency. Previous attempts to do this have relied on introducing an aggregation (typically an average) to create an order-invariant global distribution (Garnelo et al., 2018). However, as noted by Kim et al. (2019), this can lead to underfitting. For πVAE we learn a feature mapping Φ over the input space. This ensures that πVAE is a valid stochastic process (see Theorems 1 and 2).

Consider a probability space (Ω, F, P) where Ω is a sample space, F is a σ-algebra and P is a probability measure. We can define a set of locations S; e.g. for time series this is a set of time labels, and for spatial data S = R² (latitude, longitude). Consider a real-valued stochastic process denoted as {Z(s, ω) : s ∈ S, ω ∈ Ω}.

To understand this process, consider fixing each argument in turn. For a fixed s_i ∈ S, Z(s_i, ·) : Ω → R is a real-valued random variable, mapping from the sample space Ω to R. If we instead fix a particular event from the sample space ω_i ∈ Ω, then Z(·, ω_i) : S → R is a realisation (also known as a sample path) of the stochastic process, assigning a real value to every location. Here we use (s, x) to denote any arbitrary pair of (input, output) locations.

To convert a VAE into a stochastic process, instead of directly learning mappings of function values at a fixed set of locations, we first learn a feature map and then a linear mapping from this feature map to the observed values of the function. The autoencoder learns a latent probabilistic representation over the linear mappings. These two ingredients correspond to the inputs to the stochastic process Z(s, ω): the feature map means that we can evaluate our stochastic process at any s ∈ S, and the latent space corresponds to the sample space Ω.

Let us assume, without loss of generality, that we have N function draws where each function is evaluated at K different locations¹. In the encoder we transform each observed location for a function, s_i^k, where i ∈ N and k ∈ K, to a fixed feature space through a transformation function Φ(s_i^k) that is shared for all locations s ∈ S across all N functions. Φ(s) could be an explicit feature representation for an RKHS (e.g. an RBF network or a random Fourier feature basis (Rahimi & Recht, 2008)), a neural network of arbitrary construction or, as we use in the examples in this paper, a combination of both. Following this transformation, a linear basis β is used to predict function evaluations at an arbitrary set of locations s. The intuition behind these two transformations is to learn the association between locations and observations. In contrast to a standard VAE encoder that takes function evaluations as input, [z_µ, z_sd]^T = e(η_e, x_i), and encodes these to a latent space, the πVAE encoder first transforms the locations to a higher dimensional feature space via Φ, and then connects this feature space to the outputs, x, through a linear mapping, β. The πVAE decoder takes Z from the encoder and attempts to recreate β from a lower dimensional probabilistic embedding. This recreation, β̂, is then used as a linear mapping with the same Φ to get a reconstruction of the outputs, x̂. It is crucial to note that a single β vector is not learnt. Instead, for each function i ∈ N a β_i is learnt. End-to-end, πVAE is:

x_{e,i}^k = β_i^T Φ(s_i^k)
[z_µ, z_sd]^T = e(η_e, β_i)
Z ∼ N(z_µ, z_sd² I)
β̂_i = d(η_d, Z)
x̂_{d,i}^k = β̂_i^T Φ(s_i^k)    (5)


¹The number of evaluations for each function can change in our model; all equations still hold if K is conditioned on n ∈ N to be K_n.


It can be seen from equation 5 that the central task is not, as in the VAE, to recreate the input (note x at both ends of πVAE) but to reconstruct the linear parameters β that map a shared Φ onto the inputs. End-to-end learning is performed in πVAE to estimate the parameters of η_e, η_d, Φ and β. The evidence lower bound to be maximised is similar to that for the VAE, except the likelihood now has an additional term for the reconstruction via the feature transformation and linear mapping.

arg max_{η_e, η_d, Φ, β_i}  p(x_i^k | β_i, s_i^k, Φ, η_e) + p(x_i^k | Z, s_i^k, Φ, η_d) − KL(Z ‖ N(0, I))    (6)

When assuming a Gaussian likelihood, equation 6 can be simplified to the following loss function:

arg min_{η_e, η_d, Φ, β}  (x_i^k − β_i^T Φ(s_i^k))² + (x_i^k − β̂_i^T Φ(s_i^k))² + KL(Z ‖ N(0, I))    (7)

In equation 7, the first term ensures a minimum squared error of the linear mapping (β) of Φ and the input. The second term ensures a minimum squared error of the reconstructed (β̂) linear mapping of Φ and the input. The third term is a KL loss to ensure that the latent dimension is as close to N(0, I) as possible. Once again it is important to note the indices i: a different β_i is learnt for every function draw (x_i, s_i), but the same Φ is used. This means we need to learn η_e, η_d and Φ, but also N βs, one for each sample function. While this may seem like a huge computational task, K is typically quite small (< 200) and so learning can be relatively quick (dominated by the matrix multiplication of hidden layers).
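The sketch below mirrors equations (5) and (7) for a single function draw i. Here Φ is shown as a small MLP and all layer sizes are illustrative assumptions; the experiments in this paper combine an RBF layer with an MLP instead.

```python
import torch
import torch.nn as nn

class PiVAE(nn.Module):
    """Sketch of piVAE (equation 5): a shared feature map Phi, per-function linear
    weights beta_i, and a VAE over the beta_i."""
    def __init__(self, n_functions, s_dim=1, feat_dim=64, z_dim=20, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(s_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, feat_dim))           # shared feature map Phi
        self.beta = nn.Parameter(torch.randn(n_functions, feat_dim))    # one beta_i per training function
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, feat_dim))

    def forward(self, i, s):
        feats = self.phi(s)                                # Phi(s_i^k), shape (K, feat_dim)
        x_e = feats @ self.beta[i]                         # beta_i^T Phi(s_i^k)
        z_mu, z_log_sd = self.encoder(self.beta[i]).chunk(2, dim=-1)
        z_sd = torch.exp(z_log_sd)
        beta_hat = self.decoder(z_mu + z_sd * torch.randn_like(z_sd))   # reconstruction beta_hat_i
        x_d = feats @ beta_hat                             # beta_hat_i^T Phi(s_i^k)
        return x_e, x_d, z_mu, z_sd

def pivae_loss(x, x_e, x_d, z_mu, z_sd):
    # The three terms of equation (7): two squared-error terms and a KL penalty.
    kl = 0.5 * (z_mu ** 2 + z_sd ** 2 - 2 * torch.log(z_sd) - 1).sum()
    return ((x - x_e) ** 2).sum() + ((x - x_d) ** 2).sum() + kl
```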

The πVAE can be used as a generative model. Generating a function f is done by first selecting input locations s and then drawing z ∼ N(z_µ, z_sd² I). This sample z is then transformed into a function through a deterministic transformation via the decoder and Φ, f(·) := d(z)^T Φ(·).
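A minimal usage sketch of this generative step, assuming the hypothetical PiVAE model above has been trained: a fresh latent draw is pushed through the fixed decoder to give β̂, which together with Φ defines a function that can be evaluated at any new locations.

```python
import torch

# `model` is assumed to be a trained instance of the PiVAE sketch above (hypothetical).
z = torch.randn(20)                                   # latent draw; z_dim = 20 in that sketch
beta_new = model.decoder(z)                           # beta_hat = d(z)
s_new = torch.linspace(-1.0, 1.0, 500).unsqueeze(-1)  # arbitrary new input locations in S
f_new = model.phi(s_new) @ beta_new                   # f(s) = d(z)^T Phi(s)
```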

The πVAE can also be used for inference on new data pairs (s_j, y_j), where the unnormalised posterior distribution is

p(Z | d, y_j, s_j, Φ) ∝ p(y_j | d, s_j, Z, Φ) p(Z)    (8)

p(z_µ, z_sd | d, y_j, s_j, Φ) ∝ p(y_j | d, s_j, z_µ, z_sd, Φ) N(z_µ, z_sd² I)

We note that (s_j, y_j) is a set of observed data, i.e. a collection of (input, output) locations from a new, unknown function that is to be inferred. Further, the distinguishing difference between equation (3) and equation (8) is the conditioning on input locations and Φ. It is Φ that ensures πVAE is a valid stochastic process. We formally prove this below:

Theorem 1 (Exchangeability). Consider a permutation function π, locations s_1, ..., s_n, fixed and bounded functions Φ and d, and a multivariate Gaussian random variable Z. Together, these define the random function f(·) = d(Z)^T Φ(·). We claim:

p(f(s_1), ..., f(s_n)) = p(f(s_{π(1)}), ..., f(s_{π(n)}))

Proof:

p(f(s_1), ..., f(s_n)) = p(f(s_1), ..., f(s_n) | Z) p(Z)
  = ∏_{i=1}^{n} p(d(Z)^T Φ(s_i) | Z) p(Z)
  = p(f(s_{π(1)}), ..., f(s_{π(n)}) | Z) p(Z)
  = p(f(s_{π(1)}), ..., f(s_{π(n)}))

where the second line follows by the definition of f.

Theorem 2 (Consistency). Under the same conditions as Theorem 1, and given an extra location s_{n+1}, we claim:

p(f(s_1), ..., f(s_n)) = ∫ p(f(s_1), ..., f(s_{n+1})) df(s_{n+1})

Proof:

∫ p(f(s_1), ..., f(s_{n+1})) df(s_{n+1})
  = ∫ p(f(s_1), ..., f(s_{n+1}) | Z) p(Z) df(s_{n+1})
  = ∫ ∏_{i=1}^{n} p(d(Z)^T Φ(s_i) | Z) p(d(Z)^T Φ(s_{n+1}) | Z) p(Z) df(s_{n+1})
  = ∏_{i=1}^{n} p(d(Z)^T Φ(s_i) | Z) p(Z) ∫ p(d(Z)^T Φ(s_{n+1}) | Z) df(s_{n+1})
  = ∏_{i=1}^{n} p(d(Z)^T Φ(s_i) | Z) p(Z) = p(f(s_1), ..., f(s_n))

2.3. Examples

We first demonstrate the utility of our proposed πVAE model by fitting the simulated 1-D regression problem introduced in Hernandez-Lobato & Adams (2015). The training points for the dataset are created by uniformly sampling 20 inputs x between (−4, 4). The corresponding output is set as y ∼ N(x^3, 9). We fit two different variants of πVAE, representing two different prior classes of functions. The first prior produces cubic monotonic functions and the second prior is a GP with an RBF kernel and a two layer neural network. We generated 10^4 different function draws from both priors to train the respective πVAEs. One important consideration in πVAE is to choose a sufficiently expressive Φ; we used an RBF layer with trainable centres coupled with a two layer neural network with 20 hidden units each. We compare our results against 20,000 Hamiltonian Monte Carlo (HMC) samples (Neal, 1993) implemented using Stan (Carpenter et al., 2017). Details of the implementation of all models can be found in the Appendix (Anonymous, 2020).
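The Φ used here, an RBF layer with trainable centres followed by a small fully connected network, can be sketched as below; the number of centres, the bandwidth parameterisation and the output width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RBFFeatureMap(nn.Module):
    """Feature map Phi: RBF layer with trainable centres, followed by a two layer MLP."""
    def __init__(self, s_dim=1, n_centres=50, hidden=20, out_dim=20):
        super().__init__()
        self.centres = nn.Parameter(torch.randn(n_centres, s_dim))   # trainable RBF centres
        self.log_bandwidth = nn.Parameter(torch.zeros(1))            # shared, trainable bandwidth
        self.mlp = nn.Sequential(nn.Linear(n_centres, hidden), nn.Tanh(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, s):
        # s: (K, s_dim). Squared distance to every centre, Gaussian bumps, then the MLP.
        d2 = ((s.unsqueeze(1) - self.centres.unsqueeze(0)) ** 2).sum(-1)
        return self.mlp(torch.exp(-d2 / torch.exp(self.log_bandwidth)))
```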



Figure 2: The true underlying function is y ∼ N(x^3, 9). (a) πVAE trained on a class of cubic functions, (b) πVAE trained on samples from a Gaussian process with RBF kernel, and (c) a Gaussian process with RBF kernel fit with HMC.


Figure 3: Inferring the intensity of a log Gaussian Cox process. (a) compares the posterior distribution of the intensity estimated by πVAE to the true intensity function on train and test data. (b) compares the posterior mean of the cumulative integral over time estimated by πVAE to the true cumulative integral on train and test data.


Figure 2(a) presents results for πVAE with a cubic prior, Figure 2(b) with the GP+NN prior, and Figure 2(c) for a standard RBF GP with HMC. Both the mean estimates and the uncertainty from the πVAE variants are closer to the true function, and more constrained, than the ones using HMC. Importantly, πVAE with the cubic prior not only produces better point estimates but is also able to capture better uncertainty bounds. This demonstrates that πVAE can be used to incorporate domain knowledge about the functions being modelled.

In many scenarios, learning just the mapping of inputs to outputs is not sufficient, as other functional properties are required to perform useful analysis. For example, using point processes requires knowing the underlying intensity function; however, to perform inference we also need the integral of that intensity function. Calculating this integral, even in known analytical form, is very expensive. Hence, to circumvent this issue, we use πVAE to learn both the function values and their integral at the observed events. Figure 3 shows πVAE predictions for both the intensity and the integral of a simulated 1-D log Gaussian Cox process (LGCP).


To train πVAE to learn the function space of 1-D LGCPs, we first create a training set by drawing 10,000 different samples of the intensity function using an RBF kernel for the 1-D LGCP. For each drawn intensity function we choose an appropriate time horizon to sample 80 observed events (locations) from the intensity function. The πVAE is trained on the sampled 80 locations with their corresponding intensity and integral; it therefore outputs both the instantaneous intensity and the integral of the intensity. The implementation details are given in the Appendix (Anonymous, 2020). For testing, we first draw a new intensity function (1-D LGCP) using the same mechanism as in training and sample 100 events (locations). As seen in Figure 3, our estimated intensity is very close to the true intensity, and the estimated integral is close to the true integral. This example shows that the πVAE approach can be used to learn not only function evaluations but properties of functions.
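A sketch of how one such training example can be generated is given below: a log intensity is drawn from a GP on a fine grid, its cumulative integral is approximated by the trapezoid rule, and events are sampled by thinning. The kernel, time horizon and fixed-grid approximation are illustrative assumptions; the paper only specifies the high-level procedure.

```python
import numpy as np

def lgcp_training_draw(T=60.0, n_grid=2000, lengthscale=5.0, n_events=80):
    """One training example: event locations, intensity at the events, and its cumulative integral."""
    t = np.linspace(0.0, T, n_grid)
    delta = t[:, None] - t[None, :]
    K = np.exp(-(delta / lengthscale) ** 2) + 1e-6 * np.eye(n_grid)
    log_lam = np.linalg.cholesky(K) @ np.random.randn(n_grid)      # GP draw for the log intensity
    lam = np.exp(log_lam)
    Lam = np.concatenate([[0.0], np.cumsum(0.5 * (lam[1:] + lam[:-1]) * np.diff(t))])  # trapezoid integral

    # Sample events by thinning a homogeneous Poisson process with rate max(lam).
    n_prop = np.random.poisson(lam.max() * T)
    proposals = np.sort(np.random.uniform(0.0, T, n_prop))
    keep = np.random.uniform(size=n_prop) < np.interp(proposals, t, lam) / lam.max()
    events = proposals[keep][:n_events]                            # keep (up to) the first n_events locations

    return events, np.interp(events, t, lam), np.interp(events, t, Lam)
```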



Figure 4: Deviation in land surface temperature for East Africa, trained on 6,000 uniformly chosen random locations (Ton et al., 2018). Plot (a) is the actual data, (b) is our πVAE approach (testing MSE: 0.38), (c) is a full rank GP with Matérn 3/2 kernel (testing MSE: 2.47) and (d) is a low rank SPDE approximation with 1,953 basis functions (Lindgren et al., 2011) and a Matérn 3/2 kernel (testing MSE: 2.91). The πVAE not only has a substantially lower test error but captures fine scale features much better than a Gaussian process.

3. Results
Here we show applications of πVAE on three real world datasets. In our first example we use πVAE to predict the deviation in land surface temperature in East Africa (Ton et al., 2018). We have the deviation in land surface temperature for ∼89,000 locations across East Africa. Our training data consisted of 6,000 uniformly sampled locations. Temperature was predicted using only the spatial locations as inputs. Figure 4 shows the results of the ground truth (a), our πVAE (b), a full rank Gaussian process with Matérn kernel (Gardner et al., 2018), and a low rank Gauss-Markov random field (a widely used approach in the field of geostatistics) with 1,046 (1/6 of the training size) basis functions (Lindgren et al., 2011; Rue et al., 2009). We train our πVAE model on 10^7 function draws from a 2-D GP with small lengthscales between 10^{-5} and 2. Φ was set to be a Matérn layer with 1,000 centres followed by a two layer neural network with 100 hidden units. The πVAE latent dimension was set to 20. As seen in Figure 4, πVAE is able to capture small scale features and produces a far better reconstruction than both the full and low rank GPs, despite having a much smaller latent dimension of 20 vs 6,000 (full) vs 1,046 (low rank). The testing error is substantially better than the full rank GP, which begs the question: why does πVAE perform so much better than a GP, despite being trained on samples from a GP? One possible reason is that the extra hidden layers in Φ create a much richer structure that could capture elements of non-stationarity (Ton et al., 2018). Alternatively, the ability to use state-of-the-art MCMC and estimate a reliable posterior expectation might create resilience to overfitting. The training/testing error for πVAE is 0.07/0.38, while for the full rank GP it is 0.002/2.47.

Method              RMSE    NLL
Full rank GP        0.099   -0.258
πVAE                0.112    0.006
SGPR (m = 512)      0.273    0.087
SVGP (m = 1024)     0.268    0.236

Table 1: Test results for πVAE, a full rank GP, and two approximate algorithms, SGPR and SVGP, on Kin40K.

Therefore the training error is 37 times smaller for the GP, but the testing error is 6 times smaller for πVAE, suggesting that, despite marginalisation, the GP is still overfitting.

Table 1 compares πVAE on the Kin40K dataset (Schwaighofer & Tresp, 2003) to state-of-the-art full and approximate GPs, with results taken from Wang et al. (2019). The objective is to predict the distance of a robotic arm from the target given the position of all 8 links present on the robotic arm. In total we have 40,000 samples, which are divided randomly into 2/3 training samples and 1/3 test samples. We train πVAE on 10^7 functions drawn from an 8-D GP, observed at 200 locations, where each of the 8 dimensions had values drawn uniformly from the range (−2, 2) and the lengthscale varied between 10^{-3} and 10. Once πVAE was trained on the prior functions we used it to infer the posterior distribution for the training examples in Kin40K. Table 1 shows results for RMSE and negative log-likelihood (NLL) of πVAE against various GP methods on test samples. The full rank GP results reported in Wang et al. (2019) are better than those we obtain from πVAE, but we are competitive, and far better than the approximate GP methods. We also note here that the exact GP is based on maximum marginal likelihood while πVAE performs full Bayesian inference; all posterior checks were excellent, showing calibration in both uncertainty and point estimates. For detailed diagnostics see the Appendix.



Figure 5: MNIST reconstruction after observing only 10, 20, or 30% of pixels from the original data (columns); each row shows a different sample.


Finally, we apply πVAE to the task of reconstructing MNIST digits from partial observations. Similar to the earlier temperature prediction task, image completion can be seen as a regression task in 2-D: the task is to predict the intensity of pixels given the pixel locations. We first train our πVAE on the MNIST digits dataset (not partial) with a latent dimension of 40. As in the previous examples, the decoder and encoder networks are two layer neural networks. The encoder has 256 and 128 hidden units in the first and second layers respectively; the decoder reverses the encoder architecture.

Once πVAE has been trained on the MNIST training set, we use images from the test set for prediction. Test images are sampled in such a way that only 10, 20 or 30% of pixel values are first shown to πVAE to fix the parameters for a specific test function, and πVAE is then used to predict the intensity at all other pixel locations. As seen in Figure 5, the performance of πVAE improves as more pixel locations are made available during prediction, but even with 10% of pixels our model learns a decent approximation of the image. The uncertainty in prediction can be seen from the different samples produced by the model for the same data: as the number of given locations increases, the variance between samples decreases and the quality of the image increases.
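A sketch of how an image is recast as this kind of regression problem, and how a 10% subset of pixels is revealed, is given below; the coordinate normalisation and intensity scaling are illustrative assumptions.

```python
import numpy as np

def image_to_regression_pairs(img, frac_observed=0.1, rng=np.random):
    """Treat an image as 2-D regression: pixel coordinates -> intensity, revealing only a fraction."""
    h, w = img.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([rows.ravel() / (h - 1), cols.ravel() / (w - 1)], axis=1)  # normalised locations s
    values = img.ravel().astype(np.float32) / 255.0                              # pixel intensities x
    observed = rng.choice(len(values), size=int(frac_observed * len(values)), replace=False)
    return coords[observed], values[observed], coords, values
```

The observed (location, intensity) pairs condition the latent Z as in equation (8); the decoder and Φ then predict the intensity at every remaining pixel location.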

4. Discussion and Conclusion
In this paper we have proposed a novel VAE formulation of a stochastic process, with the ability to learn function classes and properties of functions. Our πVAEs typically have a small, uncorrelated latent dimension (5-50) of parameters, so Bayesian inference with MCMC is straightforward and highly effective at exploring the posterior distribution. The accurate estimation of uncertainty is essential in many areas, such as medical decision-making.

While approximate Bayesian inference has seen wide application in recent years, the accuracy of these posterior expectations is generally poor (Yao et al., 2018) (or unknown) when benchmarked against MCMC approaches. However, MCMC approaches are often unavailable in real-world problems due to poor scalability and difficulties arising from complex posteriors.

πVAE combines the power of deep learning to create high capacity function classes with tractable inference using fully Bayesian MCMC approaches. Our 1-D example in Figure 2 demonstrates that an exciting use of πVAE is to incorporate domain knowledge about the problem: monotonicity or complicated dynamics can be encoded directly into the prior (Caterini et al., 2018) on which πVAE is trained. Our log Gaussian Cox process example shows that not only functions can be modelled, but also properties of functions such as integrals. Perhaps the most surprising result is the performance of πVAE on spatial interpolation: despite being trained on samples from a Gaussian process, πVAE substantially outperforms a full rank GP. We conjecture this is due to the more complex structure of the feature representation Φ and a resilience to overfitting. The MNIST example highlights how πVAE can be used as a generative model on a much smaller subset of training data.

There are costs to using πVAE, especially the large upfront cost in training. For complex priors, training could take days or weeks and will invariably require the heuristics and parameter search inherent in applied deep learning to achieve good performance. However, once trained, a πVAE network is applicable to a wide range of problems, with the Bayesian inference MCMC step taking seconds or minutes. We will make available pretrained networks for various popular priors.

Future work should investigate the performance of πVAE in higher dimensional settings. Other stochastic processes, such as Dirichlet processes, could also be considered.


References

Anonymous. πVAE: Encoding stochastic process priors with variational autoencoders (Appendix). In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 2020.

Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pp. 1152–1174, 1974.

Betancourt, M., Byrne, S., Livingstone, S., and Girolami, M. The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 2017. ISSN 13507265. doi: 10.3150/16-BEJ810.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In 32nd International Conference on Machine Learning, ICML 2015, 2015. ISBN 9781510810587.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32, 2017. ISSN 1548-7660. doi: 10.18637/jss.v076.i01.

Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, 2018.

Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., and Gelman, A. Visualization in Bayesian workflow. Journal of the Royal Statistical Society. Series A: Statistics in Society, 2019. ISSN 1467985X. doi: 10.1111/rssa.12378.

Garnelo, M., Rosenbaum, D., Maddison, C. J., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, S. M. Conditional neural processes. In 35th International Conference on Machine Learning, ICML 2018, 2018. ISBN 9781510867963.

Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., and Wilson, A. G. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.

Hawkes, A. G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971. ISSN 00063444. doi: 10.1093/biomet/58.1.83.

Hernandez-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869, 2015.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Huggins, J. H., Kasprzak, M., Campbell, T., and Broderick, T. Practical posterior error bounds from variational objectives. arXiv preprint arXiv:1910.04102, 2019.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.

Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, S. M. A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. CoRR, abs/1901.0, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. ICLR 2014, 2014.

Kingma, D. P. and Welling, M. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

Lindgren, F., Rue, H., and Lindstrom, J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011. ISSN 13697412. doi: 10.1111/j.1467-9868.2011.00777.x.

Minka, T. P. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI '01, pp. 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001.

Møller, J., Syversveen, A. R., and Waagepetersen, R. P. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998. doi: 10.1111/1467-9469.00115.

Neal, R. Bayesian Learning for Neural Networks. Lecture Notes in Statistics, Springer-Verlag, New York, 1996. ISSN 0930-0325.

Neal, R. M. Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto, Toronto, Ontario, Canada, 1993.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006. ISBN 026218253X. URL http://www.worldcat.org/oclc/61285753.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.

Ritter, H., Botev, A., and Barber, D. A scalable Laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 2018.

Ross, S. M. Stochastic Processes, volume 2. John Wiley & Sons, 1996.

Roy, D. M. and Teh, Y. W. The Mondrian process. In Advances in Neural Information Processing Systems 21 - Proceedings of the 2008 Conference, 2009. ISBN 9781605609492.

Rue, H., Martino, S., and Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 71(2):319–392, 2009. ISSN 13697412. doi: 10.1111/j.1467-9868.2008.00700.x.

Schwaighofer, A. and Tresp, V. Transductive and inductive methods for approximate Gaussian process regression. In Advances in Neural Information Processing Systems, pp. 977–984, 2003.

Ton, J.-F., Flaxman, S., Sejdinovic, D., and Bhatt, S. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statistics, 2018. ISSN 22116753. doi: 10.1016/j.spasta.2018.02.002.

Wang, K., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. Exact Gaussian processes on a million data points. In Advances in Neural Information Processing Systems, pp. 14622–14632, 2019.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, 2011. ISBN 9781450306195.

Yao, J., Pan, W., Ghosh, S., and Doshi-Velez, F. Quality of uncertainty quantification for Bayesian neural network inference. arXiv preprint arXiv:1906.09686, 2019.

Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. Yes, but did it work?: Evaluating variational inference. In 35th International Conference on Machine Learning, ICML 2018, 2018. ISBN 9781510867963.